Open hilenamin opened 6 years ago
Although it is somewhat of a test issue, let me answer it.
Ideally you want to load CSV directly into Atomese, so as not to pay the overhead of creating a structure like Table. However, I feel you would get more done with less code if you simply reuse https://github.com/opencog/as-moses/blob/master/moses/comboreduct/table/table_io.h#L151 and then turn the table into Atomese.
The completely ideal solution would be to reuse as much as you can from loadTable (and its subroutines), possibly refactoring code when necessary, without recreating a Table. But since that table is an intermediary structure anyway, it's not much overhead, so for now just using plain loadTable should be fine. The ideal solution need only be sought if this becomes a performance-critical issue.
Nil, how are you representing tables? Again, I want to draw your attention to the (opencog matrix) module, which deals with sparse tables. As long as you represent your data as one of these:
(SomeLink (SomeRowAtom) (SomeColAtom))
and all row atoms are of the same type, etc., then the matrix API does "neat things". (It does NOT read from a file, though.) Alternatively, it also supports
(FooEvalLink
(FooPredicateNode "stuff")
(ListLink (SomeRowAtom) (SomeColAtom)))
as well. I'm currently experimenting with much more awkward row-column representations.
What the matrix (aka vector) API does is allow you to have some complicated blob of data in the atomspace and to declare that some subset of it looks "just like a matrix" or "just like a table". It then implements a bunch of generic table/matrix methods on it (currently conditional probabilities, mutual information, cosine and Jaccard distances, etc.). It doesn't matter what the actual atoms really are, because all the algos just use the definition of the matrix to find the right atoms.
It would be nice if, in some hazy future, MOSES would work the same way. It's probably too early for this right now, but it's an idea. (It would be nice to port the matrix API to C++, for speed, and to port it to "R", so that Mike and the biology guys could examine matrix-like slices of the atomspace in R. But that's a different project.)
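To make the "declare a subset to be a matrix" idea concrete, here is a minimal Python sketch. It is not the (opencog matrix) API: the `MatrixView` class, its accessor arguments, the tuple encoding of atoms and the numeric value attached to each pair are all assumptions made for illustration. It only shows the pattern of writing generic methods (cosine here) against a declared pair definition rather than against concrete atom types.

```python
import math
from collections import defaultdict

# Hypothetical stand-in for atomspace contents: (link_type, row, col, value).
# The numeric value per pair is an assumption; the thread does not say where it lives.
atoms = [
    ("SomeLink", "r1", "c1", 1.0),
    ("SomeLink", "r1", "c2", 2.0),
    ("SomeLink", "r2", "c1", 2.0),
    ("SomeLink", "r2", "c2", 4.0),
    ("OtherLink", "r1", "c3", 9.0),  # unrelated data, not part of the matrix
]


class MatrixView:
    """Declares which atoms are matrix entries and how to read them."""

    def __init__(self, atoms, is_pair, get_row, get_col, get_value):
        self.rows = defaultdict(dict)  # row -> {col: value}, sparse
        for atom in atoms:
            if is_pair(atom):
                self.rows[get_row(atom)][get_col(atom)] = get_value(atom)

    def cosine(self, row_a, row_b):
        """Cosine similarity between two sparse rows (both must be non-empty)."""
        a, b = self.rows[row_a], self.rows[row_b]
        dot = sum(v * b[c] for c, v in a.items() if c in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b)


view = MatrixView(
    atoms,
    is_pair=lambda a: a[0] == "SomeLink",
    get_row=lambda a: a[1],
    get_col=lambda a: a[2],
    get_value=lambda a: a[3],
)
similarity = view.cosine("r1", "r2")
```

Swapping the four accessors is all it takes to point the same `cosine` method at a different pair shape, such as the FooEvalLink form.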
I read the README, but it wasn't clear to me how to incorporate that data into Atomese evaluation, for instance evaluating
(Plus (Schema "f1") (Schema "f2"))
where (Schema "f1") and (Schema "f2") would be two columns.
What I've been thinking, though, is to have a column stored as a list of values, such as a FloatValue. This would make it easy to associate columns with programs and to use this information as a memoization mechanism. So for instance one could store the result of (Plus (Schema "f1") (Schema "f2")) in a column, then when it's time to evaluate
(Times
(Plus (Schema "f1") (Schema "f2"))
(Schema "f3"))
it could reuse this column to avoid re-evaluating f1 + f2.
Having said that, I'm not terribly concerned about efficiency at this point; I just want to move in a direction that would foster "holistic" cognitive integration, like reasoning on programs, fitness functions and data.
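The column-memoization idea can be sketched in plain Python as follows. This is only an illustration under assumptions: programs are encoded as nested tuples rather than atoms, columns are Python lists rather than FloatValue, and the names `columns`, `cache` and `evaluate` are hypothetical.

```python
import operator

# Hypothetical input columns, keyed by feature name.
columns = {
    "f1": [1.0, 2.0, 3.0],
    "f2": [10.0, 20.0, 30.0],
    "f3": [2.0, 2.0, 2.0],
}

OPS = {"Plus": operator.add, "Times": operator.mul}

cache = {}        # program -> column of results, one value per row
evaluations = []  # programs that were actually computed (not served from cache)


def evaluate(program):
    """Evaluate a program over every row, reusing previously stored columns.

    A program is either ("Schema", name) or (op, left, right).
    """
    if program in cache:
        return cache[program]
    if program[0] == "Schema":
        result = columns[program[1]]
    else:
        op, left, right = program
        left_col, right_col = evaluate(left), evaluate(right)
        result = [OPS[op](x, y) for x, y in zip(left_col, right_col)]
    evaluations.append(program)
    cache[program] = result
    return result


f1_plus_f2 = ("Plus", ("Schema", "f1"), ("Schema", "f2"))
evaluate(f1_plus_f2)  # computed once and stored as a column
product = evaluate(("Times", f1_plus_f2, ("Schema", "f3")))  # reuses that column
```

The second call never recomputes f1 + f2. In the atomspace, the cached column would presumably be attached to the program atom itself instead of living in a side dictionary.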
@ngeiswei Assuming that we're going to implement this using option 2 (i.e. by converting Table to its Atomese counterpart): as described at https://github.com/opencog/as-moses/blob/master/moses/comboreduct/table/table.h#L911, Table consists of an output table (OTable) of one column and an input table (ITable) consisting of the independent variables.
Typed data table. The table consists of an ITable of inputs (independent variables), an OTable holding the output (the dependent variable), and a type tree identifying the types of the inputs and outputs.
Our current representation in #3, by contrast, doesn't hold the output and input data separately. Do we need to change it to handle that?
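For reference, the input/output separation can be sketched in Python as below. This is not the as-moses Table, ITable or OTable code: the `SplitTable` dataclass, the `split_table` function and the sample CSV are hypothetical, and the type tree is left out entirely.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class SplitTable:
    output_label: str
    output: list        # one value per row (the dependent variable)
    input_labels: list
    inputs: list        # one list of values per row (independent variables)


def split_table(csv_text, target):
    """Split a CSV with a header row into an output column and input columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    t = header.index(target)
    rows = [row for row in reader if row]  # skip blank lines
    return SplitTable(
        output_label=target,
        output=[row[t] for row in rows],
        input_labels=[h for i, h in enumerate(header) if i != t],
        inputs=[[v for i, v in enumerate(row) if i != t] for row in rows],
    )


# Small made-up dataset; cell values are kept as strings, untyped.
table = split_table("i1,i2,o\n0,1,1\n1,0,1\n0,0,0\n", target="o")
```

Keeping both parts in one object means either representation (separate or merged) can be generated from the same intermediate structure.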
@Yidnekachew I'm not sure what is best at this point. I suppose you may separate output and input data, like Table for now.
The other problem I'm seeing is that Boolean tables don't have any compact representation offered in #3. Either we come up with one, or you use the unfolded representation, such as
(Evaluation (stv 0 1)
(Predicate "i1")
(Node "r1"))
(Evaluation (stv 1 1)
(Predicate "i2")
(Node "r1"))
(Evaluation (stv 1 1)
(Predicate "o")
(Node "r1"))
which has the "advantage" of forcing us to experiment with both representations and weigh their pros and cons, maybe.
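A minimal Python sketch of generating the unfolded Boolean representation from CSV text follows. It emits Atomese as plain strings rather than creating atoms, and makes several assumptions: the function name `unfold_boolean` is hypothetical, rows are named "r1", "r2", ... by position, and a cell counts as true only if it is exactly "1".

```python
import csv
import io


def unfold_boolean(csv_text):
    """Return one Evaluation string per cell of a Boolean CSV table."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    data_rows = (row for row in reader if row)  # skip blank lines
    atoms = []
    for n, row in enumerate(data_rows, start=1):
        for label, cell in zip(header, row):
            # Strength 1 for true, 0 for false; confidence is fixed at 1.
            strength = 1 if cell.strip() == "1" else 0
            atoms.append(
                f'(Evaluation (stv {strength} 1) '
                f'(Predicate "{label}") (Node "r{n}"))'
            )
    return atoms


atoms = unfold_boolean("i1,i2,o\n0,1,1\n")
```

This yields one atom per cell, which is why the representation is considerably larger than a compact one for the same table.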
@ngeiswei If we're doing it both ways, a dataset like this
i1,i2,o
0,1,1
1,0,1
0,0,0
is going to be represented as:
For the Boolean type,
Using input and output table
(List
(Evaluation (stv 1 1) (Predicate "o") (Node "r1"))
(Evaluation (stv 1 1) (Predicate "o") (Node "r2"))
(Evaluation (stv 0 1) (Predicate "o") (Node "r3")))
(List
(Evaluation (stv 0 1) (Predicate "i1") (Node "r1"))
(Evaluation (stv 1 1) (Predicate "i2") (Node "r1"))
(Evaluation (stv 1 1) (Predicate "i1") (Node "r2"))
(Evaluation (stv 0 1) (Predicate "i2") (Node "r2"))
..
)
Using the unfolded table
(Evaluation (stv 0 1) (Predicate "i1") (Node "r1"))
(Evaluation (stv 1 1) (Predicate "i2") (Node "r1"))
(Evaluation (stv 1 1) (Predicate "o") (Node "r1"))
..
For the Real type,
Using input and output table
(Similarity (stv 1 1)
(List (Schema "i1") (Schema "i2"))
(Set
(List (Node "r1") (List (Number 0) (Number 1)))
(List (Node "r2") (List (Number 1) (Number 0)))
(List (Node "r3") (List (Number 0) (Number 0)))))
(Similarity (stv 1 1)
(List (Schema "o"))
(Set
(List (Node "r1") (Number 1))
(List (Node "r2") (Number 1))
(List (Node "r3") (Number 0))))
Using the compact format
(Similarity (stv 1 1)
(List (Schema "i1") (Schema "i2") (Schema "o"))
(Set
(List (Node "r1") (List (Number 0) (Number 1) (Number 1)))
(List (Node "r2") (List (Number 1) (Number 0) (Number 1)))
(List (Node "r3") (List (Number 0) (Number 0) (Number 0)))))
Am I right?
I will also need to check whether Table holds the column labels.
That's correct, @Yidnekachew. Table does hold the column labels, and the ITable and OTable each hold their own labels as well.
BTW, it's better if the first feature is the output (as that is MOSES' default assumption; I've corrected #3 accordingly).
Other representations to consider would be
(List
(List (Schema "o") (Schema "i1") (Schema "i2"))
(List (Number 1) (Number 0) (Number 1))
(List (Number 1) (Number 1) (Number 0))
(List (Number 0) (Number 0) (Number 0)))
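This one is probably the most compact and doesn't need to introduce row nodes. Its drawback is that it has no self-contained semantics: nothing in the structure itself says that the first inner list is a header and the others are rows.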
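As a rough illustration, this list-of-lists form could be produced from CSV text with a Python sketch like the one below. It builds the Atomese as a string, moves the output column first, and passes cell text through unchanged into Number. The function name `compact_table` and the sample data are assumptions.

```python
import csv
import io


def compact_table(csv_text, target):
    """Render a CSV table as a List of Lists, with the target column first."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    t = header.index(target)
    # Column order: output first, then the inputs in their original order.
    order = [t] + [i for i in range(len(header)) if i != t]
    lines = [
        "(List " + " ".join(f'(Schema "{header[i]}")' for i in order) + ")"
    ]
    for row in reader:
        if row:  # skip blank lines
            cells = " ".join(f"(Number {row[i]})" for i in order)
            lines.append("(List " + cells + ")")
    return "(List\n  " + "\n  ".join(lines) + ")"


atomese = compact_table("i1,i2,o\n0,1,1\n1,0,1\n0,0,0\n", target="o")
```

Because there are no row nodes, a row can only be identified by its position in the outer List.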
Another option, to avoid having 2 distinct representations for Boolean and numerical data, is to use TrueLink http://wiki.opencog.org/w/TrueLink and FalseLink http://wiki.opencog.org/w/FalseLink.
I'm also thinking of another representation that may have the advantage of the one above (i.e. it doesn't introduce row nodes) yet is semantically self-contained. I'll come back to that later.
Meanwhile, here's my suggestion: since we are more or less stepping into the unknown (as far as I am concerned, I don't have a clear-cut idea of what is going to be best), I suggest you implement all options.
Obviously, one way to make the compact table representation sort of semantically self-contained is to wrap it in an "AS-MOSES:table" predicate or something, like
(Evaluation
(Predicate "AS-MOSES:table")
(List
(List (Schema "o") (Schema "i1") (Schema "i2"))
(List (Number 1) (Number 0) (Number 1))
(List (Number 1) (Number 1) (Number 0))
(List (Number 0) (Number 0) (Number 0))))
That requires subsequent transformations to reason about it, the axiomatization of (Predicate "AS-MOSES:table"), etc., but it's worth considering as well.
Question: which approach would be preferred?