Loading data to AtomSpace

hilenamin commented 6 years ago

Question: which approach would be preferred?

[ ] load csv directly into atomese
[x] load the data into a table, which can then be converted to atomese.

ngeiswei commented 6 years ago

Although this is somewhat of a test issue, let me answer it.

Ideally you want to load the CSV directly into Atomese, to avoid paying the overhead of creating a structure like Table. However, I feel you would get more done with less code if you simply reuse https://github.com/opencog/as-moses/blob/master/moses/comboreduct/table/table_io.h#L151 and then turn the table into Atomese.

The completely ideal solution would be to reuse as much as you can from loadTable (and its subroutines), possibly refactoring code when necessary, without creating a Table. But since that table is an intermediary structure anyway, it's not much overhead, so for now just using plain loadTable should be fine. The ideal solution only needs to be pursued if this becomes a performance-critical issue.
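To make the two-step approach concrete, here is a minimal Python sketch of the "load into a table first" step. The `Table` class and `load_table` function are hypothetical stand-ins written for illustration; they are not the as-moses `Table` or `loadTable` API, and values are kept as raw strings.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical stand-in for an intermediary table (not the as-moses class)."""
    input_labels: list
    output_label: str
    inputs: list   # one list of raw string values per row
    outputs: list  # one raw string value per row


def load_table(text, output_label):
    """Parse CSV text with a header row and split off the output column."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    out = header.index(output_label)
    input_labels = [h for i, h in enumerate(header) if i != out]
    inputs = [[v for i, v in enumerate(row) if i != out] for row in body]
    outputs = [row[out] for row in body]
    return Table(input_labels, output_label, inputs, outputs)


table = load_table("i1,i2,o\n0,1,1\n1,0,1\n0,0,0\n", "o")
print(table.input_labels, table.outputs)
```

A second pass would then walk `table` and emit atoms, which is where the choice of Atomese representation comes in.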

linas commented 6 years ago

Nil, how are you representing tables? Again, I want to draw your attention to the module (opencog matrix) which deals with sparse tables. As long as you represent your data as one of these:

(SomeLink (SomeRowAtom) (SomeColAtom))

and the types of all row atoms are the same type, etc., then the matrix API does "neat things". (It does NOT read from a file, though.) Alternatively, it also supports

(FooEvalLink
   (FooPredicateNode "stuff")
   (ListLink (SomeRowAtom) (SomeColAtom)))

as well. I'm currently experimenting with much more awkward row-column representations.

What the matrix (aka vector) API does is allow you to have some complicated blob of data in the atomspace, and to declare that some subset of it looks "just like a matrix" or "just like a table". It then implements a bunch of generic table/matrix methods on it (currently conditional probabilities, mutual information, cosine and Jaccard distances, etc.). It doesn't matter what the actual atoms really are, because all the algorithms just use the definition of the matrix to find the right atoms.
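As a rough illustration of the idea (not of the opencog matrix API itself, which is not reproduced here), the following Python sketch treats a flat collection of (row, column, count) triples as a sparse matrix and computes the cosine similarity between two rows. The triples, the counts, and the function names are all assumptions made for the example.

```python
import math
from collections import defaultdict

# Hypothetical data: each triple stands in for an atom of the form
# (SomeLink (SomeRowAtom) (SomeColAtom)) with a count attached to it.
pairs = [
    ("r1", "c1", 2.0),
    ("r1", "c2", 1.0),
    ("r2", "c1", 4.0),
    ("r2", "c2", 2.0),
    ("r3", "c3", 5.0),
]


def rows_as_vectors(pairs):
    """Group the triples into one sparse vector (a dict) per row."""
    rows = defaultdict(dict)
    for row, col, count in pairs:
        rows[row][col] = count
    return rows


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


rows = rows_as_vectors(pairs)
print(cosine(rows["r1"], rows["r2"]), cosine(rows["r1"], rows["r3"]))
```

The point of the sketch is that `cosine` never looks at what the rows and columns actually are; it only needs a way to enumerate the pairs.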

It would be nice if, in some hazy future, MOSES would work the same way. It's probably too early for this right now, but it's an idea. (It would be nice to port the matrix API to C++, for speed, and to port it to "R", so that Mike and the biology guys could examine matrix-like slices of the atomspace in R. But that's a different project.)

ngeiswei commented 6 years ago

I read the README, but it wasn't clear to me how to incorporate that data in the Atomese evaluation, for instance when evaluating

(Plus (Schema "f1") (Schema "f2"))

where (Schema "f1") and (Schema "f2") would be two columns.

What I've been thinking, though, is to have a column stored as a list of values, such as a FloatValue. This would make it easy to associate columns with programs, and to use this information as a memoization mechanism. So for instance one could store the result of (Plus (Schema "f1") (Schema "f2")) in a column, then when it's time to evaluate

(Times
  (Plus (Schema "f1") (Schema "f2"))
  (Schema "f3"))

it could reuse this column to avoid re-evaluating f1 + f2.
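A minimal Python sketch of that memoization idea follows. Programs are modelled as nested tuples and columns as plain lists of floats; the column data, the `OPS` table, and the `evaluate` function are assumptions made for illustration and say nothing about how values would actually be attached to atoms.

```python
# Hypothetical column data for the three features.
columns = {
    "f1": [1.0, 2.0, 3.0],
    "f2": [10.0, 20.0, 30.0],
    "f3": [2.0, 2.0, 2.0],
}

OPS = {"Plus": lambda a, b: a + b, "Times": lambda a, b: a * b}

cache = {}        # program -> column of results
evaluations = []  # programs that were actually computed, in order


def evaluate(program):
    """Evaluate a program column-wise, reusing any cached sub-result."""
    if program in cache:
        return cache[program]
    head = program[0]
    if head == "Schema":
        result = columns[program[1]]
    else:
        left, right = (evaluate(arg) for arg in program[1:])
        result = [OPS[head](a, b) for a, b in zip(left, right)]
        evaluations.append(program)
    cache[program] = result
    return result


plus = ("Plus", ("Schema", "f1"), ("Schema", "f2"))
times = ("Times", plus, ("Schema", "f3"))

evaluate(plus)              # computes and stores the f1 + f2 column
product = evaluate(times)   # reuses the stored column
print(product, len(evaluations))
```

Because the cache is keyed by the program itself, the second call finds the `plus` column already stored and only computes the multiplication.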

Having said that, I'm not terribly concerned about efficiency at this point, I just want to attempt to move towards a direction that would foster "holistic" cognitive integration, like reasoning on programs, fitness functions and data.

Yidnekachew commented 6 years ago

@ngeiswei Assuming that we're going to implement this using option 2 (i.e. by converting Table to its Atomese counterpart): as described at https://github.com/opencog/as-moses/blob/master/moses/comboreduct/table/table.h#L911, Table consists of an output table (OTable) of one column and an input table (ITable) consisting of the independent variables.

Typed data table. The table consists of an ITable of inputs (independent variables), an OTable holding the output (the dependent variable), and a type tree identifying the types of the inputs and outputs.

Our current representation in #3, however, doesn't hold the output and input data separately. Do we need to change it to handle that?

ngeiswei commented 6 years ago

@Yidnekachew I'm not sure what is best at this point. I suppose you may separate output and input data, like Table for now.

The other problem I'm seeing is that Boolean tables don't have any compact representation offered in #3. Either we come up with one, or you use the unfolded representation, such as

(Evaluation (stv 0 1)
  (Predicate "i1")
  (Node "r1"))
(Evaluation (stv 1 1)
  (Predicate "i2")
  (Node "r1"))
(Evaluation (stv 1 1)
  (Predicate "o")
  (Node "r1"))

which has the "advantage" of forcing us to experiment with both representations and weigh their pros and cons, maybe.
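The unfolded representation above is regular enough to generate mechanically. Here is a small Python sketch that emits one Evaluation per cell as text; the function name, the `r1`, `r2`, ... row naming, and the treatment of any value other than `1` as false are assumptions made for the example.

```python
import csv
import io


def unfolded_boolean(text):
    """Return one Evaluation string per (row, column) cell of a Boolean CSV."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    atoms = []
    for n, row in enumerate(body, start=1):
        for label, value in zip(header, row):
            # Assumption: "1" means true, anything else means false.
            strength = 1 if value.strip() == "1" else 0
            atoms.append(
                f'(Evaluation (stv {strength} 1)\n'
                f'  (Predicate "{label}")\n'
                f'  (Node "r{n}"))'
            )
    return atoms


atoms = unfolded_boolean("i1,i2,o\n0,1,1\n")
print("\n".join(atoms))
```

The output grows with the number of cells, which is the cost being weighed against the compact format.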

Yidnekachew commented 6 years ago

@ngeiswei If we're doing it both ways, a dataset like this

i1,i2,o
0,1,1
1,0,1
0,0,0

is going to be represented as:

For the Boolean type,

Using input and output table

(List 
 (Evaluation (stv 1 1) (Predicate "o") (Node "r1"))
 (Evaluation (stv 1 1) (Predicate "o") (Node "r2"))
 (Evaluation (stv 0 1) (Predicate "o") (Node "r3")))

(List 
 (Evaluation (stv 0 1) (Predicate "i1") (Node "r1"))
 (Evaluation (stv 1 1) (Predicate "i2") (Node "r1"))
 (Evaluation (stv 1 1) (Predicate "i1") (Node "r2"))
 (Evaluation (stv 0 1) (Predicate "i2") (Node "r2"))
 ..
)

Using the unfolded table

(Evaluation (stv 0 1) (Predicate "i1") (Node "r1"))
(Evaluation (stv 1 1) (Predicate "i2") (Node "r1"))
(Evaluation (stv 1 1) (Predicate "o") (Node "r1"))
..

For the Real type,

Using input and output table

(Similarity (stv 1 1)
  (List (Schema "i1") (Schema "i2"))
  (Set
    (List (Node "r1") (List (Number 0) (Number 1)))
    (List (Node "r2") (List (Number 1) (Number 0)))
    (List (Node "r3") (List (Number 0) (Number 0)))))

(Similarity (stv 1 1)
  (List (Schema "o"))
  (Set
    (List (Node "r1") (Number 1))
    (List (Node "r2") (Number 1))
    (List (Node "r3") (Number 0))))

Using the compact format

(Similarity (stv 1 1)
  (List (Schema "i1") (Schema "i2") (Schema "o"))
  (Set
    (List (Node "r1") (List (Number 0) (Number 1) (Number 1)))
    (List (Node "r2") (List (Number 1) (Number 0) (Number 1)))
    (List (Node "r3") (List (Number 0) (Number 0) (Number 0)))))

Am I right?

I will also need to check whether Table holds the column labels.

ngeiswei commented 6 years ago

That's correct, @Yidnekachew. Table does hold the column labels, as well as the ITable and OTable, which themselves hold their labels.
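For reference, the compact format for real-valued data confirmed above can be produced from the CSV with a few lines of Python. This is only a text-generating sketch: `compact_table` is a hypothetical helper, rows are named `r1`, `r2`, ... by position, and the values are copied into Number nodes without any validation.

```python
import csv
import io


def compact_table(text):
    """Render a CSV (with header row) as a compact Similarity expression."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    schemas = " ".join(f'(Schema "{label}")' for label in header)
    lines = ["(Similarity (stv 1 1)", f"  (List {schemas})", "  (Set"]
    for n, row in enumerate(body, start=1):
        numbers = " ".join(f"(Number {value})" for value in row)
        lines.append(f'    (List (Node "r{n}") (List {numbers}))')
    # Close the Set and the Similarity link.
    return "\n".join(lines) + "))"


atomese = compact_table("i1,i2,o\n0,1,1\n1,0,1\n0,0,0\n")
print(atomese)
```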

ngeiswei commented 6 years ago

BTW, it's better if the first feature is the output (that is MOSES' default assumption; I've corrected #3 accordingly).

ngeiswei commented 6 years ago

Other representations to consider would be

(List
  (List (Schema "o") (Schema "i1") (Schema "i2"))
  (List (Number 1) (Number 0) (Number 1))
  (List (Number 1) (Number 1) (Number 0))
  (List (Number 0) (Number 0) (Number 0)))

This one is probably the most compact and doesn't need to introduce row nodes. Its drawback is that it has no self-contained semantics.
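A Python sketch of how this row-node-free layout could be generated, with the output column moved to the front, might look as follows. `list_table` is a hypothetical helper that only produces text, and the name of the output column is passed in explicitly as an assumption of the example.

```python
import csv
import io


def list_table(text, output_label):
    """Render a CSV as a List of Lists, with the output column first."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    out = header.index(output_label)
    # Column order: output first, then the inputs in their original order.
    order = [out] + [i for i in range(len(header)) if i != out]
    schemas = " ".join(f'(Schema "{header[i]}")' for i in order)
    lines = ["(List", f"  (List {schemas})"]
    for row in body:
        numbers = " ".join(f"(Number {row[i]})" for i in order)
        lines.append(f"  (List {numbers})")
    return "\n".join(lines) + ")"


listing = list_table("i1,i2,o\n0,1,1\n1,0,1\n0,0,0\n", "o")
print(listing)
```

Row identity is carried only by position here, which is what makes the layout compact and also what leaves it without self-contained semantics.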

Also, another option, to avoid having two distinct representations for Boolean and numerical data, is to use TrueLink http://wiki.opencog.org/w/TrueLink and FalseLink http://wiki.opencog.org/w/FalseLink.

I'm perhaps thinking of another representation that may have the advantage of the one above (i.e. it doesn't introduce row nodes) yet is semantically self-contained. I'll come back to that later.

Meanwhile, here's my suggestion: since we're more or less stepping into the unknown (as far as I am concerned, I don't have a clear-cut idea of what is going to be best), I suggest you implement all options.

ngeiswei commented 6 years ago

Obviously, one option to make the compact table representation sort of semantically self-contained is to wrap it in an "AS-MOSES:table" predicate or something, like

(Evaluation
  (Predicate "AS-MOSES:table")
  (List
    (List (Schema "o") (Schema "i1") (Schema "i2"))
    (List (Number 1) (Number 0) (Number 1))
    (List (Number 1) (Number 1) (Number 0))
    (List (Number 0) (Number 0) (Number 0))))

That requires subsequent transformations to reason about it, the axiomatization of (Predicate "AS-MOSES:table"), etc., but it's worth considering as well.
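Mechanically, the wrapping itself is a trivial step on top of the plain List layout. The Python sketch below does it at the text level; `wrap_table` is a hypothetical helper, and the predicate name is simply the one suggested above.

```python
import textwrap

# The plain List-of-Lists table, written out literally for the example.
table_list = (
    '(List\n'
    '  (List (Schema "o") (Schema "i1") (Schema "i2"))\n'
    '  (List (Number 1) (Number 0) (Number 1))\n'
    '  (List (Number 1) (Number 1) (Number 0))\n'
    '  (List (Number 0) (Number 0) (Number 0)))'
)


def wrap_table(list_atomese, predicate="AS-MOSES:table"):
    """Wrap a List expression in an Evaluation of the given predicate."""
    body = textwrap.indent(list_atomese, "  ")
    return f'(Evaluation\n  (Predicate "{predicate}")\n{body})'


wrapped = wrap_table(table_list)
print(wrapped)
```

The harder part, axiomatizing what the predicate means so that it can be reasoned about, is not addressed by this sketch.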

opencog / asmoses

Loading data to AtomSpace #12