Load CSV tables into AtomSpace

linas commented 2 years ago

This provides the ability to load plain-text tables (comma-separated values, tab-separated values) into the AtomSpace.

The format allows Atomese programs to act on the columns of the table (add, subtract, etc.)

This is one of the important capabilities needed by old-style as-moses.

linas commented 2 years ago

BTW, @ngeiswei @Habush @kasimebrahim @Yidnekachew @Bitseat @Eman @behailu04 I'd like to bring your attention to the brand-new demo examples/atomspace/table.scm It does two things: it shows you how to load a CSV/TSV table into the atomspace -- this is now a "core function" in the atomspace. Next, it shows how to write functions that act on the table, and how to write scoring functions, in pure atomese. All this works outside of AS-MOSES.

The main difference here is that the main atomspace evaluator is used, instead of the as-moses evaluator. That means that all the functions from the AtomSpace are supported, and not just some of them. The functions look very similar to the as-moses atomese/combo trees; they're only a little bit different. There's an extra ValueOf link used to fish out the data from the columns. Otherwise, it's more or less identical.
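As a rough illustration of the column-oriented layout described above, here is a minimal sketch in plain Python. It is not the AtomSpace API: the names `load_columns` and `score` and the sample data are hypothetical. Each column becomes one vector, a candidate function combines columns elementwise, and a scoring function compares the result against a target column.

```python
import csv
import io

def load_columns(text, delimiter=","):
    """Parse delimited text with a header row into {column name: list of floats}."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            columns[name].append(float(cell))
    return columns

def score(predicted, target):
    """Sum of squared differences between two vectors (lower is better)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

# Hypothetical sample table; a TSV file would use delimiter="\t".
SAMPLE = "a,b,target\n1,2,3\n4,5,9\n0.5,0.25,1\n"
table = load_columns(SAMPLE)

# Candidate "program": add column a to column b, elementwise.
candidate = [x + y for x, y in zip(table["a"], table["b"])]
print(score(candidate, table["target"]))
```

In the demo itself the same roles are played by Atomese functions evaluated by the AtomSpace, with ValueOf fetching each column.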

This opens the possibility for applying moses algos to non-table data, including video and audio data, or any kind of streaming data, or complex data sources. The Value system allows data to flow in from anywhere, in any way. The AS-MOSES system can then explore different kinds of mutations applied to data processing pipelines. I'm getting ready to tackle some of these data sources.

Anyway, thanks for your work in as-moses. It's not been in vain. The future is bright, methinks.

Bitseat commented 2 years ago

Hi Linas,

It is really great to hear the news and also great to hear from you. :) Congratulations on the big achievement and I thank you for the recognition.

Kind regards, Bitseat

On Sun, Aug 21, 2022 at 1:32 PM Linas Vepštas @.***> wrote:

Merged #2989 https://github.com/opencog/atomspace/pull/2989 into master.


kasimebrahim commented 2 years ago

That is impressive @linas! And thank you for keeping us in the loop.

mjsduncan commented 2 years ago

this is very cool, thanks linas! how hard would it be to expand this to import sql db dumps and reproduce the relationships of the connecting keys between tables?

linas commented 2 years ago

Hi @mjsduncan -- not hard. Not easy. It depends.

Let me start with a question. Do you want the SQL data as Values, or as Atoms? The CSV mapping puts an entire column of a table into a single vector value (because this is "natural" for moses). An alternative would have been to take each row of the table, and convert it into a (EvaluationLink (PredicateNode "my CSV table") (List (Concept "...") (DateNode "...") (NumberNode ...)))

The nice thing about using vectors is that they're fast, compact, uniform. The bad thing is they're not searchable. By contrast, you can search (pattern-match) the EvaluationLinks; but they're slower, bulkier.
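To make the trade-off concrete, here is a small sketch in plain Python (not AtomSpace code). Nested tuples stand in for EvaluationLinks, and `ANY` is a hypothetical wildcard standing in for a pattern variable. The column layout stores one vector per column, while the per-row layout can be searched structurally.

```python
# Column layout: one vector per column. Compact, but there is no per-row object to search.
columns = {
    "name": ["alice", "bob", "carol"],
    "year": [2020, 2021, 2020],
}

def rows_as_links(table_name, columns):
    """Turn column vectors into one nested tuple per row, shaped like an EvaluationLink."""
    names = list(columns)
    n = len(columns[names[0]])
    return [
        ("EvaluationLink", ("PredicateNode", table_name),
         ("List",) + tuple(columns[c][i] for c in names))
        for i in range(n)
    ]

ANY = object()  # wildcard, standing in for a pattern variable

def matches(pattern, term):
    """Structural match of nested tuples; ANY matches anything."""
    if pattern is ANY:
        return True
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        return len(pattern) == len(term) and all(
            matches(p, t) for p, t in zip(pattern, term))
    return pattern == term

links = rows_as_links("my CSV table", columns)
query = ("EvaluationLink", ("PredicateNode", "my CSV table"), ("List", ANY, 2020))
hits = [link for link in links if matches(query, link)]
print(hits)
```

The per-row form costs one object per row, which is the "slower, bulkier" side of the trade-off.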

Long ago, I came up with this idea, never implemented. Tell me what you think. It goes like this:

* There would be a mapping, from an SQL table, to some AtomSpace structure. So for example SQL TABLE Foo (Name STRING, Date TIME, Location INT) would map to (EvaluationLink (PredicateNode "Foo") (List (Concept "...") (DateNode "...") (NumberNode ...))) The mapping is user-specified, so it would not have to be an EvaluationLink; it could be whatever you want.
* A connection would be made to a running SQL server. So, instead of working from a dump, you could work with live data in a live DB. So, basically, you'd be "mirroring" the SQL data in the AtomSpace. Not only reading it, but if there are changes, updates, these changes would be written out to the DB. Doesn't even have to be SQL, could be "any" data source.

I don't know if you're interested in the second bullet or not. If you're working with biology databases, then maybe working from dumps is all you want. Maybe the live data connection isn't needed. The live data connection is trickier, harder and more fragile.

One "hard part" is coming up with a generic way of allowing the user to specify what the table-to-atomese mapping is. I've got ideas for this (See wiki page for SignatureLink...) but it would take some polishing to get it right.
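One possible shape for a user-specified table-to-Atomese mapping is sketched below in Python, using an in-memory SQLite database as a stand-in for a real SQL server. Everything here is an assumption for illustration: the function `row_to_atomese`, the list-of-node-types mapping format, and the sample rows are hypothetical, and the output is just s-expression text rather than live Atoms. It does not use SignatureLink.

```python
import sqlite3

def row_to_atomese(predicate, row, node_types):
    """Render one SQL row as an EvaluationLink s-expression string.

    node_types is the user-supplied mapping: one node type per column, in order.
    Values are inserted verbatim; quote escaping is not handled in this sketch.
    """
    parts = []
    for node_type, value in zip(node_types, row):
        if node_type == "NumberNode":
            parts.append(f"(NumberNode {value})")
        else:
            parts.append(f'({node_type} "{value}")')
    return f'(EvaluationLink (PredicateNode "{predicate}") (List {" ".join(parts)}))'

# In-memory stand-in for the SQL TABLE Foo from the discussion.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foo (Name TEXT, Date TEXT, Location INTEGER)")
conn.executemany("INSERT INTO Foo VALUES (?, ?, ?)",
                 [("alice", "2022-08-21", 42), ("bob", "2022-08-22", 7)])

mapping = ["Concept", "DateNode", "NumberNode"]  # hypothetical user-specified mapping
atoms = [row_to_atomese("Foo", row, mapping)
         for row in conn.execute("SELECT Name, Date, Location FROM Foo ORDER BY Name")]
conn.close()

for atom in atoms:
    print(atom)
```

A different mapping list, or a different rendering function, would produce a different AtomSpace structure from the same table.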

Excuse me. As I write the above, I just realized there are two easy tricks... Just click here. One trick is to create a TableValue and it would take all (EvaluationLink (PredicateNode "my CSV table")...) and return corresponding FloatValue vectors for each column. Then there could also be the inverse: some ExportTableLink, which, given vectors, would create a brand-new EvaluationLink for each row.
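The two tricks amount to a pair of inverse conversions. The Python sketch below is only a model of that idea: `links_to_columns` plays the role of the proposed TableValue, `columns_to_links` plays the role of the proposed ExportTableLink, and nested tuples stand in for Atoms. Neither function name exists in the AtomSpace.

```python
def links_to_columns(links):
    """Rows to columns: gather the i-th element of every row into the i-th vector."""
    rows = [link[2][1:] for link in links]  # drop the leading "List" tag
    return [list(col) for col in zip(*rows)]

def columns_to_links(table_name, columns):
    """Columns to rows: build one EvaluationLink-shaped tuple per row."""
    return [
        ("EvaluationLink", ("PredicateNode", table_name), ("List",) + row)
        for row in zip(*columns)
    ]

links = [
    ("EvaluationLink", ("PredicateNode", "my CSV table"), ("List", 1.0, 2.0)),
    ("EvaluationLink", ("PredicateNode", "my CSV table"), ("List", 3.0, 4.0)),
]
columns = links_to_columns(links)
print(columns)
print(columns_to_links("my CSV table", columns) == links)
```

The round trip only works this cleanly when every row has the same number of entries, which is what a table guarantees.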

mjsduncan commented 2 years ago

thanks for the detailed reply, linas. i'm definitely thinking of importing data as atoms, and ultimately converting it into a more compact and semantically meaningful form than the original tables, otherwise what would be the point? what i'm interested in is importing a whole database, tho i can see the value in what would be a sql interface module so info from a sql db could be imported as needed for evaluation & inference.

my question is motivated by the existence of a relational db schema and related tools that are being used to compile data on model organisms: http://gmod.org/wiki/Overview#Chado_and_BioSQL (description of schema) https://www.alliancegenome.org/ (meta organization with 6 model organism groups using shared infrastructure)

importing these into an atomspace would be fertile ground for developing automated biological inference systems

linas commented 2 years ago

Hi Mike,

The way to move forward is to open a new issue on github, describe the general desired features, and reference the discussion here. We should continue the discussion there.

To build this, make things concrete: pick the one or two schemas that seem to be the most important for you, and copy them into the issue. Then write down the matching AtomSpace structures that these would be converted into. Basically, provide a detailed example. This will allow me to think concretely about how to implement things.

Where's the data? Do you just want to import database dumps stored in some compressed files? Or will you set up a server somewhere, running some DB, that will hold the data? If there's some server, what is it? postgres? mariadb? redis? something else? I would need to know, in order to connect to it and interact with it.
