pgsql schema tables to triple store aka graphification

mjsduncan commented 1 year ago

hi linus! the idea is that each distinct column becomes a link between the primary key in each row and the value of that row for that column. i think sql-ness guarantees that there is a unique way to do this

linas commented 1 year ago

It would be best if you gave examples. I think that what you said was this:

Given a table stuff

columna  colb colc primarykey
------- ----- ---- --------
 foo     bar  baz    42
ding     dang dong   66

you want to get

(EvaluationLink
    (PredicateNode "stuff.columna")
    (List
       (Concept "foo")
       (Number 42)))

(EvaluationLink
    (PredicateNode "stuff.colb")
    (List
       (Concept "bar")
       (Number 42)))

(EvaluationLink
    (PredicateNode "stuff.colc")
    (List
       (Concept "baz")
       (Number 42)))

and likewise for the next row, is that right?

This is opposed to the current default map of


(EvaluationLink
    (PredicateNode "stuff")
    (List
       (Concept "foo")
       (Concept "bar")
       (Concept "baz")
       (Number 42)))

which preserves the original row structure.

Yes, I like that idea! it would use more memory, but I like the "denormalized" nature of this. It does make foreign keys a bit awkward; see next post.

I propose adding a "roll your own mapping" to the API, and this would be a good test case for that. Basically, you as a user, could define any kind of mapping you'd like.

linas commented 1 year ago

There is a bit of nastiness with foreign keys. Suppose there is a second table things that looks like

ca     cb    foreignkey primarykey
----   -----    ----  --------
 boo     far    42     24
 coo     gar    42     25
duke     dack   66     99

The point here is that the primary key is unique for that table, but the foreign key, there will typically be duplicates. (The foreign key references the primary key of the earlier table.)

Then you get the somewhat awkward content

(EvaluationLink
    (PredicateNode "things.ca")
    (List
       (Concept "boo")
       (Number 24)))

(EvaluationLink
    (PredicateNode "things.cb")
    (List
       (Concept "far")
       (Number 24)))

(EvaluationLink
    (PredicateNode "things.foreignkey")
    (List
       (Number 42)
       (Number 24)))

and somehow, outside of the system, you would need to know that things.foreignkey should match up with stuff.primarykey. This feels awkward. I'm not sure how build a better solution.

linas commented 1 year ago

One can permute things around, but this does not seem to gain anything.

Instead of writing

(EvaluationLink
    (PredicateNode "stuff.columna")
    (List
       (Concept "foo")
       (Number 42)))

one can write

(Member 
   (Concept "foo")
   (List
      (Concept "stuff.columna")
      (Number 42)))

But this does not feel "better", just "different". It's still the same triplet of three things: (col-value, coln-name, rowid) and you can shuffle that around however, but you cannot dissolve it.

mjsduncan commented 1 year ago

thanks, @linas! your first example is exactly what i'm thinking, except i don't think the table reference in the predicate name is necessary unless there are tables with different variables with the same name? is that even allowed in a sql schema?

i also don't think that the foreign key duplication is a problem. in your example, doesn't the duplication just mean that samples 24 and 25 have the same values (line 42) for the variables that live in the table referenced by that foreign key? i think spivak's category theory representation of sql schema shows that a well formed set of tables is isomorphic to a unique set of triples.

linas commented 1 year ago

Hi @mjsduncan

tables with different variables with the same name?

Two different tables can have columns with the same name, holding completely different data. Thus only tablename.colname is unique ... for a given SQL DB. If we try to map two or more of these into the atomspace at the same time, there could be collisions, if both DB's have tables with the same name ...

samples 24 and 25

They necessarily have different values. That's the whole point of foreign keys -- to allow many-to-one mappings. If it was always one-to-one, then you wouldn't use two tables, you would just use one table.

Spivak ... well formed set of tables

Heh. Long before Spivak, and before the internet, this was called "table normalization" and it is covered in textbooks, and consistently caused confusion in the poor students studying this stuff. You can see evidence of this even today, with assorted blog posts and stack-exchange questions about "table normalization". You will still see db admins explaining why they can't actually do it, due to reason xyz. You will see plenty of DBs designed by people who "never got the concept", and no one ever came around afterwards to clean it up.

And don't get me started on software written by grad students. It is almost the worst, lowest-quality software out there. I say almost, because software written by full professors is even worse. You might believe I'm saying this to be obnoxious or bombastic or gate-keeping, but it's true -- I worked at IBM for a while, and we'd get regular requests from both IBM Research, internally, and from external universities, wishing to license code for resale. We'd look at it, and the code would always be trash, you'd get estimates like "it would take us 30 man-years to rewrite the code, at a cost of $10 million, and the most we could ever make selling it would be $5 million because there are only 1000 users world-wide" type of estimates. This is the hard reality of this kind of stuff, why proprietary software is so freakin expensive, and why open source became really popular. Open-source overcame this crazy cost-vs-maintainability, (re-)design & bug-fixing barrier.

In short, I pin no great hopes on software and datasets from the genomics community; I figure its mostly written by grad students and professors. Maybe it's clean ... but the likelihood of that seems low.

mjsduncan commented 1 year ago

so insuring non-duplication of table names would have to be carried after import or by normalizing the db beforehand. without the table reference in the predicate node name, duplicate column names with different data would result in multiple values linked through that predicate node, right? it seems like it would still work...

i haven't tried spivak et al's categorical databse software cql but the theory is interesting because among other things it shows that schema map to simplicial sets, and the formulation can handle duplicate primary keys so you could potentially keep track of multiple probabilistically weighted entries.

mjsduncan commented 1 year ago

anyway, from the agi point of view, sucking an external db into the atomspace via the bridge is like a metabolic modality in cyberspace allowing absorption of dbs found in the environment to harvest for new knowledge!

linas commented 1 year ago

Hi @mjsduncan

Some short remarks:

after import

The atomspace bridge does not "import" the data. It is a two-way bridge, or a window between two places, or a "view" of the SQL data. The goal of writing the bridge was to avoid all the down-sides of data import/export. It gives you live access to a live database: changes there are reflected in the Atomspace, and vice-versa. It hooks the two together, unifying them into one. A harness.

by normalizing the db

This is not usually something that is either easy or quick. Whoever designed some particular SQL did it that way for a reason, and it's not going to be easy to do archeology to figure out why they did it that way, or to just "fix it" without risking data loss & inconsistency. But I suppose it depends on the DB.

insuring non-duplication of table names

But this is very easy: just give each DB a name! Like ... one line of code!

without the table reference in the predicate node name, duplicate column names with different data would result in multiple values linked through that predicate node, right?

I dunno, maybe; but why the heck would you want that? If you've got one database, with a column called "id" and it has peoples phone numbers, and a second database, with a column called "id", and its their social security numbers, why would you want to mash all that into one big ball?

theory is interesting

Careful. Theory is always interesting, but making things easy-to-use is much harder. That's why I keep asking for specific, concrete examples. Actually written in actual Atomese, reflecting what actual users would have to go through, day-to-day. The user experience is what matters.

mjsduncan commented 1 year ago

hi @linas, my use case is actually importing a snapshot of the data in the database as knowledge for the system. i don't need the extra table structure meant to optimize user queries. the theory shows that there is a minimal canonical graph/triple store representation of the database and that is what i want to put into the atomspace. the ultimate end user is the agi!

linas commented 1 year ago

Hi @mjsduncan If you can explicitly demonstrate what you want, then maybe I could do that. I cannot make progress without some explicit sketch of how to do this.

mjsduncan commented 1 year ago

hi @linas ! again, you laid out in your first couple comments exactly what i'm looking for. as the category theory makes clear (and you seem to agree) it's the natural thing to do. you say the mapping from the foreign key value to the primary key is "outside the system" but isn't that connection an explicit part of the schema?

mjsduncan commented 1 year ago

@linas here is a rough example of a way to deal with keys:

(EvaluationLink
    (PredicateNode "things.ca")
    (List
       (Concept "boo")
       (concept "key_24")

(EvaluationLink
    (PredicateNode "things.cb")
    (List
       (Concept "far")
       (Concept "key_24")))

i know below is wrong but it represents keys better than just a number

(folding operation                                                                      
    (list
       (pointer-function-to-table "things.foreignkey"                 
       (list of values in table row 42))
       (Concept "key_24"))

then there would need to be a pre or post processing step to remove redundant columns

linas commented 1 year ago

Hi @mjsduncan

here is a rough example of a way to deal with keys:

How does this differ from the existing structure? It appears that you took a key that is a number, and just appended the string key_ to the front of it, and then stored it as a string...

Please be aware that SQL allows keys to be numbers, strings, dates, timestamps, jpeg images, ... anything at all. The current bridge code does not treat keys differently than other columns: if a column is a number, it will turn into a (NumberNode...) and if it is a string, it will become a (ConceptNode ...) and I currently punt on dates, timestamps, jpeg images ...

Yes, I envision allowing a user to specify a translation mapping "convert table X column c1, c2 into Atoms type t1, t2" but just ... haven't gotten around to that yet.

folding operation

I don't understand what you're trying to accomplish there.

import

Have you actually tried to use the current code, to import the flybase dataset, and/or to use the browser to explore it?

When I tried to import the flybase in full, it got up over 100 GB, and I gave up at that point. The browser only loads as much as needed to browse some particular row or column, and seemed to work well, as long as I didn't tumble over some column with a bazillion rows in it. But I know nothing of drosophilia or genetics, so got bored after a while.

opencog / atomspace-bridge

pgsql schema tables to triple store aka graphification #1