The bridge connects pre-existing SQL DB's to the AtomSpace, so that data can move between the two, in both directions. It is meant to allow the AtomSpace to access foreign, external data, and manipulate it using the AtomSpace software stack. This is not intended for generic save/restore of AtomSpace contents. For that, see the native atomspace-rocks RocksDB StorageNode and the atomspace-cog network StorageNode.
The following are desired features:
At this time, only Postgres is supported.
Version 0.2.1 -- The current implementation can load table descriptions and table contents from Postgres DB's. It can also load related rows and columns (rows and columns joined by a common column name). Two demos: a basic demo, showing the basic idea, and an ASCII table browser, allowing you to pilot around, bouncing from table to table, along joins. It's an ASCII browser because all of the graphical browsers for the AtomSpace have been neglected. (These: atomspace-explorer, atomspace-typescript and cogprotolab.)
Per request of Mike Duncan, Dec 2022. Map the FlyBase Drosophila Genome Database (current release here) into the AtomSpace, using the Chado schema given here.
Here's a sketch for a generic mapping. This could work for any SQL database.
For each row in some table tablename
, create a conventional
Atomese EdgeLink.
Edge ; A single row in the table.
Predicate "tablename" ; Replace by actual name of table.
List ; This list is one row in the table.
Concept "thing" ; Entry in that column, if its a string.
NumberNode 42 ; If that column type is a number.
...
Concept "stuffs" ; Last column in the table.
The biggest design question is what to do with primary and foreign keys. The simplest solution is to "do nothing" and just import rows from SQL straight-up. Whatever primary/foreign keys are in the SQL tables, they will also appear in the AtomSpace. They will join automatically, due to fundamental AtomSpace architecture. With this solution, writing complex queries appears to be straight-forward (and easier than writing SQL queries). This works and is surprisingly flexible.
Other mappings are possible; however, there is no natural way of asking Postgres which table columns are foreign keys, and which other tables they might reference. If this info was available, then we could have a an Atomese representation that does NOT keep any keys at all in the AtomSpace, and instead just makes direct links between table rows. So, for example:
Edge
Predicate "some join relation"
Set
Edge ;;; row in the host table
Predicate "some tablename"
List ...
Edge ;; row in the target table.
Predicate "another tablename"
List ...
Edge ;; if more than two tables are joined.
Predicate "yet another tablename"
List ...
See the OpenCog wiki:
From what I can tell, the above alternate form uses about the same amount of RAM as having explicit keys in each row. It also takes just about the same amount of time to query over. So it does not seem to offer any size or performance advantage over brute-force primary/foreign keys.
It does seem to make things more readable, more "natural" in Atomese. It provides more flexibility: you can create and destroy joins at any time. You can join some rows but not others. All the usual stuff that makes the AtomSpace more flexible than SQL.
For just right now, we punt, and store the naked PRIMARY/FOREIGN KEY values as Atoms in the AtomSpace. It feels ugly, but it works.
SQL table definitions provide a definition of the columns of that
table. These are imported into the AtomSpace, where they are used
to find the names of the columns, and the types. (Note that the
EdgeLink
example above does not contain the column names.)
Below is the current mapping.
The SQL column names are recorded as VariableNode names. The type of each variable aka column is given.
Signature
Predicate "tablename" ;; Replace by actual table name.
List
TypedVariable
Variable "name of column 1"
Type 'ConceptNode ;; For SQL text/varchar
TypedVariable
Variable "name of column 2"
Type 'NumberNode ;; For SQL numbers
...
See the OpenCog wiki:
The base AtomSpace supports only a few primitive types; there's no direct analog to the varied primitive types that SQL has. So far, it doesn't seem to be needed. Perhaps the set of primitive types could be enriched, e.g. by creating:
StringNode
for SQL TEXT
and VARCHAR
(instead of ConceptNode)DateNode
for SQL dates and times.IntegerNode
(the existing
NumberNode is a vector of
floats. It works great, but some people just love ints.)Specific databases might benefit from custom types:
GeneNode
ProteinNode
URLNode
HumanReadableDescriptionNode
YourFavoriteIdeaHereNode
The agi-bio project provides the first two types. The cheminformatics project provides types for atomic elements and molecular binding.
Below is a sketch of how things could work. For actual examples that actually run and actually work correctly, see the examples directory.
Examples of accessing data in a foreign database.
; Describe where it is located.
(define foreign-db
(BridgeStorageNode "postgres://example.com/foo?user=foo&passwd=bar"))
; Open it.
(cog-open foreign-db)
; Load the *entire* table `gene.allele`. Optional; only if you
; actually want the whole table in RAM. The table name is known
; in advance, and is given directly.
(fetch-incoming-set (Predicate "gene.allele"))
; Instead of loading entire tables, perhaps we only want all
; rows of all tables that mention gene CG7069.
(fetch-incoming-set (Concept "CG7069"))
; Perhaps we plan to do a join. So, load *all* tables that have
; a given column name. In this case, all tables having a column
; called "genotype".
(fetch-incoming-set (Variable "genotype"))
; StorageNodes do have the ability to run generic queries, and
; we could, in principle, translate at least some of the simpler
; Atomese queries into SQL, and run those.
(fetch-query (Meet (Edge (Predicate "gene.allele")
(List (Concept "CG7069") (Glob "rest of the row")))))
; The fetch-query function is already built into the base, core
; AtomSpace. We could also create custom functions:
(cog-bridge-get-row "gene.allele" (Concept "CG7069"))
; Custom functions are not appealing, since they don't work
; with the rest of the StorageNode and ProxNode infrastructure.
It could be convenient to introduce special-purpose
Atom types, such as
GeneNode
from the agi-bio project. This would allow queries such
as
(fetch-incoming-set (Gene "CG7069"))
which could be marginally more efficient, presuming that the
gene.allele
table schema was properly declared:
Signature
Predicate "gene.allele"
VariableList
TypedVariable
Variable "symbol"
Type 'GeneNode
TypedVariable
Variable "is_alleleof"
Type 'NumberNode
TypedVariable
Variable "propagate_transgenic_uses"
Type 'NumberNode
TypedVariable
Variable "gene_is_regulatory_region"
Type 'NumberNode
...
This module works. It can load tables, it can load joining columns, and it can join rows, all "automatically". The build process is identical to that of other modules in OpenCog.
sudo make install
at the end.apt install postgresql postgresql-client libpq-dev
Be sure to install the pre-requisites first! Perform the following steps at the shell prompt:
cd to project root dir
mkdir build
cd build
cmake ..
make -j4
sudo make install
make -j4 check
Some issues to ponder:
What's the right solution to the foreign-key dilemma?
The flybase dataset seems to have a large number of extremely complex SQL stored procedures. What are they? What do they do? What should we do with them?
What to do with views?
What if the same table appears in multiple schemas?