opencog / atomspace

The OpenCog (hyper-)graph database and graph rewriting system
https://wiki.opencog.org/w/AtomSpace

Storing large links in postgres. #650

Closed leungmanhin closed 1 year ago

leungmanhin commented 8 years ago

I noticed that errors sometimes occur when doing sql-store, and they look similar to this one:

Can't perform query rc=-1 (7) ERROR: index row size 4440 exceeds maximum 2712 for index "atoms_type_outgoing_key";

I may be able to provide more info on how to reproduce it if I see this error again...

linas commented 8 years ago

Googling this error message explains where it's from: you are attempting to store a link with an outgoing set that has more than 338 atoms in it (in fact, it seems to have 554 atoms in it).

I'm assuming this is from a wikipedia parse, and so it's some "sentence" that is 554 words long. This suggests the sentence splitter is misbehaving.

linas commented 8 years ago

We should probably alter the postgres backend to check the size of the outgoing set, and also the node name length, and throw an error if they are too long.

linas commented 8 years ago

Pull request #651 adds the check. Now, instead of getting a cryptic SQL error, you will get an error message that has more of an explanation about the problem. It doesn't fix the problem, but gives you something to think about.

leungmanhin commented 8 years ago

Thanks, yes indeed that's from a wikipedia parse. I'll have a look at the sentence splitter as well.

linas commented 8 years ago

Well, the pull request does a throw. If you put a print in there, you can print the atom that caused the failure.

inflector commented 8 years ago

@leungmanhin Can you give me an easy way to reproduce this problem? Is the problem only that an outgoing set is too large? i.e. a Link has too many atoms in it?

leungmanhin commented 8 years ago

@inflector Yes, the problem is that an outgoing set is too large; it can be reproduced by doing:

(use-modules (opencog nlp lg-dict))
(use-modules (opencog persist-sql))

; a way to generate a large outgoing set
(lg-get-dict-entry (WordNode "drink"))

; store to db
(sql-open "mycogdata" "opencog_user" "cheese")
(sql-store)

; should then give an exception

This may be useful for creating a database: https://github.com/opencog/atomspace/tree/master/opencog/persist/sql#user-and-database-setup

linas commented 8 years ago

Oof dah. So:

(cog-arity (gdr (gar (lg-get-dict-entry (WordNode "drink")))))

reports 8210 atoms in the outgoing set. It begins like this:

(SetLink
   (LgWordCset
      (WordNode "drink")
      (LgOr
         (LgConnector
            (LgConnectorNode "XXXGIVEN")
            (LgConnDirNode "+")
         )
         (LgConnector
            (LgConnectorNode "A")
            (LgConnDirNode "-")
            (LgConnMultiNode "@")
         )
         (LgAnd

and it's the LgOr that has the 8210 disjuncts in it. I'm not sure what to do.

linas commented 8 years ago

OK, here's the deal. When the SQL backend was written 8 years ago, it originally stored each element in the outgoing set as its own row in a table, the "edges" table. It became immediately apparent that this was excessively verbose and inefficient: for example, instead of doing 1 query to get a link, you had to do N queries (or alternately 1 query which returned N rows, and you then had to process each row).

To avoid this bloat and overhead, the outgoing set was stored as a list, as part of the link atom. Worked great, until today :-)

The question is: what to do about it? Well, the old "edges" code is still there; it's if-def'd out. It's surely bit-rotted and probably won't compile any more.

(a) We could have a check so that the edges code gets used only if a link has an outgoing set larger than N=300.
(b) We could store the outgoing set in a blob.
(c) We could use the edges code, but then write some magic SQL query optimization code to create a table view or an inner join, or some kind of SQL magic, that would cut down on the query processing overhead, and keep things fast.

I know enough SQL to know that (c) is possible, but I don't know enough to make (c) work well.

Options (b) and probably (c) have a huge drawback: one of the important queries is "given this outgoing set, find me the atom that contains it". This is a must-have query; it's fundamental.

Option (a) allows this query, but it does not by itself provide safety. So, today, there is a UNIQUE INDEX on the table that prevents inserting two atoms with the same outgoing set. Before this index, such inserts were theoretically impossible, but somehow managed to happen anyway (maybe a threading issue? a race condition??)

In order to get safety with option (a), we would have to do atomic transactions, whenever we used the overflow path.

Also, the UNIQUE INDEX makes link lookup fast -- it's not clear to me how to index the edge set so that we can quickly look up links.
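
For reference, a minimal sketch of how the current design handles this, assuming the column names from the CREATE TABLE shown later in this thread (the uuid column is an assumption, it is elided there): the outgoing set lives in an array column, and the UNIQUE index on it both rejects duplicate links and turns "given this outgoing set, find the atom" into a single index probe.

-- Uniqueness of links, and the lookup index (this is also the index
-- whose row-size limit produces the error at the top of this issue):
CREATE UNIQUE INDEX atoms_type_outgoing_key ON Atoms (type, outgoing);

-- "Given this outgoing set, find me the atom that contains it":
SELECT uuid FROM Atoms
 WHERE type = 42 AND outgoing = ARRAY[101, 102, 103]::BIGINT[];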

linas commented 8 years ago

This might be a case, maybe, where having a backend to a proper graph database would be better. By "proper", I mean one which doesn't use REST and java hijinx, paying a huge performance penalty. i.e. we'd need to have a graph database with an API that is on par with postgres, rather than a few orders of magnitude slower. Of course, exploring this option is also very labor-intensive.

inflector commented 8 years ago

It looks like, from my read of the code and your comments, that the indices are there mainly to give UNIQUE constraints good performance. For example, this recent comment for this bug, which you wrote, hints at the UNIQUE constraint being the real issue:

// The Atoms table has a UNIQUE constraint on the
// outgoing set.  If a link is too large, a postgres
// error is generated:
// ERROR: index row size 4440 exceeds maximum 2712
// for index "atoms_type_outgoing_key"
// The simplest solution that I see requires a database
// redesign.  One could hash together the UUID's in the
// outgoing set, and then force a unique constraint on
// the hash.

So, this proposes using a hash of the outgoing set as a proxy value over which one could maintain the UNIQUE constraint. This would have the side effect of being much faster if one used a relatively fast hash here and added a column for that hash with the UNIQUE constraint. With this approach, one could use arrays of arbitrary size and still get the unique constraint.
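
A minimal sketch of the idea (the column and constraint names here are illustrative, not taken from the actual schema):

-- Hash the outgoing UUIDs into a fixed-width value and move the
-- UNIQUE constraint onto the hash, so the index rows stay small no
-- matter how long the outgoing array is.
ALTER TABLE Atoms ADD COLUMN out_hash BIGINT;
ALTER TABLE Atoms ADD CONSTRAINT atoms_type_outhash_key UNIQUE (type, out_hash);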

How much do you need a guarantee on the hash uniqueness? Do you want something cryptographic, or nearly so, like MD5 or SHA1? We could test for the typical cases but I'd bet they're pretty fast compared with the overhead of a postgres index.

@linas You mentioned the index speeding up link lookup, by which I assume you mean AtomStorage::getIncomingSet, which does a:

SELECT * FROM Atoms WHERE outgoing @> h->_uuid

which would definitely benefit from the index. AtomStorage::getLink would as well, but the same index on the hashed outgoing set used for the UNIQUE constraint would also work in this case.

So the question is: how often is the call to fetch-incoming-set used? I see a couple of references in the scheme code, but it's not clear to me how central to the algorithm these calls are. If they are indeed performance-critical, we'd be better off having the atoms loaded into the AtomSpace, I'd think.

What is the typical and expected use case for the persistence? It appears to be to support atoms which are useful and which we want to retain, as part of the analogue to long-term memory for the AtomSpace. If this is the case, then it would be easier and more performant to mark the atoms with a flag and then persist them in batches asynchronously, as part of a single transaction, to reduce transactional overhead rather than writing them one at a time. If we're going to look at a redesign, we might as well do the one we'll want in two years.

How big a factor is speed and performance for our use cases, writing the atoms of wikipedia parses, for example? Or loading them up for another testing run? How often does this happen?

There are a few performance issues I've noticed, for example. The persist code is using ODBC drivers. That might have made sense before AtomSpace was tied to Postgres, but once you're supporting one database, you might as well go fast and close to the metal. ODBC queries are much slower for the types of queries that I see here in the AtomSpace persist code. You can use the C drivers for postgres and commands like PQprepare and PQexecPrepared that completely eliminate the query processing overhead since one can exec a prepared query with different parameters as many times as you want.
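
For illustration, the SQL-level analogue of that prepared-statement pattern looks roughly like this (the table and column names are made up for the example; from C, PQprepare/PQexecPrepared play the same roles):

-- Parse and plan the statement once ...
PREPARE insert_atom (BIGINT, SMALLINT, TEXT) AS
    INSERT INTO Atoms (uuid, type, name) VALUES ($1, $2, $3);

-- ... then execute it repeatedly with different parameters, skipping
-- the parse/plan step each time.
EXECUTE insert_atom (1001, 3, 'drink');
EXECUTE insert_atom (1002, 3, 'water');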

@leungmanhin How much does performance matter in your typical use case? Are you spending a lot of time waiting for writes or loads from the postgres store?

linas commented 8 years ago

Woof. Lots of questions.

On Tue, Feb 23, 2016 at 7:27 PM, Curtis Faith notifications@github.com wrote:

It looks like from my read of the code and your comments that the indices are there mainly to give UNIQUE constraints good performance. For example, this looks like the recent comment for this bug which you wrote hints at the UNIQUE constraint being the real issue:

// The Atoms table has a UNIQUE constraint on the
// outgoing set.  If a link is too large, a postgres
// error is generated:
// ERROR: index row size 4440 exceeds maximum 2712
// for index "atoms_type_outgoing_key"
// The simplest solution that I see requires a database
// redesign.  One could hash together the UUID's in the
// outgoing set, and then force a unique constraint on
// the hash.

I forgot about this idea. I guess it could work, modulo comments below.

So, this proposes using a hash of the outgoing set as a proxy value over which one could maintain the UNIQUE constraint. This would have the side effect of being much faster if one used a relatively fast hash here and added a column for that hash with the UNIQUE constraint. With this approach, one could use arrays of arbitrary size and still get the unique constraint.

Yes. Recall that atoms are immutable, so in fact, the outgoing set is used in only two ways: (a) to find out if the link is already in the database, and if so, fetch its TV, and (b) when get-incoming-set is called.

For use-case (a), it's enough to store the hash only (if we can avoid hash collisions; but I doubt we can, see below); for case (b) we have to store the full outgoing set anyway. Or something.

How much do you need a guarantee on the hash uniqueness? Do you want something cryptographic, or nearly so like Md5 or SHA1.

In the end, identifying an atom has to be totally and completely unique, so we cannot depend solely on a hash; we'll need hash buckets. Hashes can collide due to the birthday paradox. A typical atomspace might have 1 million to 1 billion atoms in it, so I guess birthday collisions remain rare for MD5 ....
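
(A rough birthday-bound estimate: for n atoms and a b-bit hash, the chance of at least one collision is about n²/2^(b+1), so 10^9 atoms against a 64-bit hash gives a few-percent chance of some collision, while against a 128-bit hash like MD5 it is on the order of 10^-21.)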

We could test for the typical cases but I'd bet they're pretty fast compared with the overhead of a postgres index.

Heh. Postgres powers large parts of the interwebs; I wouldn't bet against it.

@linas https://github.com/linas You mentioned the index speeding up link lookup, which I assume you mean AtomStorage::getIncomingSet which does a:

SELECT * FROM Atoms WHERE outgoing @> h->_uuid

which would definitely benefit from the index.

I don't think the incoming set has an index, but it would probably be wise to have one somehow.
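
One possibility, purely as a suggestion (the current schema does not have this), is a GIN index on the array column; GIN's built-in array operator class supports the @> containment operator that getIncomingSet uses:

CREATE INDEX atoms_outgoing_gin ON Atoms USING GIN (outgoing);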

The AtomStorage::getLink would as well but the same index on the hashed outgoing set used for the UNIQUE constraint would also work in this case.

yes

So the question is: how often is the call to fetch-incoming-set used?

In principle, it could be a lot: this is a very important call to explore the immediate neighborhood of a graph. In practice, almost no one is using any backend for almost anything. Right now there are two or three existing uses, one planned use:

  • language learning, to store counts and statistics
  • Whatever ManHin is doing with the parsed wikipedia stuff
  • Whatever the AGI-Bio people are doing w/ protein and DNA data.

Planned use:

I see a couple of references in the scheme code, but it's not clear to me how central to the algorithm these calls are. If they are indeed performance critical. We'd be better off having the atoms loaded into the AtomSpace, I'd think.

?? Don't know about AGI-Bio, but my experience with language learning is that it runs fastest if the atomspace is aggressively trimmed, and as much as possible is pushed out to postgres. The reason for this is that adding and removing atoms becomes very expensive when the atomspace is big; it's a lot cheaper/faster to keep them salted away in postgres.

What is the typical and expected use case for the persistence?

Besides the above, it is currently the only way of doing distributed atomspaces. You can have N atomspaces all talking to the same postgres server, and they'll stay in sync (if you're not stupid about it). I have gone up to N=4. I think it can scale to N=100 depending on use case, although you also have to have a fancy postgres setup to get to that. Maybe even N=1000; it really depends a lot on the use case.

It appears to be to support atoms which are useful and want to be retained as part of the analogue to long-term memory for the AtomSpace.

Yes.

If this is the case, then it would be easier and more performance to mark the atoms with a flag

Yes. It's called "VLTI" in the AV. Right now, VLTI is not used for this, or for anything.

and then persist them in batches asynchronously as part of a single transaction to reduce transitional overhead rather than one at a time.

I seriously doubt it. That's just not how SQL works.

Anyway, the stores are already asynchronous. The backend has 4 threads doing write-backs. The threads have high/low watermarks and everything. You can configure a number other than 4, if you wish.

If we're going to look at a redesign, we might as well do the one we'll want in two years.

How big a factor is speed and performance for our use cases, writing the atoms of wikipedia parses, for example? Or loading them up for another testing run? How often does this happen?

Beats the heck out of me. Completely unknown. Last time I profiled this stuff, it was 2008.

There are a few performance issues I've noticed, for example. The persist code is using ODBC drivers. That might have made sense before AtomSpace was tied to Postgres, but once you're supporting one database, you might as well go fast and close to the metal. ODBC queries are much slower for the types of queries that I see here in the AtomSpace persist code. You can use the C drivers for postgres and commands like PQprepare and PQexecPrepared that completely eliminate the query processing overhead since one can exec a prepared query with different parameters as many times as you want.

Yeah, ODBC sucks. But it was the wave of the future in the 2000's. Funny how "wave of the future" technology turns into "it sucks" technology in about a decade. Especially when it comes from Microsoft.

Yes, we can ditch ODBC. I kind-of doubt that it's a bottleneck here, but I dunno.

@leungmanhin https://github.com/leungmanhin How much does performance matter in your typical use case? Are you spending a lot of time waiting for writes or loads from the postgres store?

Performance matters. The three mentioned above are the only "typical" ones we have. The bottlenecks in them are unknown. The SLA (service level agreement) is unknown.

--linas


leungmanhin commented 8 years ago

@inflector Currently my usage is more about retaining atoms, and I just load them back into the atomspace when I need them. But as Linas mentioned above, if the atomspace is big, it's a lot cheaper/faster to keep them in postgres, so I think performance will matter in the near future.

inflector commented 8 years ago

Yeah, I've been through quite a few of those wave-of-the-future cycles. Some of them take, but far more of them are dead ends.

The reason I'm asking so many questions is that Ben wanted me to take a look at this bug since he figured you'd be busy doing other things, and this is preventing @leungmanhin from finishing his work, and I think he knew I had a database background.

Database internals was my sole focus in the late 80s and early 90s. It's been 15 years, but I've even had deep architectural discussions with Tom Lane and a few of the other long-time PostgreSQL team members, back when I was using PostgreSQL to store price data for realtime data feeds and wanted higher transaction performance. I got Tom to investigate one issue, and while he didn't take my suggestion (which my testing indicated would have resulted in a 20x improvement of write-transaction throughput for small writes on AIO-capable filesystems), he did manage to improve those write transactions by 60% in a few hours' work looking at the change I suggested.

So, I know the core internals of PostgreSQL fairly well, or at least, I once did. And I doubt they've changed too much given Tom's conservative personality.

Since preserving the performance of getIncomingSet is important, the row-per-atom outgoing-set representation is the best way to go. It may sound like it will be a big performance hit, but that's only if you go through ODBC and don't use the postgres calls for doing simultaneous inserts (as described here: http://stackoverflow.com/questions/19516645/how-to-insert-a-single-row-in-the-parent-table-and-then-multiple-rows-in-the-chi). Databases, and PostgreSQL especially, are optimized for lots of small rows. That was originally the bottleneck for the most common use case, the ubiquitous master-detail aka invoice representation. So inserts and selects of lots of small records are optimized pretty well.

I'm betting that you'd get equivalent or better performance than our existing array mechanism by dropping ODBC and moving to a single-statement insert (as described above). Even with ODBC, you can still implement multi-row inserts; you just can't do both the atom insert and the outgoing-set insert in one call. At least I couldn't find a way to do it.

But there's only one way to find out, and that's to do it. That's the one truth that always holds with complex database access, especially over a network. Test and see. Since you've got the code already working, the first step is to just use that code and see how it performs. Then move to multi-row inserts with ODBC. Then, if that isn't fast enough, move to the single-insert approach made possible by going to PostgreSQL directly. One PostgreSQL user recently found libpq to be 14X faster than ODBC for inserts. See: http://www.postgresql.org/message-id/9CE034E149417949A58AA9A4FA7E1C5584AF2B48@ORSMSX109.amr.corp.intel.com
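
For illustration, a sketch of that single-statement parent-plus-children insert in Postgres, using the Atoms/Edges table names proposed later in this thread (the uuid column and the literal values are made up):

-- Insert the link row and all of its outgoing-set rows in one
-- statement, via a data-modifying CTE (one network round trip).
WITH new_link AS (
    INSERT INTO Atoms (uuid, type) VALUES (5001, 77)
    RETURNING uuid
)
INSERT INTO Edges (src_uuid, dst_uuid, pos)
SELECT new_link.uuid, t.dst, t.ord - 1
  FROM new_link,
       unnest(ARRAY[101, 102, 103]::BIGINT[]) WITH ORDINALITY AS t(dst, ord);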

Even with a row per atom in PostgreSQL, there is a limit to the size of the outgoing set that can be saved all in one go, based on the two-byte integer number of parameters in a prepare statement (i.e. a 32K limit), which translates to 16K in the outgoing set. This limit can be worked around by chunking the writes into 16K pieces in a loop. For the most common case it will be one statement; for 20K in the outgoing set, it would be two inserts.

linas commented 8 years ago

You are welcome to do a complete re-write. In order to maintain compatibility for at least a little while, I suggest making a copy of the existing code, and creating a driver-version-two in a parallel directory. I have existing language datasets, so does Rohit, so the old code has to continue working as-is.

There is also a different bug that needs to be solved: the LG disjunct set is being represented incorrectly. Instead of having an LgOr of 8210 disjuncts, there instead should just be 8210 disjuncts linked to the given word. That way, new disjuncts can be added, and old ones removed, without re-writing the whole list. I'm creating a new bug report for that issue.

linas commented 8 years ago

I opened bug opencog/opencog#2050 to describe the LG fix. It might be easier to fix that, than to redesign the SQL backend. In some ways, it is also a much more serious issue: the LG disjuncts are represented in a fatally incorrect way.

Side-comment: if someone is creating data that overflows the current SQL design, they are almost surely mis-using the atomspace, i.e. are representing their data incorrectly. Which is the case here ... Basically, it's kind of crazy or incorrect to have an outgoing set that is larger than 5 or 10. An outgoing set of size 20 is insanely, crazily large, and hundreds... thousands... is just wrong.

Motto of the day: the atomspace is a "DNF engine" -- let's use it as it's meant to be used!

linas commented 8 years ago

Another side-comment: the description in opencog/opencog#2050 illustrates a "typical" query for cog-fetch-incoming-set and why one might expect it to be fast.

leungmanhin commented 8 years ago

Also a side note: the lg-get-dict-entry that I used above was just a way that came to my mind that night to quickly create a link with a large outgoing set in order to reproduce this bug; nlp-parse doesn't generate it, sorry for giving the wrong impression. I think this bug is still worth fixing (but in a parallel directory, as Linas suggested).

linas commented 8 years ago

OK. I am also thinking that "good" data designs should not have SetLinks/ListLinks with more than 10 or 20 items in them. There are several reasons for this:

So, I'm thinking: instead of using and storing SetLinks/ListLinks, use MemberLinks instead. This solves both problems:

As a side effect, all outgoing sets will be tiny, thus avoiding the SQL bug!

inflector commented 8 years ago

Okay, I've finally got to the point where I have a performance comparison for the new Edges-table storage of outgoing sets, with hashes for collision avoidance. I tested random stores of 1,000,000 atoms generated using the new RandomAtomGenerator class derived from the benchmark code with some improvements. The arity of links is determined using a Poisson distribution with a mean arity of 5.0 for this test.

The test data was 70% nodes with the remaining 30% (300,000) links distributed as follows:

 17,888 atoms at height 3
 84,872 atoms at height 2
197,240 atoms at height 1

The tests were conducted using the exact same schema but in one case the outgoing links were hashed using MurmurHash2 into a 64-bit hash and stored with an associated 32-bit differentiator to be used in the event of collisions. MurmurHash2 is very fast and exhibits very good randomness for the types of data we're using here. NOTE: I have not implemented the collision handling but this test generated zero collisions.

The test does have all the UNIQUE constraints with corresponding indexes that will be needed for handling collisions so the performance of database inserts will be the same except in the case of a collision where the Edges table will have to be read and compared directly.

Here are the relevant table CREATE statements:

CREATE TABLE Atoms (
        ... other columns go here ...

    out_hash BIGINT,
    out_differentiator SMALLINT,

    -- An array of the outgoing edges; non-empty only for links
    outgoing BIGINT[],

    -- Force the uniqueness of atoms!!
    UNIQUE (type, name),
    UNIQUE (type, out_hash, out_differentiator),
    UNIQUE (type, outgoing)
);

--
CREATE TABLE Edges (
    src_uuid  BIGINT,
    dst_uuid  BIGINT,
    pos INT
);
CREATE INDEX src_idx ON Edges(src_uuid, pos);
CREATE INDEX dst_idx ON Edges(dst_uuid);
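
For reference, the two lookups this layout has to serve would look roughly like this (assumed usage, not code from the branch):

-- Reconstruct a link's outgoing set, in order (served by src_idx):
SELECT dst_uuid FROM Edges WHERE src_uuid = 5001 ORDER BY pos;

-- Incoming set of an atom (served by dst_idx):
SELECT DISTINCT src_uuid FROM Edges WHERE dst_uuid = 101;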

Here are the performance measurements for both the current implementation, where outgoing sets are stored in the outgoing BIGINT array column, and the new code, where outgoing sets are stored in an Edges table and a hash of the outgoing set is stored in the out_hash and out_differentiator columns:

                   Elapsed          per store     stores / sec  CPU time
                   -------          ---------     ------------  --------
Outgoing Array      375.91 seconds     375.91 µs          2660    168.12 seconds
Edges Table         283.23 seconds     283.23 µs          3531    117.97 seconds

These numbers are both using the current ODBC drivers. So they only compare the difference between the two approaches and not between ODBC and PostgreSQL native. So each link write is two SQL statements, an insert into the Atoms table for the link itself, and a single insert of all the rows for the atoms in the link's outgoing set. Once I switch to the PostgreSQL connection library, I'll try the single-write approach which will allow us to write both tables in one insert statement.

So it looks like this new approach is significantly faster than the current approach for writes. However, I have not yet tested AtomSpace loads and it is possible that these will be slower.

The one open issue is how to handle collisions. The easiest and most performant way to handle them is to store the hash bucket differentiator (defaulting to 0) in the links themselves. When storing a new link atom, a collision could happen for one of two reasons: by far the most common will be that the link's (Type, Outgoing Set) already exists in the database; the very uncommon case will be a collision of the hash algorithm for different underlying (Type, Outgoing Set) pairs. In the event that an identical hash value is detected, the outgoing sets will have to be compared in code. In the event of a non-duplicate outgoing-set collision, the maximum bucket differentiator for the given hash value will be incremented by one, and that number stored in the Atoms table out_differentiator column to maintain a unique (out_hash, out_differentiator) pair.
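
Roughly, the lookup side of that might be (a sketch against the schema above; the uuid column and literal values are made up):

-- On insert, probe by hash first:
SELECT uuid, out_differentiator, outgoing
  FROM Atoms WHERE type = 77 AND out_hash = 1234567890123;

-- If a row comes back but its outgoing set differs (a genuine hash
-- collision), take the next free differentiator for this (type, out_hash):
SELECT COALESCE(MAX(out_differentiator), -1) + 1
  FROM Atoms WHERE type = 77 AND out_hash = 1234567890123;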

When atoms are loaded, this bucket differentiator can be read from the Atoms table. So updates of already stored links will already have the correct out_differentiator set.

Assuming everyone concurs and no one sees a problem with adding a new data member to links, I'll proceed with implementing this collision handling algorithm.

linas commented 7 years ago

There have been huge changes to the SQL backend; see #1096 #1095 #1093 #1091 #1081 #1079 #1076 #1075 #1074 #1073 #1070

The upshot is that there is now native postgres support, and a variety of other bugs have been fixed. However, the "edges" table has been lost, and needs to be re-created to solve this particular issue.

linas commented 6 years ago

The correct long-term fix for this is probably #1502

vsbogd commented 5 years ago

Options (b) and probably (c) all have a huge drawback: One of the important queries is "given this outgoing set, find me the atom that contains it" This is a must-have query, its fundamental.

@linas, do you mean that the query you mentioned is required in order to avoid writing duplicate links into the database? In other words, to incrementally add atoms into the database?

linas commented 5 years ago

It would probably be best to close this issue, and open a new one with the same title. There's too much stale discussion in here.

@linas, do you mean that ...

I don't understand the question.

vsbogd commented 5 years ago

Ok, thanks @linas, I see. I haven't used this persistence functionality before, which is why I asked such an obvious thing.

linas commented 5 years ago

Ah, you're welcome. It's great for when your dataset doesn't fit in RAM. Also for when you want to have multiple machines work on the same dataset (since network distribution works "automatically").

linas commented 1 year ago

Closing, obsoleted by issue #3018 -- in short, the RocksStorageNode supports SetLinks of arbitrary size. The PostgresStorageNode is old and flaky and slow; it would get smaller, simpler & faster simply by cut-n-pasting the RocksStorageNode code and then tweaking it to use Postgres instead of Rocks as the backend.