Closed inflector closed 8 years ago
Let me make a wild guess as to the crash:
when you go to insert a link in the atomspace, it tries to make sure that the children are already in the atomspace. It does this by making recursive call. If each stack frame is 3KBytes and you have a stack that is 64KBytes deep, then you are probably blowing out your stack.
That's certainly the source of the blowup. The issue is that all the lower trees are loaded again and again and again. It never stores any of the intermediate values. This wasn't obvious to me as I was looking at the code until I saw this issue. It's pretty subtle what's going one. It queries for every atom at every level again, even though they were already retrieved. It's clear from the behavior that something like this is happening but not where and why exactly as it seems like the code will only fetch each atom once.
I'm not sure where the segfault is, but I didn't look yet. What concerned me was the time it took to do it as that's likely to affect much smaller cases where there are large numbers of atoms, i.e. shallower but wider trees because the algorithm acts O(depth^arity), so a depth 6 link with 10 mean arity would blow up too.
Here's the code that's causing the problems (with extraneous parts excluded). From
Handle AtomSpace::fetch_atom(UUID uuid)
:
// Case 1:
Handle hb(_atom_table.getHandle(uuid));
if (_atom_table.holds(hb))
return hb;
// Case 2:
// We don't have the atom for this UUID, then go get it.
AtomPtr a(_backing_store->getAtom(uuid));
return _atom_table.add(a, false);
Note here from AtomPtr PGAtomStorage::getAtom(UUID uuid)
which includes the comment,
This method does not register the atom with any atom table.:
/**
* Create a new atom, retrieved from storage. This is a recursive-get;
* if the atom is a link, then the outgoing set will be fetched too.
* This may not be efficient, if you only wanted to get the latest TV
* for an existing atom!
*
* This method does *not* register the atom with any atomtable.
*/
PseudoPtr p(load_pseudo_atom_with_uuid(uuid));
HandleSeq oset;
for (UUID idu : p->oset)
oset.emplace_back(getAtom(idu)->getHandle());
LinkPtr link(createLink(p->type, oset, p->tv));
setAtomUUID( link, p->uuid );
return link;
The blowup in SQL queries appears to be caused the properties of the recursion happening inside the backing store's getAtom
call while the atoms are not added to the atom table during recursion, only at the very end where the top atom is added. So the bottom levels of the tree are loading again. It seems in looking at it that they'd only load once and you'd have the atom, but that's not what's going on. I'll put in some debugging traces for smaller cases and see what's really going on.
I don't know how common cases near this are right now, I need to look at some real world examples.
Can you suggest a workflow for building some typical large atoms? The nlp code?
This is actually one case where the Edges table really could potentially really help, especially if we stored the depth in the Edge. We used a similar scheme to handle data distribution for a massively distributed sales automation system for Nortel in the mid 90s. We'd flatten the table so that you could pull in all the atoms with one query. It takes a bit more space, but not more time if you do the flattening in another thread or process and you use it to load into a cache and have the case handled where the cache doesn't contain an atom which handles case where the flattening hasn't yet happened.
The real question is how important is single atom fetch performance?
Here's a clue for a depth of 5:
getAtom(59)
getAtom(58)
getAtom(57)
getAtom(56)
getAtom(55)
getAtom(53)
getAtom(54)
getAtom(53)
getAtom(55)
getAtom(53)
getAtom(54)
getAtom(56)
getAtom(55)
getAtom(53)
getAtom(54)
getAtom(53)
getAtom(57)
getAtom(56)
getAtom(55)
getAtom(53)
getAtom(54)
getAtom(53)
getAtom(55)
getAtom(53)
getAtom(54)
I can't reply in depth just right now, so a few short remarks:
-- "queried again" -- the truth value may have changed between last time and this time, so one must query whenever one is told to query. (in the multi-user scenario --- in a single-user scenario, it can't happen)
-- it is impossible to fully automate caching and fetch at the lowest layers, because one does not know what the highest layers are doing. If the highest layers said "fetch again", one must assume that is exactly what the higher layer wanted. You can't second-guess the higher layers.
-- however, atom-insertion into atom table, and atom fetching, are atomic. If a link has N children, and you fetch the link, there should be at most N+1 fetches, and anything else is broken. That is, fetching is a batch job.
-- some rainy day in the future, maybe we could add a "fetch-batch" API, where you give it M atoms, so that if they have N children combined, at most N+M atoms are fetched.
I mis-spoke: "in a single-user scenario, it can't happen" this is wrong -- one thread might be doing stuff that other threads are ignorant of, including storing and deleting atoms, while another thread is fetching the same atoms ... they may wink in and out of the atomspace over time
that is, the multi-thread case is a lot like the multi-user case.
If a link has N children, and you fetch the link, there should be at most N+1 fetches, and anything else is broken.
This is the behavior I'm fixing. For the above case, we have a link with 6 children, so there should be 7 queries (each getLink calls load_pseudo_atom_with_uuid
which does a straight SQL call). There are 28 SQL queries to fetch 7 atoms in this case.
And it goes up approaching N^2 for this case.
I mean approaching 2^N which is obviously much worse.
For the langage-learning code, I am pretty sure that no link is more than 5 deep. There may be some crazy BindLinks that are 7 or 8 deep, but those would never be normally stored in the database. The robot behavior code also never gets more than 5-6 deep, but I suppose someone might go wild and create structures that are significantly deeper -- say, up to 20. The robot behavior code uses DefineLink's to keep the trees relatively flat: if the DefineLinks didn't exist, than some of the nesting may get up to 12 or 16 deep. Hard to say.
At this time, I can't imagine any atom more than 20 deep. No, this is not like saying that I can't imagine why a compute would ever need more than 64KB ...
This is fixed with #725
After implementing storing into an Edges table so AtomSpace could support arbitrary sized links, I implemented a test for deep links, i.e. links with tree-depth (or height) of 1,000 to 10,000 links. Turns out you can't do that with our current
fetch_atom
algorithm. It breaks at about 20.Pseudo-code for this node creation algorithm which SEGFAULTs at 20:
The unit test prints the SQL query counts. It segfaults at depth 20.
Here is a table showing the queries (the Load uses the optimized height-based loading from the bottom up which is what gets called when you load an atomspace in scheme):
So the height-based loading is much much much better. And fetch? Won't work for deep links.
I'll fix this, but I wanted to check in the fix for the last bug before doing so, and start a discussion of reasonable cache sizes and options for fixing this degenerate case and leave a reminder to fix this issue.