Even trivial links of height 20 have a degenerate SQL fetch with SEGFAULT

inflector commented 8 years ago

After implementing storing into an Edges table so AtomSpace could support arbitrary sized links, I implemented a test for deep links, i.e. links with tree-depth (or height) of 1,000 to 10,000 links. Turns out you can't do that with our current fetch_atom algorithm. It breaks at about 20.

Pseudo-code for this node creation algorithm which SEGFAULTs at 20:

create node_1
create node_2
first_out = node_1
second_out = node_2

for 0 to depth:
    link = create_link(LINK, {first_out, second_out})
    link_uuids.add(link.uuid)
    second_out = first_out
    first_out = link

last_link = first_out

# Fetch Last
atomspace.fetch(last_link)

# Fetch All
for link in link_uuids:
    atomspace.fetch(link)

The unit test prints the SQL query counts. It segfaults at depth 20.

Here is a table showing the queries (the Load uses the optimized height-based loading from the bottom up which is what gets called when you load an atomspace in scheme):

Depth	Fetch Last Queries	Fetch All Queries	Load Queries
15	3,208	8,341	32
16	5,183	13,508	34
17	8,378	21,869	36
18	13,547	35,398	38
19	21,910	57,289	40

So the height-based loading is much much much better. And fetch? Won't work for deep links.

I'll fix this, but I wanted to check in the fix for the last bug before doing so, and start a discussion of reasonable cache sizes and options for fixing this degenerate case and leave a reminder to fix this issue.

linas commented 8 years ago

Let me make a wild guess as to the crash:

when you go to insert a link in the atomspace, it tries to make sure that the children are already in the atomspace. It does this by making recursive call. If each stack frame is 3KBytes and you have a stack that is 64KBytes deep, then you are probably blowing out your stack.

inflector commented 8 years ago

That's certainly the source of the blowup. The issue is that all the lower trees are loaded again and again and again. It never stores any of the intermediate values. This wasn't obvious to me as I was looking at the code until I saw this issue. It's pretty subtle what's going one. It queries for every atom at every level again, even though they were already retrieved. It's clear from the behavior that something like this is happening but not where and why exactly as it seems like the code will only fetch each atom once.

I'm not sure where the segfault is, but I didn't look yet. What concerned me was the time it took to do it as that's likely to affect much smaller cases where there are large numbers of atoms, i.e. shallower but wider trees because the algorithm acts O(depth^arity), so a depth 6 link with 10 mean arity would blow up too.

Here's the code that's causing the problems (with extraneous parts excluded). From Handle AtomSpace::fetch_atom(UUID uuid):

    // Case 1:
    Handle hb(_atom_table.getHandle(uuid));
    if (_atom_table.holds(hb))
        return hb;

    // Case 2:
    // We don't have the atom for this UUID, then go get it.
    AtomPtr a(_backing_store->getAtom(uuid));

    return _atom_table.add(a, false);

Note here from AtomPtr PGAtomStorage::getAtom(UUID uuid) which includes the comment, This method does not register the atom with any atom table.:

/**
 * Create a new atom, retrieved from storage. This is a recursive-get;
 * if the atom is a link, then the outgoing set will be fetched too.
 * This may not be efficient, if you only wanted to get the latest TV
 * for an existing atom!
 *
 * This method does *not* register the atom with any atomtable.
 */

   PseudoPtr p(load_pseudo_atom_with_uuid(uuid));

    HandleSeq oset;
    for (UUID idu : p->oset)
        oset.emplace_back(getAtom(idu)->getHandle());

    LinkPtr link(createLink(p->type, oset, p->tv));
    setAtomUUID( link, p->uuid );
    return link;

The blowup in SQL queries appears to be caused the properties of the recursion happening inside the backing store's getAtom call while the atoms are not added to the atom table during recursion, only at the very end where the top atom is added. So the bottom levels of the tree are loading again. It seems in looking at it that they'd only load once and you'd have the atom, but that's not what's going on. I'll put in some debugging traces for smaller cases and see what's really going on.

I don't know how common cases near this are right now, I need to look at some real world examples.

Can you suggest a workflow for building some typical large atoms? The nlp code?

This is actually one case where the Edges table really could potentially really help, especially if we stored the depth in the Edge. We used a similar scheme to handle data distribution for a massively distributed sales automation system for Nortel in the mid 90s. We'd flatten the table so that you could pull in all the atoms with one query. It takes a bit more space, but not more time if you do the flattening in another thread or process and you use it to load into a cache and have the case handled where the cache doesn't contain an atom which handles case where the flattening hasn't yet happened.

The real question is how important is single atom fetch performance?

inflector commented 8 years ago

Here's a clue for a depth of 5:

getAtom(59)
    getAtom(58)
        getAtom(57)
            getAtom(56)
                getAtom(55)
                    getAtom(53)
                    getAtom(54)
                getAtom(53)
            getAtom(55)
                getAtom(53)
                getAtom(54)
        getAtom(56)
            getAtom(55)
                getAtom(53)
                getAtom(54)
            getAtom(53)
    getAtom(57)
        getAtom(56)
            getAtom(55)
                getAtom(53)
                getAtom(54)
            getAtom(53)
        getAtom(55)
            getAtom(53)
            getAtom(54)

linas commented 8 years ago

I can't reply in depth just right now, so a few short remarks:

-- "queried again" -- the truth value may have changed between last time and this time, so one must query whenever one is told to query. (in the multi-user scenario --- in a single-user scenario, it can't happen)

-- it is impossible to fully automate caching and fetch at the lowest layers, because one does not know what the highest layers are doing. If the highest layers said "fetch again", one must assume that is exactly what the higher layer wanted. You can't second-guess the higher layers.

-- however, atom-insertion into atom table, and atom fetching, are atomic. If a link has N children, and you fetch the link, there should be at most N+1 fetches, and anything else is broken. That is, fetching is a batch job.

-- some rainy day in the future, maybe we could add a "fetch-batch" API, where you give it M atoms, so that if they have N children combined, at most N+M atoms are fetched.

linas commented 8 years ago

I mis-spoke: "in a single-user scenario, it can't happen" this is wrong -- one thread might be doing stuff that other threads are ignorant of, including storing and deleting atoms, while another thread is fetching the same atoms ... they may wink in and out of the atomspace over time

linas commented 8 years ago

that is, the multi-thread case is a lot like the multi-user case.

inflector commented 8 years ago

If a link has N children, and you fetch the link, there should be at most N+1 fetches, and anything else is broken.

This is the behavior I'm fixing. For the above case, we have a link with 6 children, so there should be 7 queries (each getLink calls load_pseudo_atom_with_uuid which does a straight SQL call). There are 28 SQL queries to fetch 7 atoms in this case.

inflector commented 8 years ago

And it goes up approaching N^2 for this case.

inflector commented 8 years ago

I mean approaching 2^N which is obviously much worse.

linas commented 8 years ago

For the langage-learning code, I am pretty sure that no link is more than 5 deep. There may be some crazy BindLinks that are 7 or 8 deep, but those would never be normally stored in the database. The robot behavior code also never gets more than 5-6 deep, but I suppose someone might go wild and create structures that are significantly deeper -- say, up to 20. The robot behavior code uses DefineLink's to keep the trees relatively flat: if the DefineLinks didn't exist, than some of the nesting may get up to 12 or 16 deep. Hard to say.

At this time, I can't imagine any atom more than 20 deep. No, this is not like saying that I can't imagine why a compute would ever need more than 64KB ...

inflector commented 8 years ago

This is fixed with #725

opencog / atomspace

Even trivial links of height 20 have a degenerate SQL fetch with SEGFAULT #698