opencog / atomspace

The OpenCog (hyper-)graph database and graph rewriting system
https://wiki.opencog.org/w/AtomSpace

Use concurrent (aka "lockless") hash map in the AtomTable #2553

Open linas opened 4 years ago

linas commented 4 years ago

The AtomTable uses a mutex to guard access to the TypeIndex. This mutex could be mostly avoided by using a concurrent hash map.
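
For orientation, the pattern in question looks roughly like this (an illustrative sketch, not the actual AtomTable code): every lookup takes a reader lock, and under heavily threaded workloads the lock itself becomes the hot spot.

    #include <cstdint>
    #include <memory>
    #include <shared_mutex>
    #include <unordered_map>

    // Illustrative sketch of the locked pattern under discussion; the
    // real AtomTable/TypeIndex differs in detail. "Handle" stands in
    // for the real smart-pointer type.
    struct Atom;
    using Handle = std::shared_ptr<Atom>;

    class LockedIndex
    {
        mutable std::shared_mutex _mtx;               // guards _index
        std::unordered_multimap<uint64_t, Handle> _index;

    public:
        Handle find(uint64_t hash) const
        {
            // Every reader takes this lock; a concurrent hash map
            // would make this lookup lock-free.
            std::shared_lock<std::shared_mutex> lck(_mtx);
            auto it = _index.find(hash);
            return (it == _index.end()) ? Handle() : it->second;
        }
    };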

AtomTable Status: After exploring this for a while, there's a problem. It seems that what AtomTable/TypeIndex really requires is a concurrent unordered multimap, with thread-safe erasure. There are 5 or 6 packages that provide concurrent hash maps, but only one provides a concurrent multimap: Intel TBB. Unfortunately, it does NOT provide a thread-safe erase. So it seems like a dead end! There's still one possibility: use a concurrent hash map, and store concurrent linked lists as the values. Yikes! Fixed in #2907.

Atom Status: The Atom (Atom.cc) stores Values in a std::map. It does NOT use a hash table, in order to keep Atoms as small as possible. Although the map could be replaced by a lockless hash table, this risks blowing up RAM usage. So: how bad would this be? How would it change the size of an Atom? Atom also uses an ordinary std::set for the incoming set. Same as before: we want a tree here, not a hash table, to keep the Atom as small as possible.
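
The size tradeoff, paraphrased (this is not the actual Atom.cc layout, and the key/value types are illustrative): a tree costs one small heap node per entry, while a hash table carries a bucket array even when it holds only a handful of entries, and there are millions of Atoms.

    #include <map>
    #include <memory>
    #include <mutex>
    #include <set>

    // Paraphrased sketch of the tradeoff; not the actual Atom.cc layout.
    struct Value;
    class Atom
    {
        // A std::map is a red-black tree: one heap node per entry and no
        // up-front bucket array. With millions of Atoms, each holding only
        // a few Values, this stays far smaller than an unordered_map.
        std::map<int, std::shared_ptr<Value>> _values;  // key type illustrative

        // Same reasoning for the incoming set: a tree, not a hash table.
        std::set<Atom*> _incoming;

        std::mutex _mtx;  // guards both; the lock this issue wants to remove
    };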

Overall status: This issue primarily affects users with highly threaded workloads on modern multi-core (more than 8 core) machines. It appears that the mutexes are NOT the primary bottleneck; instead, it's the CPU implementation of atomic increment. See issue #2758 for discussion. Conclude: a lockfree find() in a hashset could speed things up for truly demanding users, but we don't have users like that.

Lockfree comments: The notes below review the available options. We need two things: a lockfree hash set (for the TypeIndex), and a lockfree tree (i.e. something smaller and less RAM-intensive than a hash set, for the incoming set). Current options are thin, approaching zero: most of them are hard to use or fail to meet the requirements. Lockfree theory looks strong, but drop-in replacements for std::set and std::unordered_set are missing (Facebook's folly comes close). Documentation and benchmarks are lacking or inadequate. This is still the bleeding edge.
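
For the record, here is roughly what "folly comes close" means; a hedged sketch with illustrative key/value types. folly::ConcurrentHashMap offers lock-free reads behind a near-STL interface, though it is a map, not a set or multimap, so it is not quite a drop-in replacement.

    #include <folly/concurrency/ConcurrentHashMap.h>
    #include <cstdint>
    #include <string>

    int main()
    {
        // Reads are lock-free; writes use sharded locking internally.
        folly::ConcurrentHashMap<uint64_t, std::string> map;

        map.insert(42, "(Concept \"foo\")");

        auto it = map.find(42);     // safe from any number of threads
        if (it != map.cend())
            { /* use it->second */ }

        map.erase(42);              // erase is thread-safe, unlike TBB's
        return 0;
    }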

linas commented 4 years ago

Implementing this requires:

- Replace the mutex-guarded TypeIndex in the AtomTable with a concurrent hash map.
- Do the same for Atom.cc.

linas commented 4 years ago

Possibilities:

Each seems to have limitations and usability issues... See below for some commentary.

Starting with the simplest one, even if it is not the fastest, is probably the best/safest idea.

linas commented 3 years ago

This is important work: the current scheme bindings have a great deal of trouble parallelizing, and I think that this is due to lock collisions in the AtomTable.

No. See issue #2758: almost all of the parallelism problems are due to the CPU implementation of atomic increment. There could be some speedup from a lock-free TypeIndex (at least if the set find() is made lock-free), but this now seems less high-priority than I first imagined.

linas commented 3 years ago

What is actually needed is a drop-in replacement for std::unordered_multimap, and none of the solutions above provide that. Intel TBB does seem to provide one! However, the TBB version does not provide a thread-safe erase(), and we need that.
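
To make the limitation concrete, a hedged sketch against the TBB interface as I understand it (key/value types illustrative): concurrent insertion and lookup are fine, but erasure is spelled unsafe_erase precisely because it must not run concurrently with other operations.

    #include <tbb/concurrent_unordered_map.h>
    #include <cstdint>

    int main()
    {
        // Concurrent insert/find/equal_range are thread-safe.
        tbb::concurrent_unordered_multimap<uint64_t, int> index;
        index.insert({0xdeadbeef, 1});
        index.insert({0xdeadbeef, 2});   // multimap: duplicate keys allowed

        auto range = index.equal_range(0xdeadbeef);
        for (auto it = range.first; it != range.second; ++it)
            { /* visit it->second */ }

        // Erasure is NOT thread-safe: unsafe_erase() must not run
        // concurrently with inserts or lookups -- the missing piece.
        index.unsafe_erase(0xdeadbeef);
        return 0;
    }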

Comments about the concurrent hash maps mentioned above.

linas commented 2 years ago

Background reading:

linas commented 2 years ago

Idea: in TypeIndex.h, instead of declaring

typedef std::unordered_multimap<ContentHash, Handle> AtomSet;

do NOT use ContentHash; use the string s-expression of the Atom instead! This then becomes a plain map, instead of a multimap... except that the string version cannot deal with alpha-equivalent expressions. Bummer.

This could be solved by having a flag on each atom: "contains alpha terms". If set to true, fall back to a lock plus content-hash; else use the lock-free lookup! That might work!

It would require a fast atom-to-short-string function, however. (This could be made fast by caching the substrings in RAM, but that blows up RAM usage even more.)
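
A rough sketch of the idea. Every name here is hypothetical (contains_alpha_terms(), to_short_sexp(), and the index itself are assumptions, not existing AtomSpace API):

    #include <folly/concurrency/ConcurrentHashMap.h>
    #include <cstdint>
    #include <memory>
    #include <mutex>
    #include <string>

    // Entirely hypothetical sketch; none of these names exist in the
    // AtomSpace today. "Handle" stands in for the real smart pointer.
    struct Atom
    {
        bool contains_alpha_terms() const;   // the proposed per-atom flag
        uint64_t get_hash() const;           // the existing content hash
        std::string to_short_sexp() const;   // the fast atom->string function
    };
    using Handle = std::shared_ptr<Atom>;

    class SexpIndex
    {
        folly::ConcurrentHashMap<std::string, Handle> _sexp_map;
        std::mutex _mtx;
        Handle locked_find_by_hash(uint64_t);   // old lock-plus-hash path

    public:
        Handle find_atom(const Handle& h)
        {
            // Alpha-convertible atoms fall back to lock + content-hash.
            if (h->contains_alpha_terms())
            {
                std::lock_guard<std::mutex> lck(_mtx);
                return locked_find_by_hash(h->get_hash());
            }
            // All other atoms have a unique s-expression string, so a
            // lock-free concurrent-map lookup suffices.
            auto it = _sexp_map.find(h->to_short_sexp());
            return (it == _sexp_map.cend()) ? Handle() : it->second;
        }
    };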

FIXED in #2902: the multimap design was insane.

linas commented 2 years ago

In TypeIndex.h, there was no reason for having an unordered multimap. I removed it in pull req #2902, replacing it with an unordered set. This now opens the door for concurrent data structures. The TypeIndex set replacement needs to support:

- thread-safe insert and erase;
- a lock-free (or at least low-contention) find().

After the cleanup of #2907 (and the four prior pull reqs) experimenting with alternatives is now possible.

linas commented 2 years ago

Confusion reigns: I recompiled both Atom.h and TypeIndex.h with the locks completely disabled, then ran the opencog/benchmark bio-loop3.scm benchmark, and saw only minor (almost unnoticeable) parallelism improvements. So whatever is causing the lock contention, it's not these two.

To double-check these results: using folly::ConcurrentHashMap in TypeIndex (and removing the locks) also has no effect on parallelism.

linas commented 2 years ago

Some mutrace experimental results.

For the locked version: the lock in TypeIndex::findAtom() is hit 19 million times, and this accounts for almost all of the serialized (non-parallelizable) time in the parallel bio-loop3.scm benchmark. Most of the calls are coming from PatternMatchEngine.cc:1337, this snippet of code:

         // Yuck. What we really want to do here is to find out
         // if `Link(t, oset)` is in the incoming set of `hg`. But
         // there isn't any direct way of doing this (at this time).
         Handle hup(_pmc.get_link(hg, t, std::move(oset)));

and _pmc.get_link() just calls as->get_link(), which calls TypeIndex::findAtom().

The unlocked versions show no lock contention. Conclude: it is not lock contention that is slowing us down.

Latest theory: the smart pointers and other assorted atomics in the code are used so frequently that they cause cache-line contention on the locks. (CPUs implement atomics inside of cache lines.) Bingo! That's exactly it! Things parallelize nicely on modern CPUs! See issue #2758 for a general discussion.
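
A minimal demonstration of the effect (not AtomSpace code): threads that do nothing but copy the same std::shared_ptr serialize on the atomic reference count in its control block, since every copy is an atomic increment on the same cache line.

    #include <chrono>
    #include <cstdio>
    #include <memory>
    #include <thread>
    #include <vector>

    // Minimal demo (not AtomSpace code): copying a shared_ptr does an
    // atomic increment/decrement on one shared control block, so all
    // threads contend on the same cache line -- no mutex in sight.
    int main()
    {
        auto sp = std::make_shared<int>(42);

        auto worker = [&sp]() {
            for (int i = 0; i < 10'000'000; i++)
            {
                std::shared_ptr<int> copy(sp);   // atomic ref-count bump
                (void) copy;                     // destroyed: atomic drop
            }
        };

        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int i = 0; i < 8; i++) threads.emplace_back(worker);
        for (auto& t : threads) t.join();
        auto t1 = std::chrono::steady_clock::now();

        // On many CPUs this scales badly with thread count.
        printf("elapsed: %ld ms\n", (long)
            std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
        return 0;
    }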