quadstorejs / quadstore

A LevelDB-backed graph database for JS runtimes (Node.js, Deno, browsers, ...) supporting SPARQL queries and the RDF/JS interface.
https://github.com/quadstorejs/quadstore
MIT License

BACKEND: Bypass literal indexing? #119

Closed: gsvarovsky closed this issue 3 years ago

gsvarovsky commented 3 years ago

I'd like to discuss alternatives for solving a performance issue I have. I would be very happy to submit a PR once we have decided on the best approach, or to work with you on one. Quadstore carefully balances performance with features (great job!), and this should enhance it further.

Root Problems

  1. IndexedDB performance is terrible. Irrespective of batching, I observe individual key writes at >1ms (including transaction commit). Using 6 indexes, it only takes 16 triples to reach a 100ms transaction time.
  2. With the best literal encoding possible, indexing means that a literal value is repeated multiple times in storage.

General Affected Scenarios

  1. Storage of moderate to high numbers of literals that do not require indexing.
  2. Storage of moderately sized to large literals, e.g. string or binary.

My Particular Scenarios

Journaling

m-ld uses one graph for the user data, and another graph for journaling. The journal is an append-only log of transactions, to support recovery of peer clones. I can truncate the journal periodically, but in a domain with very volatile data this would have to be unacceptably aggressive.

Not all properties of a journal entry need to be indexed in the graph for querying. In fact there is only one property that is ever matched in a query.

Note that it is necessary, for data safety, to update the journal atomically with user data updates.

User Scenarios

Trivially, as m-ld exposes a graph API, my users' data can be affected by the same issues.

Current Workarounds

  1. I customise the indexes (reducing them from 6 to 3) because I never query across the user data and the journal data (the graph is always included in the match pattern).
  2. I bundle some properties of the journal into a base64Binary literal containing a JSON object. This helps by reducing the count of triples in each transaction. However, it does not help with storage volume, because the bundled properties are still repeated in the graph indexes.

My Proposed Alternative Solutions

Non-Quad Data

We could add a way to include non-quad key-value pairs in the batch write APIs. This would allow an API user to perform additional data persistence operations atomically with quad writes, while applying their own optimisations.

Opinion: I think this could be an unacceptable rupture in quadstore's data abstraction. Also, it would not help me with User Scenarios unless I also broke my own abstraction!

Omit Indexing for Some Literals

We could add a configurable way to disable indexing for selected literals. Selection could be configured by data type (e.g. base64Binary/hexBinary), and/or by data size (e.g. >32 bytes), or any selector based on a quad literal. Affected literal values would only appear once, in a backend value. In trade-off, the value would not be queryable, only retrievable.
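
For illustration, the selector part could be as simple as a predicate over literal terms; the option name and shapes below are just a sketch, not an existing quadstore API:

type LiteralLike = { value: string; datatype: { value: string } };

// Hypothetical selector: returns true for literals whose values should NOT be
// indexed (configured here by datatype and by size, as suggested above).
const noIndexLiteral = (literal: LiteralLike): boolean =>
  literal.datatype.value.endsWith('#base64Binary') ||
  literal.datatype.value.endsWith('#hexBinary') ||
  literal.value.length > 32;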

Opinion: this is my preferred approach, although I'm not fully sure yet how to implement it.

Use Hash Indexes for Some Literals

We could add a configurable way to use hash values for indexing of selected literals. As above, selection could be by any selector based on a quad literal. The unhashed value would only appear once, in a backend value. It would still be possible to exact-match a literal, but not to range-match it.

Note that the hash could be strong (cryptographic, slow) or weak (fast). The latter would require a post-filter on query results to remove hash-colliding values.
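
To illustrate the post-filter needed in the weak-hash case, a rough sketch (the shapes below are placeholders, not quadstore types):

// Sketch: after an exact-match lookup on a weak hash, drop candidate quads
// whose stored literal does not actually equal the queried value (collisions).
type QuadLike = { object: { value: string } };

function dropHashCollisions(candidates: QuadLike[], queriedValue: string): QuadLike[] {
  return candidates.filter((quad) => quad.object.value === queriedValue);
}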

Opinion: the complexity of this option is off-putting. It feels like there would have to be quite a bit of analysis of the trade-offs. However, it is possible that m-ld may require this option in future anyway (for reasons that would need another essay!).

jacoscaz commented 3 years ago

Hello @gsvarovsky !

IndexedDB performance is terrible. Irrespective of batching, I observe individual key writes at >1ms

That is definitely unexpected and kinda worrying, even more so if batching doesn't seem to affect these latencies. On one side, it reminds me that we should really add a way to run unit and performance tests in the browser automatically to keep track of improvements and regressions. On the other side, perhaps we should have a closer look at potential alternatives for browser-side storage, or storage alternatives in general. Conceptually, quadstore evolved from the approach first taken by levelgraph, but there's no reason we must stick to that approach if something better comes along. Similarly, we could use something other than level-js.

Storage of moderate to high numbers of literals that do not require indexing. Storage of moderately sized to large literals, e.g. string or binary.

Interesting use cases!

Non-Quad Data

Breaking the abstraction does not sound that good to me either, but I would still like to consider it as it might be the simplest and most performant option.

Omit Indexing for Some Literals

quadstore does not index single terms but entire quads. Presumably, and I'm just brainstorming here, disabling indexing for a given kind of literal would mean forcing quadstore to add quads matching the criteria only to a subset of its configured indexes. This might slightly lower the latencies you are experiencing, but you've already played with dropping indexes; I don't think this would fare any better than that.

Use Hash Indexes for Some Literals

Funny that you mention this, as a while ago I experimented with a generalized version of the approach you describe that would deduplicate all terms, basically leading quads to become tuples of references to a terms index. A smart use of (configurable) caching helped in limiting the performance impact of having lookups-on-write. However, the big issue here is the complexity of implementing multi-filter queries, which quickly leads back to per-quad indexing, thus making the whole effort kinda moot.

I wonder if we could apply this to literals only: we could have a literals index with keys being length-limited, lossy, sortable encodings of literals, which we would also use in place of the actual literal representation in the standard indexes. However, note that while this would probably eliminate the issue of repeating large literals multiple times, it would actually worsen write performance, as you'd have an additional index on top of the others. Keeping such an index in sync with the others might also turn out to be a non-trivial task.

All in all, it sounds to me like the root problems, literal duplication and write performance, must be addressed separately.

The latter should probably be addressed with a replacement for level-js, perhaps something that writes to memory and then dumps the whole thing to localStorage (snapshotting?).
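
A very rough sketch of that snapshotting model (names and flush interval are illustrative; a real replacement would also need to handle crash-safety between flushes):

// Sketch only: writes go to an in-memory Map and are periodically flushed to
// localStorage as a single JSON snapshot.
const SNAPSHOT_KEY = 'quadstore-snapshot'; // illustrative key
const memory = new Map<string, string>(
  JSON.parse(localStorage.getItem(SNAPSHOT_KEY) ?? '[]')
);

function batchPut(entries: Array<[string, string]>): void {
  for (const [key, value] of entries) memory.set(key, value);
}

function snapshot(): void {
  localStorage.setItem(SNAPSHOT_KEY, JSON.stringify([...memory.entries()]));
}

setInterval(snapshot, 5000); // flush every 5 seconds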

The former should probably be addressed via a dedicated index, perhaps even limited to literals exceeding a certain length.

gsvarovsky commented 3 years ago

Thanks for considering, @jacoscaz! OK, taking the literal duplication first.

Here is an idea for a bit more detail on the "omit indexing" option. For brevity, I'll only consider my 3 indexes. At present, a quad <<S P O G>> looks like this in the backend:

G,S,P,O : <<S P O G>>
G,O,S,P : <<S P O G>>
G,P,O,S : <<S P O G>>

The solution involves noticing the presence of the binary literal in O when writing the quad, and truncating the index representation suitably, thus:

G,S,P,  : <<S P O G>>
G,      : <<S P O G>>
G,P,    : <<S P O G>>

Then notice that G, is redundant and remove that too. A similar operation on the default 6 indexes reduces them to 4. (This also addresses write performance, a bit.)

When matching, the inclusion of a binary literal in the match pattern simply errors immediately. Otherwise, I have the possible index selection options (noting that I always include a Graph term if anything else is present, hence my index choice):

Pattern    Index
_,_,_,_    G,S,P,O (first found)
G,S,_,_    G,S,P,O
G,S,P,_    G,S,P,O
G,_,P,_    G,P,O,S
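
To make the layouts above concrete, here is a rough sketch of building those three index keys and choosing an index for a pattern (the separator and helper names are illustrative, not quadstore internals):

// Sketch: compose the G,S,P,O / G,O,S,P / G,P,O,S keys shown above for a quad.
const SEP = '\u0000'; // illustrative separator
type Q = { subject: string; predicate: string; object: string; graph: string };

function indexKeys({ subject: s, predicate: p, object: o, graph: g }: Q): string[] {
  return [
    [g, s, p, o].join(SEP), // G,S,P,O
    [g, o, s, p].join(SEP), // G,O,S,P
    [g, p, o, s].join(SEP), // G,P,O,S
  ];
}

// Index selection per the table above: G,_,P,_ goes to GPOS, everything else
// (including the fully unbound pattern) can use GSPO.
function pickIndex(pattern: Partial<Q>): 'GSPO' | 'GPOS' {
  return pattern.graph && pattern.predicate && !pattern.subject ? 'GPOS' : 'GSPO';
}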

gsvarovsky commented 3 years ago

With regard to dropping level-js: personally I need IndexedDB because I need atomic transactions, which would be very hard to do in the other common browser option, local storage. WebSQL is very poorly supported. So the only motivation would be if level-js has some problems itself.

According to some reports, the batch approach in level-js, which waits for each put or del to complete before submitting the next one, is the wrong thing to do. But I've experimented with that and doing them all at once only makes a tiny difference to the whole transaction duration.
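
For reference, the "all at once" variant looks roughly like this: queue every put against the object store and only wait for the transaction to complete (the store name is illustrative):

// Sketch: issue all puts in a single IndexedDB transaction without awaiting
// each request, resolving only when the whole transaction commits.
function batchPut(db: IDBDatabase, entries: Array<[string, string]>): Promise<void> {
  return new Promise((resolve, reject) => {
    const tx = db.transaction('quads', 'readwrite'); // 'quads' is illustrative
    const store = tx.objectStore('quads');
    for (const [key, value] of entries) store.put(value, key);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}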

jacoscaz commented 3 years ago

Hello again!

WRT omitting indexing, I fear truncating index representations would lead to issues with quads overriding one another. Let's assume index GSP and quads (s1, p1, o1, g1) and (s1, p1, o2, g1). Both of these quads would be represented as something like g1-s1-p1, leading the store to replace the one indexed first with that indexed later on. This is the reason why quadstore does not support partial indexes. I suspect there's a way to counter this by adding some sort of hashed version of the entire quad at the end of the index representation but that would considerably slow down insertion, more so with large literals involved.

One relatively simple and very flexible feature we could add is the option of passing completely arbitrary (key, value) pairs together with quads in all relevant operations (put, multiPut, patch, multiPatch) and new dedicated methods to retrieve those values. Projects such as m-ld would then be able to use this as the foundation for custom functionality. Does this sound like something that could help?
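
Purely to sketch the shape this could take (all names here are hypothetical, not a commitment to an actual API):

// Hypothetical shape only: arbitrary (key, value) entries written atomically in
// the same batch as the quads, plus a dedicated getter for the raw values.
type Quad = object; // stand-in for an RDF/JS quad

interface KeyValueBatchOpts {
  keyValues?: Array<[string, string]>; // written in the same atomic batch
}

interface KeyValueAwareStore {
  multiPut(quads: Quad[], opts?: KeyValueBatchOpts): Promise<void>;
  getKeyValue(key: string): Promise<string | undefined>;
}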

jacoscaz commented 3 years ago

WRT performance, yes, reimplementing atomic transactions would definitely be a non-trivial endeavor.

I can't wrap my mind around the numbers you've encountered; I routinely see 15k quads/second in my imports, which translates to 15 quads/millisecond, meaning that going through level-js and IndexedDB is resulting in a difference of whole orders of magnitude. Could you run the dist/perf/loadfile.js script with the 21million.rdf file on the same machine and report what you get?

cat 21million.rdf | head -n 1000000 > 1million.rdf
node dist/perf/loadfile.js ./1million.rdf

jacoscaz commented 3 years ago

Have you tried profiling in the browser to figure out what functions are taking the most time?

gsvarovsky commented 3 years ago

Have you tried profiling in the browser to figure out what functions are taking the most time?

Yes, that's actually where I'm getting my timings from, in Chrome. It appears to be faster in Firefox but the performance tab is not as easy to parse.

gsvarovsky commented 3 years ago

node dist/perf/loadfile.js ./1million.rdf
TIME: 104913 s
DISK: 524912

gsvarovsky commented 3 years ago

truncating index representations would lead to issues with quads overriding one another

Ah, of course, sorry. We could proceed by only hashing the selected big literals in the indexes, not the whole quad. It has the advantage that the big literals are then findable by exact-match. (This means this proposal actually becomes "Use Hash Indexes".) No need to hash anything that isn't a big literal.

If H is the hash of O:

G,S,P,H : <<S P O G>>
G,H,S,P : <<S P O G>>
G,P,H,S : <<S P O G>>

I've been looking at SubtleCrypto in the browser and the indications are that it's really fast. The trade-off of storage against CPU would be tunable by configuration.
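
For reference, producing H with SubtleCrypto takes only a few lines (SHA-256 and the hex encoding here are illustrative choices):

// Sketch: hash a big literal's value and use the digest in place of O when
// composing an index key such as G,S,P,H.
async function literalHash(value: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(value));
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}

async function gsphKey(g: string, s: string, p: string, o: string): Promise<string> {
  return [g, s, p, await literalHash(o)].join(','); // ',' mirrors the layout above
}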

Looking at this might also help with #95, if you treat a quad as analogous to a 'big literal' (although matching inside the nested triple would not be straightforward; it would need a join).

The arbitrary key-value pairs would not help a user of mine who has a big literal value, which is quite likely with text fields.

gsvarovsky commented 3 years ago

Just thinking about this a little more. With an arbitrary key-value interface I could implement my own hash-based index, and only reveal the hash values to quadstore, instead of the original literals. That could be used to remove all duplication.

By the way, I am already hashing all (user) triples anyway, for my own purposes (which is why I'm looking at SubtleCrypto).

jacoscaz commented 3 years ago

That could be used to remove all duplication.

Yeah, that's what I had in mind. However, I also agree that hashing exceedingly long terms would be a good feature.

These two are probably more orthogonal than they are overlapping. I guess you’d be even better off with both?

jacoscaz commented 3 years ago

I guess I would start with the key/value interface as I find that more generic and flexible. Should that turn out not to work, I would then move to implement hashing directly in the store. What do you think?

jacoscaz commented 3 years ago

We've just merged atomic writes of quads coupled with custom key-value pairs (see #120). @gsvarovsky how are things looking on your end? Do you already have an idea of whether that will be enough to address the duplication issues you are experiencing with m-ld?

gsvarovsky commented 3 years ago

As expected, using raw KVPs instead of quads for my journal has removed all duplication overhead in storage. 💯

  1. Very strangely, and unhappily, I'm still seeing very similar total batch time in IndexedDB for a user transaction including (atomic) journaling. I'll be looking into that in more detail.
  2. I have not yet attempted to implement a hash index in order to extend the benefit to user quads, but this is a lower priority for me and I will consider it later.

jacoscaz commented 3 years ago

As expected, using raw KVPs instead of quads for my journal has removed all duplication overhead in storage. 💯

That's great to hear!

Very strangely, and unhappily, I'm still seeing very similar total batch time in IndexedDB for a user transaction including (atomic) journaling. I'll be looking into that in more detail.

Having done some testing on IndexedDB in the last few days, that doesn't come across as terribly surprising (unfortunately). In my testing, which admittedly is still rather limited, the relative ordering of the keys being added to the store correlates strongly with the overall latency of the batch. Writing values with keys sorted lexicographically (e.g. aaa, aaab, aaac, aba, abc, abf, ...) seems to be a lot faster than writing values with sparse/unsorted keys, particularly in Chrome.
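
To reproduce the ordering effect on any machine, one can time a batch of lexicographically ordered keys against a shuffled copy, roughly like this (the store name and key shape are illustrative):

// Sketch: time a write-only IndexedDB transaction for a given key ordering.
async function timeBatch(db: IDBDatabase, keys: string[]): Promise<number> {
  const start = performance.now();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction('bench', 'readwrite'); // 'bench' is illustrative
    for (const key of keys) tx.objectStore('bench').put('value', key);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
  return performance.now() - start;
}

const ordered = Array.from({ length: 1000 }, (_, i) => `key${String(i).padStart(6, '0')}`);
const shuffled = [...ordered].sort(() => Math.random() - 0.5); // crude shuffle, fine for a sketch
// Compare timeBatch(db, ordered) against timeBatch(db, shuffled).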

Unfortunately, due to how our indexing works, I don't think there would be a way to fix this without switching to non-atomic writes, which is obviously a deal-breaker in almost every use case. I really need to find a way to run those performance tests on the browser side, but I do think that switching to an AOF + DUMP persistence model (like redis does), instead of hitting IndexedDB directly, might be the only way forward in cases of really bad performance.

I've opened a dedicated issue for browser-side performance (#121) and will now close this one. Please do comment there with your use case and as detailed a description as you can of your symptoms and findings so far.

I have not yet attempted to implement a hash index in order to extend the benefit to user quads, but this is a lower priority for me and I will consider it later.

Sounds good!