quadstorejs / quadstore

A LevelDB-backed graph database for JS runtimes (Node.js, Deno, browsers, ...) supporting SPARQL queries and the RDF/JS interface.
https://github.com/quadstorejs/quadstore
MIT License
203 stars 14 forks source link

Optional IRI compaction in storage #91

Closed gsvarovsky closed 4 years ago

gsvarovsky commented 4 years ago

This is a speculative PR for the addition of a hook to allow compaction of IRIs in the storage back-end.

This originated in an attempt to improve performance when using Level.js in the browser. Actually it made no difference to transaction duration. However, it did reduce storage volume by about 40% in my use-case, and it may be useful to others.

The principle is as follows. When an RdfStore is created, you can optionally include a prefixes object, which is required to have two methods:

  1. expandTerm(term: string, expandVocab?: boolean): string, which is used to expand an IRI from what is stored.
  2. compactIri(iri: string, vocab?: boolean): string, which is used to compact an IRI ready for storage.

These methods intentionally follow the contract of the same methods in the Parser object from https://github.com/rubensworks/jsonld-context-parser.js (also convenient for me as I'm using JSON-LD extensively).

Note that I also removed some unused Term import/export methods, which were anyway generally incompatible with my approach since they did not specify the term position (S, P, or O).

If this is of interest, I'd be happy to add tests and documentation. However I'm aware that a big refactor is in the works, so I didn't want to go too far.

jacoscaz commented 4 years ago

@gsvarovsky thank you for your wonderful work. Yes, a big refactor is indeed in the works and I would be glad to merge this into what will come out of that. If you're ok with this, I would keep this PR open and get back to it in a bit, once I get around to consolidate version 7.0.0.

jacoscaz commented 4 years ago

@gsvarovsky if you're feeling brave, you might want to start having a look at the latest commit on the devel branch and rebase this one accordingly. Be aware that there are quite a few breaking changes which I have not had the time to document yet.

gsvarovsky commented 4 years ago

@jacoscaz not going to be possible this month, but I am very interested by the new query support so I will be inspecting the new branch as soon as I can. Hopefully it will be possible to rebase then. I'll let you know when that is! Cheers

jacoscaz commented 4 years ago

@gsvarovsky following up on my previous comment, the devel branch should have reached a decent level of API stability and the README.md file now includes almost all of the changes we have introduced in the codebase.

jacoscaz commented 4 years ago

@gsvarovsky following up on my previous comments, the master branch is now the active branch and API stability has improved further. I suggest rebasing on master.

gsvarovsky commented 4 years ago

Hi @jacoscaz I've (finally) got round to rebasing this PR. Points to note:

  1. To save passing the prefixes object around, I've packaged the serialisation functions into a class with the prefixes as a member. This makes for an exciting diff for serialization.ts, but only a few lines have changed materially. I suggest to use Hide Whitespace Changes.
  2. The necessity to handle SPARQL means we can't always know the quad position of a term. So I've removed the vocab parameter which used to apply only to predicates and types.

I still haven't added tests/docs, pending your review.

jacoscaz commented 4 years ago

@gsvarovsky thank you for your wonderful work. I'll do my best to review this as soon as possible. It looks good from a cursory look.

As for when to release this, you might have noticed that I have released a bunch of alpha versions to NPM. Once this is merged I would release another such version to NPM and then continue working on #103, #104 and #105 , which should take us to beta and then the long-awaited 7.0.0. Would this be ok with you?

gsvarovsky commented 4 years ago

I have a few more minor edits incoming, now that I have tested this in my code. Sorry that I didn't warn you. I will commit them now and annotate the changes.

jacoscaz commented 4 years ago

Further to my comments, could you please add a mention to the prefixes object in the README.md file? I'll merge this later this evening.

gsvarovsky commented 4 years ago

Just to be very clear: as mentioned in the description, this PR is speculative, for your consideration. Personally I am a little bit nervous that I am increasing your project's surface area (your API), without a big benefit – it does not make any difference to the transaction speed, only to storage, and that depends on the prefixes used. Up to you, of course.

If you decide we should go ahead, I will add some tests.

jacoscaz commented 4 years ago

I guess I should have said this in the first place - we should definitely go ahead with this!

A 40% reduction in storage usage with such a small addition to the API surface is definitely worth it in my books. Storage usage is one of quadstore's pain points when dealing with large datasets and, even though it wasn't designed with that kind of use cases in mind, getting better at handling that end of the spectrum is nothing to sneer at.

This does raise the question of how to write performant compaction / expansion functions. I suspect using a fixed prefix length might help. That and maybe procedurally generating the relevant JS code rather than loop on maps or arrays of prefixes. But, this is a different matter for a different time. For now, let's go ahead with this.

jacoscaz commented 4 years ago

Nice work!

jacoscaz commented 4 years ago

Published to NPM as quadstore@7.0.1-alpha.8!