tweag / nixpkgs-graph-explorer

Explore the nixpkgs dependency graph
MIT License

Consider swapping out Gremlin back-end #49

Open dorranh opened 1 year ago

dorranh commented 1 year ago

We currently use sqlg as our Gremlin Server "back-end", which has worked well for prototyping so far. However, the version of sqlg we are using (2.1.6) depends on an older version of TinkerPop (3.5.1), which is slowly becoming out of date. I tried upgrading to sqlg 3.0.0 yesterday, but it appears that sqlg is slowly moving away from supporting Gremlin Server in favor of having people import sqlg as a library in a JVM application (that is at least its primary use case). sqlg has also upgraded to Java 17, but TinkerPop does not fully support Java 17 yet, and it seems to be a ways from being able to upgrade to it.

I think it could be useful to explore alternative options for the Gremlin back-end to better future-proof nixpkgs-graph-explorer. Since we only interface with Gremlin Server itself, the back-end can be swapped out pretty freely without requiring code changes.

One alternative could be to use JanusGraph with one of its open source storage back-ends, e.g. ScyllaDB (docs). The downside is that by switching to a more specialized database we narrow the available options for managed services.

GuillaumeDesforges commented 1 year ago

We've faced performance issues using sqlg, which is another big motivation to swap the Gremlin back-end.

GuillaumeDesforges commented 1 year ago

Another idea I had would be to remove the graph database altogether in favor of a document-based approach.

My reasoning is as follows: our main concern is serving the dependency graph of a given derivation. We could build the whole closure of each derivation and store it as a document, so that each derivation has its own document. This works because extracting Nix data is idempotent/reproducible, so the resulting data structures are immutable. Thus, even if we need to update the data model (e.g. we add new fields), we can re-run the extraction and overwrite the existing documents.

Serving the closure of a derivation would then be straightforward: we return exactly the document of the derivation, no query resolution needed.
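To make the idea concrete, here is a minimal sketch of building one closure document per derivation. The `deps` mapping, the derivation names, and the document layout (`root`, `nodes`, `edges`) are all hypothetical and only for illustration; they are not the project's actual data model.

```python
from collections import deque


def build_closure(root: str, deps: dict[str, list[str]]) -> dict:
    """Return a self-contained document describing the closure of `root`.

    `deps` maps each derivation to its direct dependencies (hypothetical input).
    """
    seen = {root}
    queue = deque([root])
    edges = []
    # Breadth-first traversal over direct dependencies.
    while queue:
        node = queue.popleft()
        for dep in deps.get(node, []):
            edges.append({"from": node, "to": dep})
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return {"root": root, "nodes": sorted(seen), "edges": edges}


# Toy data standing in for the extracted nixpkgs graph (assumption).
deps = {
    "hello.drv": ["glibc.drv", "stdenv.drv"],
    "stdenv.drv": ["glibc.drv", "bash.drv"],
    "bash.drv": ["glibc.drv"],
    "glibc.drv": [],
}

# One document per derivation; serving a closure is then a plain lookup.
documents = {drv: build_closure(drv, deps) for drv in deps}
```

With this layout, answering a request for the closure of `hello.drv` amounts to returning `documents["hello.drv"]` as-is. The trade-off is duplication: shared dependencies such as `glibc.drv` appear in many documents.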

The overhead is the creation of documents, which should be manageable.

We could implement that quickly as JSON files kept on some storage back-end (thanks to fsspec, the storage could be basically anywhere: on the server's HDD or in the cloud).
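A rough sketch of the JSON-file storage could look like the following. To stay self-contained it uses only the standard library (`pathlib`, `json`) on a local temporary directory. The helper names, the file naming scheme, and the document contents are assumptions. In a real implementation the file access would go through fsspec instead, so the same logic could target cloud storage.

```python
import json
import tempfile
from pathlib import Path


def document_path(store: Path, derivation: str) -> Path:
    # Hypothetical naming scheme: one JSON file per derivation.
    return store / f"{derivation}.json"


def write_document(store: Path, derivation: str, document: dict) -> None:
    """Write (or overwrite) the closure document of a derivation."""
    store.mkdir(parents=True, exist_ok=True)
    document_path(store, derivation).write_text(json.dumps(document, sort_keys=True))


def read_document(store: Path, derivation: str) -> dict:
    """Serve a closure: load the stored document, no query resolution."""
    return json.loads(document_path(store, derivation).read_text())


# Local temporary directory standing in for "some storage" (assumption).
store = Path(tempfile.mkdtemp())

doc_v1 = {"root": "hello.drv", "nodes": ["glibc.drv", "hello.drv"]}
write_document(store, "hello.drv", doc_v1)

# After a data-model change, re-running extraction simply overwrites the file.
doc_v2 = {**doc_v1, "schema_version": 2}
write_document(store, "hello.drv", doc_v2)

served = read_document(store, "hello.drv")
```

Because extraction is reproducible, overwriting is safe: writing the same document twice leaves the stored file unchanged, and a new data model replaces the old file wholesale.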