[OEP 18] New Graph Model

lvca commented 7 years ago

Summary: Create a new graph model that better suites cases with massive insertion than the current one. It will be not a replacement, but rather an alternative to choose based on the OrientDB's Multi-Model philosophy to be flexible enough to adapt to many different use cases.

Goals: 100% compatible with the TinkerPop Graph API (like the existent one) but faster on insertion.

Non-Goals: Some other operations could be faster.

Success metrics: Improved insertion speed, especially the edges/sec.

Motivation: Current Graph Model uses RidBags that doesn't scale very well on insertion and they are hard to distribute/shard.

Description: OrientDB saves the outgoing and incoming edges as RIDBag (collection) of RIDs. RIDBags don't scale very well because:

the Tree version is not supported in distributed mode (HA)
when transformed into a tree, it suffers from concurrency problem because an exclusive lock
insertion of a new RID means a rebalance of the tree = more expensive than a linked list
hard to maintain and debug due to the complexity and poor documentation

So the solution would be avoiding to store the links to the edges inside the vertices, but rather use a NOT UNIQUE index against the edge's IN and OUT properties. In this way to traverse the edges, a lookup on the index is needed.

The lookup would return the collection of RIDs as edges. At this point, every create edge operation doesn't touch any vertices, but only edges are created.

Lightweight edges are not possible with this model because the edge's document is needed to update the index.

Alternatives: The current model.

Risks and assumptions:

No performance improvement
In order to use the new graph layout a migration is needed

Impact matrix

[ ] Storage engine
[ ] SQL
[ ] Protocols
[ ] Indexes
[ ] Console
[X] Java API
[ ] Geospatial
[ ] Lucene
[ ] Security
[ ] Hooks
[ ] EE

lvca commented 7 years ago

We created a POC, but the results weren't promising as expected. The cost of updating 2 composite indices (out/in and in/out) is too high and the final result is about 20% faster on write of the graph and more than 50% slower on traversing the graph. The interesting thing is that traversing vertices jumping the edges (with out()) is about 50% faster because edges are not loaded at all: the outgoing vertex is contained in the composite index key.

We're going to proceed with storing the vertex rid in the RIDBag and maintaining current graph layout. However, we still need a way to support tree ridbags with distributed.

smolinari commented 7 years ago

Hi Luca,

This is just a big shot in the dark on my part, but couldn't using an inverted index be a feasible way to distribute ridbags? Sort of like how DNS works, but as an internal index? It could theoretically also solve the super node issue too. Maybe? :smile:

Scott

orientechnologies / orientdb-labs

[OEP 18] New Graph Model #18