orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0
4.72k stars 870 forks

Add support for dynamic edge labels in Multi-model API similar to Tinker 2.x integration #8014

Closed metametametameta closed 2 years ago

metametametameta commented 6 years ago

OrientDB Version: 3.0.0 RC1

Java Version: Java 8

OS: Windows 8.1

Expected behavior

We're currently migrating to Orient 3.x and are rewriting our code to use the multi-model API. Previously, we mainly used the Tinkerpop 2.x Orient integration. We'd like to exclusively use the Orient Native APIs now. We also use native Orient (graph) queries heavily.

We're running into a major issue with "edge classes" in the multi-model API.

Our application tends to have a huge number of possible edge labels - so it's not feasible to create an edge class for each such label. In Tinkerpop 2.x, we were able to use dynamic labels quite easily using the following configuration:

ALTER DATABASE custom useClassForEdgeLabel=false
ALTER DATABASE custom useVertexFieldsForEdgeLabels=true

This way, in Tinkerpop 2.x, we only have the single default "E" class - and are able to create additional edges with whatever labels (at runtime) that do not require edge classes. Moreover, the native Orient queries using outE('some label') or the Tinkerpop 2.x Vertex/Edge APIs or similar work quite well as only the relevant edges on the vertex are looked up. We certainly don't want to look up "all" edges on a vertex and perform a post-filtering step based on the label.

Actual behavior

Using an edge label in the Multi-model API that does not have a corresponding edge class is not supported, unlike in the Tinkerpop 2.x integration.

Steps to reproduce

Create an Edge between vertices (using the multi-model API) using a custom label but without creating an edge class for the label first. Orient requires an edge class and attempts to create a class on the fly if missing. If we're in the middle of a transaction we also get an error message. As far as I can tell, there is no corresponding mechanism in the 3.x multi-model API re. dynamic edge labels to mimic what could be easily done in the Tinkerpop 2.x integration.

luigidellaquila commented 6 years ago

Hi @metametametameta

For now there is no support for "dynamic" edge labels in the new Multi-model API (and I'd say it's too late to have it in the first GA).

We can consider adding it later, though.

What you can do, in the short term, is to manually add a "label" field to edge records and then filter based on that attribute when querying. I understand it's a bit trickier in terms of query syntax and API usage.

I'm flagging this as an enhancement for now

Thanks

Luigi

metametametameta commented 6 years ago

Hi Luigi,

Thanks for your response and we hope this enhancement will be implemented post first GA. We can pursue two alternate solutions in the meantime - see below:

1) Store label field on edge as you have suggested. In this case (using the syntax outlined at http://orientdb.com/orientdb-improved-sql-filtering/), I have something like

select expand(out()[label='myLabel']) from MyClazz

How would this query perform relative to the "standard" way below? A given vertex can have say 6 to 7 different outgoing edge labels - but often we just want to traverse one edge label.

select expand(out('myLabel')) from MyClazz

2) Another option would be to create a huge number of edge classes - would there be any problem doing this? Is there any limit to the number of edge classes used (other than disk space)? We would probably have 2 clusters per unique edge class - but we can likely go over 32K distinct edge classes.

Thanks, Harish.

metametametameta commented 6 years ago

Also, a few design suggestions for dynamic edge labels when it gets implemented.

1) In Tinkerpop 2.x, it looks like if we choose dynamic edge labels, then it's no longer possible to use class-based edge labels (i.e. it's one or the other). Ideally, the multi-model API should allow both cases to co-exist, i.e. if an edge class exists, use that otherwise use dynamic labels (i.e. the default "E" class).

2) There is a pattern that seems to appear over and over with dynamic edge labels. For faster traversal, we often create multiple variations of an edge label depending on the "path" we want to traverse. Assuming that the edge classes are not related via subclassing (as is the case with dynamic edge labeling), having a wildcard match on the edge label would be quite helpful.

So for example, if we have several dynamic edge labels that all have a common prefix 'Friend', then something like outE('Friend*') would be quite useful since there is no polymorphism possible on a common base edge class.
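The two suggestions above can be sketched in a few lines. This is only a toy illustration in plain Python, not OrientDB code: the helper names `resolve_edge_class` and `matching_out_fields` are hypothetical, and the `out_<label>` field naming is assumed from the per-label vertex fields discussed in this thread.

```python
from fnmatch import fnmatchcase

DEFAULT_EDGE_CLASS = "E"  # the single default edge class mentioned above


def resolve_edge_class(label, known_classes):
    """Suggestion 1: use the edge class if it exists, else fall back to "E".

    Returns (class_name, label_attribute). The label attribute is only
    needed when the edge is stored in the default class.
    """
    if label in known_classes:
        return label, None
    return DEFAULT_EDGE_CLASS, label


def matching_out_fields(vertex_fields, pattern):
    """Suggestion 2: select the out_<label> fields whose label matches a
    glob-style pattern such as 'Friend*' (hypothetical wildcard semantics)."""
    prefix = "out_"
    return sorted(
        name
        for name in vertex_fields
        if name.startswith(prefix) and fnmatchcase(name[len(prefix):], pattern)
    )


fields = {"out_Friend": [], "out_FriendOfFriend": [], "out_Likes": [], "in_Friend": []}
print(resolve_edge_class("Likes", {"Likes"}))           # existing class is used
print(resolve_edge_class("FriendOfFriend", {"Likes"}))  # falls back to E + label
print(matching_out_fields(fields, "Friend*"))
```

With per-label fields, a wildcard only has to match against the field names on the vertex, so unrelated edge collections would still never be read.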

luigidellaquila commented 6 years ago

Hi @metametametameta

the actual query would be something like

select expand(outE()[label='myLabel'].inV()) from MyClazz

You can also use OR conditions to filter based on multiple labels.

If you already know that you will have tens of thousands of edge labels, I'd go with option 1

Thanks

Luigi

metametametameta commented 6 years ago

Ok, got it. My remaining question is whether

select expand(outE()[label='myLabel'].inV()) from MyClazz

will have different performance from

select expand(out('myLabel')) from MyClazz

when multiple labels are present on the vertex. i.e. does the first query above traverse all these other labels and take all but one of them out "later" when it sees the label filter?

luigidellaquila commented 6 years ago

Hi @metametametameta

when you use different classes for edges, the edge links are also stored in different collections on vertex records. E.g. if you have two edge classes Foo and Bar, you will have two distinct out properties on the vertex, i.e. out_Foo and out_Bar. In this situation, when you traverse out("Foo"), the engine will only inspect the out_Foo collection and will ignore out_Bar.

If you represent edge types as labels, you will have only a single out field on the vertex, so when you traverse out("Foo") the engine will have to fetch all the edges (both Foo and Bar), filter by label and then fetch the corresponding vertices.

As you can see, you will pay a little in terms of performance, but if you don't have too many edges per vertex it is typically not an issue.
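To make the difference concrete, here is a toy in-memory model of the two layouts described above. It is a sketch in plain Python, not OrientDB internals: the `Vertex` class, the helper functions and the `out_E` field name are all made up for illustration. It only counts how many edge entries each traversal has to look at.

```python
from collections import defaultdict


class Vertex:
    """Toy vertex that keeps outgoing edges in named collections (fields)."""

    def __init__(self, name):
        self.name = name
        self.fields = defaultdict(list)  # field name -> list of edge dicts


def add_edge_with_class(src, dst, edge_class):
    # Class-per-label layout: each edge class gets its own out_<Class> field.
    src.fields["out_" + edge_class].append({"label": edge_class, "in": dst})


def add_edge_with_label(src, dst, label):
    # Label-attribute layout: every edge goes into one generic field
    # ("out_E" is a name chosen for this sketch) and carries a "label".
    src.fields["out_E"].append({"label": label, "in": dst})


def out_by_class(src, edge_class):
    # Only the matching collection is read; other collections are untouched.
    edges = src.fields.get("out_" + edge_class, [])
    return [e["in"] for e in edges], len(edges)


def out_by_label(src, label):
    # Every edge is read first, then filtered by its label attribute.
    edges = src.fields.get("out_E", [])
    return [e["in"] for e in edges if e["label"] == label], len(edges)


def build(add_edge):
    # One vertex with 3 "Foo" edges and 97 "Bar" edges.
    v = Vertex("v")
    for i in range(3):
        add_edge(v, "foo%d" % i, "Foo")
    for i in range(97):
        add_edge(v, "bar%d" % i, "Bar")
    return v


targets_c, inspected_c = out_by_class(build(add_edge_with_class), "Foo")
targets_l, inspected_l = out_by_label(build(add_edge_with_label), "Foo")
print(inspected_c, inspected_l)  # prints "3 100"
```

Both traversals return the same three targets, but the label-based one touches all 100 edges on the vertex, which is why the cost grows with the total number of edges per vertex rather than with the number of matching ones.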

Thanks

Luigi

metametametameta commented 6 years ago

That might be an issue for us as we have many edges per vertex (with different labels) - so the partitioning by distinct properties is quite important for performance reasons. We may have to go with Option 2 to avoid the performance hit. Do you see any problems with using a large number of edge classes/clusters? I believe 3.0 now uses an Integer for cluster ids - so the older 32K limit no longer applies?

luigidellaquila commented 6 years ago

There is no particular issue with having a large number of clusters, but unfortunately the 32k limit still applies.
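For planning purposes, the cluster budget works out roughly as below. This is back-of-the-envelope arithmetic only: it assumes "32k" means 2**15 - 1 = 32,767 clusters and uses the 2 clusters per edge class mentioned earlier in this thread. The function name and the figure of 100 already-used clusters are hypothetical.

```python
MAX_CLUSTERS = 32_767  # assumption: the "32k" limit read as 2**15 - 1


def max_edge_classes(clusters_per_class, clusters_already_used=0,
                     max_clusters=MAX_CLUSTERS):
    """Upper bound on new edge classes that fit in the remaining clusters."""
    if clusters_per_class <= 0:
        raise ValueError("clusters_per_class must be positive")
    remaining = max(max_clusters - clusters_already_used, 0)
    return remaining // clusters_per_class


print(max_edge_classes(2))                             # 16383
print(max_edge_classes(2, clusters_already_used=100))  # 16333
print(max_edge_classes(1))                             # 32767
```

Under these assumptions, two clusters per class caps the design at about 16K edge classes, well short of 32K distinct labels.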

Thanks

Luigi

metametametameta commented 6 years ago

Hi Luigi,

Thanks for your quick response. We might be able to get away with the 32K limit right now. Any plans to extend the 32K limit in the future?

Harish.

luigidellaquila commented 6 years ago

Hi @metametametameta

We had it on the roadmap, but for now it's on hold; we don't have a specific schedule for it.

Thanks

Luigi