orientechnologies / orientdb-gremlin

TinkerPop3 Graph Structure Implementation for OrientDB
Apache License 2.0

query by label and performance #79

Closed ismaelpernas closed 8 years ago

ismaelpernas commented 8 years ago

I have several classes in my OrientDB database, but I am using three of them for this example (ClassA, ClassB and ClassC), in a tree-like relationship. There are 13K vertices in the DB, of which 2K are ClassA, 2K are ClassB and 2K are ClassC. I want to query for all dependencies of ClassA, so I started with the following piece of code:

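import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// g is the graph traversal source, obtained elsewhere.
// Start at every ClassA vertex, then follow incoming edges twice to reach ClassC.
GraphTraversal<Vertex, Vertex> query = g
    .V()
    .hasLabel("ClassA")
    .in("ClassA_ClassB")
    .in("ClassB_ClassC");

// Print each ClassC vertex reached by the traversal.
while (query.hasNext()) {
    Vertex next = query.next();
    System.out.println(next);
}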

This piece of code prints all the ClassC vertices as expected, but it takes quite a long time: about 3 seconds to return the 2K ClassC vertices. I also get the following message in the logs:

WARNING: scanning through all vertices without using an index for Traversal [OrientGraphStep(vertex,[~label.eq(ClassA)])....

Apparently the query is not hitting any index. I thought hasLabel() would be able to use @class to narrow down the vertices by type. Is this a correct use of the hasLabel() method? Do you have any suggestions for improving the performance of such a query?

mpollmeier commented 8 years ago

You're using it correctly, and it's correct that it doesn't use an index. What you're doing is querying the whole class, which is like a full table scan in a relational DB.
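To make the full-scan comparison concrete, here is a minimal, library-independent sketch in plain Java. It is not how the driver or OrientDB works internally; the `V` type, `scanByLabel`, `groupByLabel` and `sample` are all hypothetical names made up for illustration. It only contrasts filtering every vertex by label with looking vertices up in a structure that was grouped by label beforehand.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LabelScanSketch {
    // Hypothetical stand-in for a vertex: just an id and a label.
    static final class V {
        final int id;
        final String label;
        V(int id, String label) { this.id = id; this.label = label; }
    }

    // Full scan: every vertex is visited and its label is compared.
    static List<V> scanByLabel(List<V> all, String label) {
        List<V> out = new ArrayList<>();
        for (V v : all) {
            if (v.label.equals(label)) {
                out.add(v);
            }
        }
        return out;
    }

    // Index-like structure: vertices are grouped by label once, up front,
    // so a later lookup only touches the vertices of the requested label.
    static Map<String, List<V>> groupByLabel(List<V> all) {
        Map<String, List<V>> byLabel = new HashMap<>();
        for (V v : all) {
            byLabel.computeIfAbsent(v.label, k -> new ArrayList<>()).add(v);
        }
        return byLabel;
    }

    // Builds a toy data set with `perLabel` vertices for each given label.
    static List<V> sample(int perLabel, String... labels) {
        List<V> all = new ArrayList<>();
        int id = 0;
        for (String label : labels) {
            for (int i = 0; i < perLabel; i++) {
                all.add(new V(id++, label));
            }
        }
        return all;
    }

    public static void main(String[] args) {
        List<V> all = sample(2000, "ClassA", "ClassB", "ClassC");
        List<V> scanned = scanByLabel(all, "ClassA");
        List<V> lookedUp = groupByLabel(all).getOrDefault("ClassA", new ArrayList<>());
        System.out.println(scanned.size() + " " + lookedUp.size());
    }
}
```

Both approaches return the same vertices; the difference is only in how many vertices have to be touched per query.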

The reason it's slow is that it's executing your traversal locally and has to serialise all elements. Following two in edges for the numbers you mentioned already creates a reasonable number of elements to serialise. I'm guessing you're connecting to a remote orient? To verify that thought: how fast is it if you connect to a local db (use something like plocal:myDatabasePath in the connect string)?

ismaelpernas commented 8 years ago

I tried with plocal and it went down to 955ms.

Is this something that would improve by using the 2.6 version supported by OrientDB 2.2?

My ultimate goal is to be able to get a count of dependencies by class type. For example: "Give me ClassA items where status = active and a count of their dependencies". It returns 10 ClassA items and a list of counts:

ClassA = items 1-10 and their properties
ClassB = 8
ClassC = 12
ClassD = 20
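The kind of result described above can be sketched in plain Java, independent of any graph library. This is only an illustration under stated assumptions: the class `DependencyCountSketch`, the method `countDependenciesByLabel`, and the two maps (vertex id to label, vertex id to the ids of vertices with an edge pointing at it) are hypothetical and do not correspond to the OrientDB or TinkerPop API. The sketch walks incoming edges breadth-first from one start vertex and counts each reachable dependency once, grouped by its label.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class DependencyCountSketch {
    // labels:   vertex id -> class label (hypothetical in-memory model)
    // incoming: vertex id -> ids of vertices with an edge pointing at it
    static Map<String, Integer> countDependenciesByLabel(
            String start,
            Map<String, String> labels,
            Map<String, List<String>> incoming) {
        Map<String, Integer> counts = new TreeMap<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String dep : incoming.getOrDefault(current, Collections.emptyList())) {
                // seen.add returns false for already visited vertices,
                // so a shared dependency is counted only once.
                if (seen.add(dep)) {
                    counts.merge(labels.get(dep), 1, Integer::sum);
                    queue.add(dep);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new HashMap<>();
        labels.put("a1", "ClassA");
        labels.put("b1", "ClassB");
        labels.put("b2", "ClassB");
        labels.put("c1", "ClassC");
        labels.put("c2", "ClassC");
        labels.put("c3", "ClassC");

        Map<String, List<String>> incoming = new HashMap<>();
        incoming.put("a1", Arrays.asList("b1", "b2"));
        incoming.put("b1", Arrays.asList("c1", "c2"));
        incoming.put("b2", Arrays.asList("c2", "c3"));

        // prints {ClassB=2, ClassC=3}
        System.out.println(countDependenciesByLabel("a1", labels, incoming));
    }
}
```

Running the same function once per matching ClassA item would give the per-item counts; the start vertex itself is not included in the counts.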

mpollmeier commented 8 years ago

To improve the performance, the driver would have to optimise your query and run only the necessary steps on the DB side, rather than serialising all these elements and sending them back and forth. That's totally doable, but it's quite a lot of effort to implement and should be done by the orient team; we've just made a start here with a prototype driver. Sorry to destroy your dreams :)

ismaelpernas commented 8 years ago

No problem Michael. Thanks for your reply!