pietermartin / sqlg

TinkerPop graph over sql
MIT License
246 stars 51 forks source link

1.4.0 regressions #230

Closed JPMoresmau closed 7 years ago

JPMoresmau commented 7 years ago

I tried to update our project to use 1.4.0 instead of 1.3.4. Only a couple of Topology changes to the code, but at runtime we see the following regressions:

pietermartin commented 7 years ago

Are the performance test using normal bulk mode?

JPMoresmau commented 7 years ago

I haven't looked at the performance in detail, but our code does g.tx().normalBatchModeOn();. We then only do normal write operations, not bulk load.

pietermartin commented 7 years ago

Afraid I can not duplicate the order by on edges issue. Added some tests in TestGraphStepOrderBy to try duplicate the issue.

For a single edge label the order by happens on the db and for multiple edge label it happens in java. The multiple label order by is new code. The query is still optimized and then sorted afterwards in java.

JPMoresmau commented 7 years ago

Ha ha! Sorry for not qualifying enough. Indeed all your tests pass, but still not my code. It turns out both vertices have the SAME label. Then we also do a values on the resulting vertex, but I'm not sure it plays. This test fails (note all vertices have same label A):

    @Test
    public void testOrderOnEdgeWithInVValues() {
        Vertex a1 = this.sqlgGraph.addVertex(T.label, "A");
        Vertex b1 = this.sqlgGraph.addVertex(T.label, "A","name","b1");
        Vertex b2 = this.sqlgGraph.addVertex(T.label, "A","name","b2");
        Vertex b3 = this.sqlgGraph.addVertex(T.label, "A","name","b3");
        a1.addEdge("ab", b1, "order", 0);
        a1.addEdge("ab", b2, "order", 11);
        a1.addEdge("ab", b3, "order", 2);
        this.sqlgGraph.tx().commit();
        DefaultGraphTraversal<Vertex, Object> traversal = (DefaultGraphTraversal<Vertex, Object>) this.sqlgGraph.traversal()
                .V(a1.id())
                .outE("ab")
                .order().by("order", Order.decr)
                .inV().values("name");
        //assertEquals(5, traversal.getSteps().size());
        List<Object> vertices = traversal.toList();
        //assertEquals(1, traversal.getSteps().size());
        assertStep(traversal.getSteps().get(0), true, false, false, false);
        assertEquals(3, vertices.size());
        assertEquals("b2", vertices.get(0));
        assertEquals("b3", vertices.get(1));
        assertEquals("b1", vertices.get(2));

        traversal = (DefaultGraphTraversal<Vertex, Object>) this.sqlgGraph.traversal()
                .V(a1.id())
                .outE("ab")
                .order().by("order", Order.incr)
                .inV().values("name");
        //assertEquals(5, traversal.getSteps().size());
        vertices = traversal.toList();
        assertEquals(1, traversal.getSteps().size());
        assertStep(traversal.getSteps().get(0), true, false, false, false);
        assertEquals(3, vertices.size());
        assertEquals("b1", vertices.get(0));
        assertEquals("b3", vertices.get(1));
        assertEquals("b2", vertices.get(2));
    }

In this case we seem to do an inner join on a sub query and we lost the order by.

JPMoresmau commented 7 years ago

Regarding performance, I'm not sure yet what it is. We have traversal on edges, and it seems we end up doing a first query to retrieve the edges, and other queries to retrieve the next vertex. In 1.3.4 the query to retrieve the vertex was of the form select <vertex properties> from <edge> inner join <vertex> where <edge>.ID=? whereas in 1.4.0 we have select 1 as index <vertex properties> from <edge> inner join <vertex> where <edge>.ID=1234 order by index because we generate code from a SchemaTableTree that has one parentIdsAndIndexes, so we lose the parameter (the edge id is hardcoded). Not sure if this causes the slowdown. We also seem to have queries with within clauses, that now generate inner join values(...) sql. I don't find anything like that in the debug output of 1.3.4 (no temporary tables usage) so I wonder if we don't have extra conditions that would slow us down. Actually, it seems we do things twice: we get the vertex from the edge as indicated above, and then below in the debug output I see a query that selects again all vertices with a VALUES clause that contains tuples edge id, incremental index...

pietermartin commented 7 years ago

The first query you mention is part of the 1.4.0 refactor for previously unoptimized gremlins. Previously for unoptimized gremlins, a sql query is executed for every incomiing start in the step. This has a massive performance impact because of latency. The refactor now caches all the incoming starts, i.e. it fully traverses the iterator. It then executes the current steps for all of the incoming starts in one query using a values expression in the from clause. This reduces latency a lot and also has the effect of changing the semantics from a depth first traversal to a breath first traversal. It also unfortunately is a deep change and requires many TinkerPop steps to be rewritten to take advantage of this new way.

If you have an example I can do profiling to see where things slow down. In my own production code at work we make use of the the RepeatStep to calculate transmission paths. This code uses the new way and is working so so. It is fast at first but slows down a lot when the incoming traversers gets to a 500 000 or so. I need to look further but investigation so far shows the slow down is in java with hashCode, path calculations and TinkerPop's TravererSet. Not really sure what yet what going on but the good news is that Postgresql remains very fast.

Previously P.within() would join insert the within values into a temporary table and join to that. Now I use the values expression in the from clause and join to that instead. As its all in memory it its faster. I have tested up to 1 000 000 values and so far neither Postgresql nor java has complained. This should however be made memory aware to deal with very large within values.

For your performance tests that are slower is the schema fully created upfront?

pietermartin commented 7 years ago

The steps so far optimized, i.e. unoptimized at first but converted to a breath first traversal is ChooseStep, OptionalStep, LocalStep and RepeatStep.

The index in the from clause is used to order the ResultSet. This is used to sort the results by the incoming starts. This is important so as not to loose the order of the incoming elements in case there was an OrderBy step previously.

JPMoresmau commented 7 years ago

A query on edges with a inV filter seems to be unoptimized:

vertexTraversal(this.sqlgGraph, v1).outE()
                    .where(__.inV().has("name",P.within("p4","p2")));

And I have four edges from v1 generates 5 queries: one to retrieve the edges and one for each edge with hardcoded edge id. Full test

    @Test
    public void testOutEWithAttributes() throws Exception {
        Vertex v1 = this.sqlgGraph.addVertex(T.label, "Person", "name", "p1");
        Vertex v2 = this.sqlgGraph.addVertex(T.label, "Person", "name", "p2");
        Vertex v3 = this.sqlgGraph.addVertex(T.label, "Person", "name", "p3");
        Vertex v4 = this.sqlgGraph.addVertex(T.label, "Person", "name", "p4");
        Vertex v5 = this.sqlgGraph.addVertex(T.label, "Person", "name", "p5");

        v1.addEdge("aaa", v2,"real",true);
        v1.addEdge("aaa", v3,"real",false);
        v1.addEdge("aaa", v4,"real",true,"other","one");
        v1.addEdge("aaa", v5,"real",false);

         this.sqlgGraph.tx().commit();

        GraphTraversal<Vertex, Edge> gt=vertexTraversal(this.sqlgGraph, v1).outE()
                    .where(__.inV().has("name",P.within("p4","p2")));

        assertEquals(2,gt.count().next().intValue());

    }
JPMoresmau commented 7 years ago

And if I add another inV() at the end of the traversal:

  GraphTraversal<Vertex, Vertex> gt=vertexTraversal(this.sqlgGraph, v1).outE()
                    .where(__.inV().has("name",P.within("p4","p2")))
                    .inV();

then I get one query for each edge, and then one big query giving me all the vertices:

SELECT
    1 as "index",
    "public"."V_Person"."ID" AS "alias1",
    "public"."V_Person"."name" AS "alias2"
FROM
    "public"."E_aaa" INNER JOIN
    "public"."V_Person" ON "public"."E_aaa"."public.Person__I" = "public"."V_Person"."ID"
WHERE
    "public"."E_aaa"."ID" = 4 AND ( "public"."V_Person"."name" in (?, ?))
ORDER BY
    "index"

(multiple times with different IDs) followed at the end by:

SELECT
    "index" as "index",
    "public"."V_Person"."ID" AS "alias1",
    "public"."V_Person"."name" AS "alias2"
FROM
    "public"."E_aaa" INNER JOIN
    "public"."V_Person" ON "public"."E_aaa"."public.Person__I" = "public"."V_Person"."ID" INNER JOIN
    (VALUES(1, 1),(3, 2)) AS tmp ("tmpId", "index") ON "public"."E_aaa"."ID" = tmp."tmpId"
ORDER BY
    "index"

It looks like we do a query on each edge to see which matches, and then a full query on all the matching edges. We should at least parametrize the first query, and then look at how to do one query instead of n+1.

pietermartin commented 7 years ago

where is not optimized so it is a query per edge. Is this the regression? I have not checked now on 1.3.4 but on then it was not optimized either I think. where and match needs to be optimized (starts cached) similar to SqlgChooseStep then it will do it in 2 queries. One for the vertex and one for all the edges.

pietermartin commented 7 years ago

Or a proper deeper optimization in GraphStrategy and VertexStrategy that interrogates the where traversal to see if it can be optimized. After that, if still unoptimized, it can get a secondary optimization via the starts cache trick.

pietermartin commented 7 years ago

I am still wondering where the performance degradation is that you are seeing. I ran the test you showed above, without the extra inV on 1.3.4 and master.

running it 10 000 times on 1.3.4 is about 21 seconds. on master its about 16 seconds.

So its actually faster on master though I really did not expect that as the query is more complex with an additional order by clause.

JPMoresmau commented 7 years ago

We've rewritten some code for other reasons and now we don't see a slowdown. You can close the issue, we'll see if we run into such an issue again. Thanks!

Le 27 sept. 2017 20:17, "pietermartin" notifications@github.com a écrit :

I am still wondering where the performance degradation is that you are seeing. I ran the test you showed above, without the extra inV on 1.3.4 and master .

running it 10 000 times on 1.3.4 is about 21 seconds. on master its about 16 seconds.

So its actually faster on master though I really did not expect that as the query is more complex with an additional order by clause.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pietermartin/sqlg/issues/230#issuecomment-332610729, or mute the thread https://github.com/notifications/unsubscribe-auth/AAHdoXyvjCezXV8dAouJ8Yhkykle8L-nks5smpE0gaJpZM4PL05W .

pietermartin commented 7 years ago

And the dead locks, are they also gone for now?

JPMoresmau commented 7 years ago

The deadlocks were random so I could make them appear on 1.3 too, so it's not a 1.4 regressions. We fixed them by writing as much of the topology we can first, commiting, and doing the longer operations without requiring the topology lock.

pietermartin commented 7 years ago

Ok, I'll close the issue for now, hopefully we can reproduce it some day. Thanks.