pietermartin / sqlg

TinkerPop graph over sql
MIT License
244 stars 51 forks source link

Provision for unique ids while using bulkAddEdges #490

Closed SnehaMathankar closed 1 year ago

SnehaMathankar commented 1 year ago

I have created Edge Class with unique identifier column 'id'. I am trying to create Edges using bulkAddEdges function. I am unable to figure out how to achieve this. It would be of great help if I could get any suggestion on this.

pietermartin commented 1 year ago

bulkAddEdges is for joining up out and in vertex pairs in one call to the db. It is not possible to also set a unique identifier using it.

If you explain your use case a bit more, I might be able to help further.

SnehaMathankar commented 1 year ago

Sir, we have requirement to crawl database. Following are the details of database components:

schema -> 134 table -> 10366
column -> 253325 PK -> 6703
FK -> 32077

We have total 134 Schemas & 10366 Tables in those schemas. We have created 5 Vertex classes for Schemas, Tables, Columns, PrimaryKey & ForeignKey. We created all the vertexes.It took 2 min to complete this task.

Now, we have to link Schemas with Tables, Tables with Columns, Tables with PrimaryKey & Tables with ForeignKey. For this we have created Edges class 'Contains' with unique identifier column for the same. When we are creating Edges using 'addVertex' to create each link we are able to do so with unique identifier column. However, it takes almost 3 hour to complete this task.

To optimize this, we tried 'bulkAddEdges'. It improved the time. But we are unable to figure out how to add unique identifier column in this case as 'bulkAddEdges' have no provision for the same.

I am attaching code snippet. It would be of great help if you could suggest any way to achieve this. Thanks in advance. UsingAddEdge.txt UsingBulkAddEdges.txt

pietermartin commented 1 year ago

I had a look.

In the first one UsingAddEdge.txt I recommend always specifying the label in your queries. With no label specified, all VertexLabels with the particular property name will have to be queries. i.e. Vertex outVertex = sqlgGraph.traversal().V().has("id", "Schema"+j).next(); should be Vertex outVertex = sqlgGraph.traversal().V().hasLabel("Schema").has("id", "Schema"+j).next();

Your usage of bulkAddEdges is correct. Unfortunately there is no way at the moment to specify individual data for each edge using this method. I'd have to investigate if it possible.

Here is some code that might help.

@Test
public void issue490NormalBatchMode() {
    StopWatch stopWatch = StopWatch.createStarted();
    VertexLabel schemaVertexLabel = this.sqlgGraph.getTopology().ensureVertexLabelExist(
            this.sqlgGraph.getSqlDialect().getPublicSchema(), "Schema",
            new LinkedHashMap<>() {{
                put("id", PropertyDefinition.of(PropertyType.STRING, Multiplicity.of(1, 1)));
            }},
            ListOrderedSet.listOrderedSet(Collections.singletonList("id"))
    );
    VertexLabel tableVertex = this.sqlgGraph.getTopology().ensureVertexLabelExist(
            this.sqlgGraph.getSqlDialect().getPublicSchema(), "Table",
            new LinkedHashMap<>() {{
                put("id", PropertyDefinition.of(PropertyType.STRING, Multiplicity.of(1, 1)));
                put("schema", PropertyDefinition.of(PropertyType.STRING, Multiplicity.of(1, 1)));
            }},
            ListOrderedSet.listOrderedSet(Collections.singletonList("id"))
    );
    schemaVertexLabel.ensureEdgeLabelExist("contains", tableVertex,
            new LinkedHashMap<>() {{
                put("id", PropertyDefinition.of(PropertyType.STRING));
            }}
    );
    this.sqlgGraph.tx().commit();
    stopWatch.stop();
    LOGGER.info("time to create topology: {}", stopWatch);
    stopWatch.reset();
    stopWatch.start();

    this.sqlgGraph.tx().normalBatchModeOn();
    for (int i = 0; i < 134; i++) {
        this.sqlgGraph.addVertex(T.label, "Schema", "id", "Schema_" + i);
    }
    for (int i = 0; i < 10366; i++) {
        this.sqlgGraph.addVertex(T.label, "Table", "id", "Table_" + i, "schema", "Schema_" + (i % 134));
    }
    List<Vertex> schemas = this.sqlgGraph.traversal().V().hasLabel("Schema").toList();
    for (Vertex schema : schemas) {
        String schemaId = schema.value("id");
        List<Vertex> tables = this.sqlgGraph.traversal().V()
                .hasLabel("Table")
                .has("schema", schemaId)
                .toList();
        for (Vertex table : tables) {
            String tableId = table.value("id");
            schema.addEdge("contains", table, "id", schemaId + "-" + tableId);
        }
    }
    this.sqlgGraph.tx().commit();
    stopWatch.stop();
    LOGGER.info("time to insert: {}", stopWatch);
    stopWatch.reset();
    stopWatch.start();

    List<Map<String, Vertex>> schemaTables = this.sqlgGraph.traversal().V().hasLabel("Schema").as("schema")
            .out("contains").as("table")
            .<Vertex>select("schema", "table")
            .toList();
    for (Map<String, Vertex> schemaTable : schemaTables) {
        Vertex schema = schemaTable.get("schema");
        Vertex table = schemaTable.get("table");
        LOGGER.info("schema {} -> table {}", schema.value("id"), table.value("id"));
    }
    stopWatch.stop();
    LOGGER.info("time to query : {}", stopWatch);

}

On my laptop the time taken is, time to create topology: 00:00:00.063 time to insert: 00:00:00.592 bunch of logs time to query : 00:00:00.124

Let me know if this helps.

SnehaMathankar commented 1 year ago

Thankyou sir, I really appreciate your quick response. I will try the solution as suggested by you & revert back.

pietermartin commented 1 year ago

Closing this for now, feel free to reopen it.