orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0
4.73k stars 870 forks source link

Dense node or not dense node ? ... #2878

Closed MarcelPitch closed 7 years ago

MarcelPitch commented 9 years ago

Hi All,

Here's the context. Our graph is composed of :

The schema looks like that :

v.Item ----- e.HAS_LABEL -----> v.Label ----- e.HAS_CULTURE -----> v.Culture

One Culture vertex has about 4M incoming HAS_CULTURE edges from Label vertices.

The problem is that a DELETE VERTEX Label WHERE @rid = #XX:XXXX request, for a Label connected on this dense vertex, takes about 30s to execute. But the same request takes about 1s on a Label vertex, which is connected on another Culture vertex with only 2 000 incoming HAS_CULTURE edges.

Could it be the reason of the slownless ?

If so, is there a way to ignore the amount of edges of the Culture dense vertex ?

Or do we have to create intermediate meta-nodes between Label and Culture vertices ?

Thank you in advance and have a good evening.

lvca commented 9 years ago

@laa RIDBag with so many edges should be a Tree RIDBag where lookup/remove should be log(n), not constant, right?

andrii0lomakin commented 9 years ago

yes, but one single item, inside a tree, but removal in graph db should update both ends of relationship which means that removal iterates through all 8M vertexes from other side of relation. So it means not logarithmic but linear complexity.

andrii0lomakin commented 9 years ago

Any way it is out of the scope of 2.0 for sure I remove milestone.

lvca commented 9 years ago

@laa can we add a return to .remove()? In this was we'd have the edge to traverse and updated the other node. WDYT?

andrii0lomakin commented 9 years ago

@lvca sorry I did not understand. do you propose to leave some vertexes on oposite side of relation unmodified. I mean our relation looks like v1 --[4M virtual edges] ---> v2 so to make data consistent we need to remove reference from 4M + 1 bags.

I would propose just to ignore tombstone during traverse and do background clean up for dead rids. Thanks Good we do not reuse rids. But may be I do not understand your idea.

lvca commented 9 years ago

Absolutely not. This is my idea in pseudo-code:

ORidBag collOut = v1.getEdgeCollection('Friend');
OIdentifiable v2 = collOut.remove( edge );
ORidBag collIn  getVertex( v2 ).getEdgeCollection('Friend');
collIn.remove( edge );

So the key is the line 2, where the remove returns the removed element. Can we support it?

andrii0lomakin commented 9 years ago

We do this, but we have million of such vertexes on other side because we remove current vertex, not edge, so we should remove links from all vertexes on other side.

Your case is edge removal not vertex removal. But issue is with command DELETE VERTEX Label WHERE @rid = #XX:XXXX

WDYT ?

andrii0lomakin commented 9 years ago

Here is edge#remove method.


    final String outFieldName = OrientVertex.getConnectionFieldName(Direction.OUT, edgeClassName, useVertexFieldsForEdgeLabels);
    final boolean outVertexChanged = dropEdgeFromVertex(inVertexEdge, outVertex, outFieldName, outVertex.field(outFieldName));

    // IN VERTEX
    final OIdentifiable outVertexEdge = vOut != null ? vOut : rawElement;
    final ODocument inVertex = getInVertex().getRecord();

    final String inFieldName = OrientVertex.getConnectionFieldName(Direction.IN, edgeClassName, useVertexFieldsForEdgeLabels);
    final boolean inVertexChanged = dropEdgeFromVertex(outVertexEdge, inVertex, inFieldName, inVertex.field(inFieldName));

so I do not expect problems here

lvca commented 9 years ago

@laa if ORIDBag doesn't provide a returning type for remove() method, this would be always slower because a full scan of the RIDBag must be done to get the opposite vertex.

About the original use case the operation is quite expensive because it:

So this operation is expensive, but by providing a fix in remove() could be faster.

Furthermore we could support an asynchronous mode where deletion is split in 2 operations:

Then create an asynchronous task that does:

The good is that all the Graph functions ignore NULL references, so the graph is coherent. The only problem would be the count of the connected edges/vertices that could be not precise right after synchronous deletion.

WDYT?

andrii0lomakin commented 9 years ago

@lvca We have already discussed this problem. Anyway lets provide full detail here to do not forget it any more.

  1. In real code in project we do not have cases when we need to do full scan. Because when we get edge by query or iteration, heavy or lightweight we always know both vertexes. In case we work with edge we get it either by iteration or by query. For heavy edges all is simple they have both vertexes as field values inside the record ("in" and "out"). For lightweight edges if we do iteration in reality we iterate by rids of opposite vertex, if we do query we test fields of opposite vertex too. So for lightweight edges we too always know rids of both vertexes during delete. So we never need full scan during edge or vertex delete. May be I missed something in such case could you provide real test case which shows that full scan is done I could not find it in code.
  2. Solution which you proposed for lightweight edges (I mentioned similar few comments above). The problem with heavyweight edges is that relations/edges will be returned for edge queries after vertex delete which is quite of fun. We of course can always check existence of referenced by edge vertexes during load but would be good to find solution which is not so straight. Also there is question will we allow staled links to dangle forever if we do server shutdown and background removal task is working ?

Anyway I do not think that we should put this feature in 2.0 because it will make it unstable. WDYT ?

lvca commented 9 years ago

Hi Andrey, forget (2) now. About 1 this is the flow:

OrientVertex.remove(); -> OrientVertex.removeEdges()

Here it's a RIDBag and not target vertex is provided so this piece of code is execute:

    // DELETE ALL THE EDGES
    for (Iterator<OIdentifiable> it = bag.rawIterator(); it.hasNext();) {
      final OIdentifiable edge = it.next();
      if (iAlsoInverse)
        removeInverseEdge(iVertex, iFieldName, null, edge, useVertexFieldsForEdgeLabels);

      deleteEdgeIfAny(edge);
    }

Now removeInverseEdge() call receives the "edge" param, so it calls removeEdges() again (for the inverse relationship), but this time passing the iVertexToRemove, so the code called is:

  if (iVertexToRemove != null) {
    // SEARCH SEQUENTIALLY (SLOWER)
    boolean found = false;
    for (Iterator<OIdentifiable> it = bag.rawIterator(); it.hasNext();) {
      final ODocument curr = it.next().getRecord();

      if (iVertexToRemove.equals(curr)) {
        // FOUND AS VERTEX
        it.remove();
        if (iAlsoInverse)
          removeInverseEdge(iVertex, iFieldName, iVertexToRemove, curr, useVertexFieldsForEdgeLabels);
        found = true;
        break;

      } else if (curr.getSchemaClass().isSubClassOf(OrientEdgeType.CLASS_NAME)) {
        final Direction direction = getConnectionDirection(iFieldName, useVertexFieldsForEdgeLabels);

        // EDGE, REMOVE THE EDGE
        if (iVertexToRemove.equals(OrientEdge.getConnection(curr, direction.opposite()))) {
          it.remove();
          if (iAlsoInverse)
            removeInverseEdge(iVertex, iFieldName, iVertexToRemove, curr, useVertexFieldsForEdgeLabels);
          found = true;
          break;
        }
      }
    }

    if (!found)
      OLogManager.instance()
          .warn(null, "[OrientVertex.removeEdges] edge %s not found in field %s", iVertexToRemove, iFieldName);

    deleteEdgeIfAny(iVertexToRemove);

  }

As you can see here, it does a full scan to look for the original vertex to delete only those edges. Well, we could change this code by calling:

bag.remove( iVertexToRemove );

// FIND THE EDGE WITH IN/OUT EQUALS TO iVertexToRemove
edge = ???
bag.remove( edge );

if (iAlsoInverse)
  removeInverseEdge(iVertex, iFieldName, iVertexToRemove, curr, useVertexFieldsForEdgeLabels);

So by calling 2 times remove() we are sure the edge is removed, either if it's a regular edge, or lightweight.

WDYT?

andrii0lomakin commented 9 years ago

@lvca I will fix few issue and will be back soon. Need to do mediation on code.

Ndrou commented 9 years ago

@MarcelPitch, I have the same problem when i delete a node with about 5 million edges. To bypass this problem, i have created a javascript server side function that delete all the edges on the target vertex, then, that delete the vertex.

Here the code (source is my rid param):

var g = orient.getGraph();

var query1 = g.command("sql", "DELETE EDGE FROM (SELECT expand(in()) FROM #" + source + ") TO #" + source);
var query2 = g.command("sql", "DELETE EDGE FROM #" + source + " TO (SELECT expand(out()) FROM #" + source + ")");
var query3 = g.command("sql", "DELETE VERTEX #" + source);

I hope it helps you ;)

MarcelPitch commented 9 years ago

Thank you, @Ndrou !

Now, the operation takes less than a second.

andrii0lomakin commented 8 years ago

Assigned to @tglman we have already discussed this issue with him