Database records in permanent deadlock state after running concurrent queries.

anukul-mohil commented 5 years ago

OrientDB Version: 3.0.3

Java Version: "1.8.0_131"

OS: CentOS 7.3

Expected behavior

Concurrent updates should not result in ODistributedRecordLockedException.

Actual behavior

While making async/concurrent write calls to the OrientDB cluster, we receive the following errors:

b'{\r\n "errors": [\r\n {\r\n "reason": 500,\r\n "code": 500,\r\n "content": "com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (0ms) on acquiring lock on record #-1:-1 on server \'DeadLock\'. It is locked by request 0.53084\\u000d\\u000a\\u0009DB name=\\"test_db\\""\r\n }\r\n ]\r\n}'

and

b'{\r\n "errors": [\r\n {\r\n "reason": 409,\r\n "code": 409,\r\n "content": "com.orientechnologies.orient.core.exception.OConcurrentModificationException: Cannot READ the record #39:11 because the version is not the latest. Probably you are reaing an old record or it has been modified by another user (db=v4 your=v0)\\u000d\\u000a\\u0009DB name=\\"test_db\\"\\u000d\\u000a\\u0009Error Code=\\"3\\""\r\n }\r\n ]\r\n}'
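
Both responses share the same JSON shape: an "errors" list whose entries carry a numeric "code" and a "content" string that begins with the Java exception class. A minimal Python sketch, assuming the body is JSON of that shape, for extracting the code and exception name so a client can treat the 409 and 500 cases differently. The function name `classify_errors` is hypothetical, and the sample payload is an abbreviated version of the 409 response above.

```python
import json


def classify_errors(body: bytes):
    """Return a list of (http_code, exception_class_name) tuples.

    Assumes the body is JSON with an "errors" list, as in the responses above.
    """
    payload = json.loads(body.decode("utf-8"))
    results = []
    for err in payload.get("errors", []):
        content = err.get("content", "")
        # The content starts with the fully qualified class name, then ": message".
        qualified_name = content.split(":", 1)[0].strip()
        short_name = qualified_name.rsplit(".", 1)[-1]
        results.append((err.get("code"), short_name))
    return results


# Abbreviated sample of the 409 response (message text truncated for brevity).
sample = (
    b'{"errors": [{"reason": 409, "code": 409, "content": '
    b'"com.orientechnologies.orient.core.exception.'
    b'OConcurrentModificationException: Cannot READ the record #39:11"}]}'
)
print(classify_errors(sample))  # [(409, 'OConcurrentModificationException')]
```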

Steps to reproduce

I have 3 master nodes, with default-distributed-db-config.json set to:

{ "autoDeploy": true, "readQuorum": 1, "writeQuorum": "majority", "executionMode": "undefined", "readYourWrites": true, "newNodeStrategy": "static", "servers": { "": "master" }, "clusters": { "internal": { }, "": { "servers": [""] } } }

I'm making concurrent write calls (creating edges between nodes that already exist) to each of the three master nodes (for load balancing).
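
For reference, with three master nodes and the quorum values from the configuration above, a write must be acknowledged by two nodes and a read by one. A minimal sketch of that arithmetic, assuming a quorum setting is an integer, "majority", or "all"; the helper `resolve_quorum` is hypothetical and is not part of OrientDB.

```python
def resolve_quorum(setting, master_count):
    """Translate a quorum setting into a number of nodes.

    Assumes the setting is an integer, "majority", or "all".
    """
    if setting == "majority":
        # Majority of N nodes is floor(N / 2) + 1.
        return master_count // 2 + 1
    if setting == "all":
        return master_count
    return int(setting)


# Quorum values copied from the configuration above; 3 is the node count.
config = {"readQuorum": 1, "writeQuorum": "majority"}
print(resolve_quorum(config["writeQuorum"], 3))  # 2
print(resolve_quorum(config["readQuorum"], 3))   # 1
```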

Issues: I understand why a record might be locked, since we're making async updates, but we get this error frequently even with retries in place. The code: 409 error appears to have been resolved by the retries, but the code: 500 error remained. The error response also looks odd: it reports Timeout (0ms) and a record ID of #-1:-1.

I don't know whether this is a bug or whether I can just tune the timeout setting. Also, is there a recommended strategy for making high-volume async calls to OrientDB? I tried asynchronous replication mode by changing executionMode to asynchronous, but the error did not go away.
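
To make the "retries in place" concrete, here is a minimal sketch of retrying with capped exponential backoff and jitter. This is a generic client-side pattern, not a strategy recommended by OrientDB, and this thread reports that retries alone did not clear the code: 500 error. `ContentionError`, `retry_with_backoff`, and every default value are hypothetical.

```python
import random
import time


class ContentionError(Exception):
    """Hypothetical marker raised by the caller for a 409 or 500 lock response."""


def retry_with_backoff(op, max_attempts=5, base_delay=0.05, max_delay=2.0,
                       sleep=time.sleep):
    """Call op() until it succeeds or max_attempts is reached.

    Waits a random time up to a capped, exponentially growing bound between
    attempts, so concurrent writers do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except ContentionError:
            if attempt == max_attempts:
                raise  # Give up and surface the last error to the caller.
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, bound))


# Usage: an operation that fails twice with contention, then succeeds.
attempts = {"count": 0}


def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ContentionError("record locked")
    return "ok"


print(retry_with_backoff(flaky_write, sleep=lambda seconds: None))  # ok
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.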

Any help is appreciated.

EDIT: We now also cannot drop the database. When we try to run g.V().drop(), we receive: com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (0ms) on acquiring lock on record #-1:-1 on server 'DeadLock'. It is locked by request 0.33934 DB name="test_db". This is possibly related to the errors above, since no other operations were performed on the DB.

gtadudeps commented 5 years ago

We are also facing the same issue: com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (0ms) on acquiring lock on record #-1:-1 on server 'DeadLock'. The exception persists even when we retry 50 times. We're not sure whether this is a bug or whether a setting needs to be tweaked.

UPDATE: The same codebase works absolutely fine with Orient v2.2.35.

rajgopalv commented 5 years ago

Hi, any update on this? I'm facing the same issue in 3.0.9.

rajgopalv commented 5 years ago

com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (1000ms) on acquiring lock on record #-1:-1 on server 'DeadLock'. It is locked by request 0.6990
    DB name="***"
    DB name="***"
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.throwSerializedException(OChannelBinaryAsynchClient.java:318)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:275)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:191)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:153)
    at com.orientechnologies.orient.client.remote.OStorageRemote.beginResponse(OStorageRemote.java:1931)
    at com.orientechnologies.orient.client.remote.OStorageRemote.lambda$networkOperationRetryTimeout$2(OStorageRemote.java:345)
    at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:404)
    at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationRetryTimeout(OStorageRemote.java:328)
    at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationNoRetry(OStorageRemote.java:358)
    at com.orientechnologies.orient.client.remote.OStorageRemote.commit(OStorageRemote.java:1119)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.internalCommit(ODatabaseDocumentAbstract.java:2750)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:534)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:100)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.commit(ODatabaseDocumentAbstract.java:2226)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.commit(ODatabaseDocumentAbstract.java:2196)
    at org.apache.tinkerpop.gremlin.orientdb.OrientTransaction.doCommit(OrientTransaction.java:68)
    at org.apache.tinkerpop.gremlin.structure.util.AbstractTransaction.commit(AbstractTransaction.java:104)
    at org.apache.tinkerpop.gremlin.orientdb.OrientGraph.commit(OrientGraph.java:559)

Mp2017 commented 5 years ago

Hi, any update on this? I'm facing the same issue in 3.0.10 with a 2-node cluster.

Mp2017 commented 5 years ago

com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (1000ms) on acquiring lock on record #-1:-1 on server 'DeadLock'. It is locked by request 1.682084
    DB name="xyz"
    DB name="xyz"
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.throwSerializedException(OChannelBinaryAsynchClient.java:318)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:275)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:191)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:153)
    at com.orientechnologies.orient.client.remote.OStorageRemote.beginResponse(OStorageRemote.java:1931)
    at com.orientechnologies.orient.client.remote.OStorageRemote.lambda$networkOperationRetryTimeout$2(OStorageRemote.java:345)
    at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:404)
    at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationRetryTimeout(OStorageRemote.java:328)
    at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationNoRetry(OStorageRemote.java:358)
    at com.orientechnologies.orient.client.remote.OStorageRemote.commit(OStorageRemote.java:1119)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.internalCommit(ODatabaseDocumentAbstract.java:2750)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:534)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:101)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.commit(ODatabaseDocumentAbstract.java:2226)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.commit(ODatabaseDocumentAbstract.java:2196)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.saveGraph(ODatabaseDocumentAbstract.java:2087)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.save(ODatabaseDocumentAbstract.java:2038)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentAbstract.save(ODatabaseDocumentAbstract.java:84)
    at com.orientechnologies.orient.core.record.impl.ODocument.save(ODocument.java:2109)
    at com.orientechnologies.orient.core.record.impl.ODocument.save(ODocument.java:2100)
    at com.orientechnologies.orient.core.record.impl.OEdgeDelegate.save(OEdgeDelegate.java:453)

pettyandydog commented 5 years ago

I get the same error when saving edges or vertices concurrently:

com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (1000ms) on acquiring lock on record #-1:-1 on server 'DeadLock'. It is locked by request 1.14233747
    DB name="didigraph"
    DB name="didigraph"
    at sun.reflect.GeneratedConstructorAccessor62.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.throwSerializedException(OChannelBinaryAsynchClient.java:318)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:275)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:191)
    at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:153)
    at com.orientechnologies.orient.client.remote.OStorageRemote.beginResponse(OStorageRemote.java:1770)
    at com.orientechnologies.orient.client.remote.OStorageRemote.lambda$networkOperationRetryTimeout$2(OStorageRemote.java:226)
    at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:284)
    at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationRetryTimeout(OStorageRemote.java:214)
    at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationNoRetry(OStorageRemote.java:239)
    at com.orientechnologies.orient.client.remote.OStorageRemote.command(OStorageRemote.java:891)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentRemote.command(ODatabaseDocumentRemote.java:370)
    at com.orientechnologies.orient.jdbc.OrientJdbcPreparedStatement.executeCommand(OrientJdbcPreparedStatement.java:101)

hxgxs1 commented 5 years ago

Facing the same issue in 3.0.11. Did anybody find a fix for this?

jonsalvas commented 4 years ago

Same issue in 3.0.13. This should be prioritized by the Orient team ASAP, as it is a blocker for production use of OrientDB.
