Server deadlock with multiple concurrent binary clients

gerdhub commented 8 years ago

Using version 2.1.12, I have multiple concurrent threads, each using its own graph instance, reading, creating and updating records in the DB. After about 20.000 cycles, all clients hang. I debugged the server and noticed that_ all _binary clients threads (at the server) where stuck at:

"OrientDB <- BinaryClient (/10.1.100.22:61097)"
    at com.orientechnologies.common.concur.lock.OReadersWriterSpinLock.acquireWriteLock(OReadersWriterSpinLock.java:132)
    at com.orientechnologies.common.concur.lock.OModificationLock.prohibitModifications(OModificationLock.java:96)
    at com.orientechnologies.common.concur.lock.OModificationLock.prohibitModifications(OModificationLock.java:84)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.synch(OAbstractPaginatedStorage.java:1276)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.checkLowDiskSpaceAndFullCheckpointRequests(OAbstractPaginatedStorage.java:2894)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.commit(OAbstractPaginatedStorage.java:1150)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:606)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:156)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2679)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2648)
    at com.tinkerpop.blueprints.impls.orient.OrientTransactionalGraph.commit(OrientTransactionalGraph.java:161)
    at com.orientechnologies.orient.graph.sql.OGraphCommandExecutorSQLFactory.runInConfiguredTxMode(OGraphCommandExecutorSQLFactory.java:262)
    at com.orientechnologies.orient.graph.sql.OCommandExecutorSQLCreateEdge.execute(OCommandExecutorSQLCreateEdge.java:153)
    at com.orientechnologies.orient.core.sql.OCommandExecutorSQLDelegate.execute(OCommandExecutorSQLDelegate.java:90)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.executeCommand(OAbstractPaginatedStorage.java:1530)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.command(OAbstractPaginatedStorage.java:1511)
    at com.orientechnologies.orient.core.command.OCommandRequestTextAbstract.execute(OCommandRequestTextAbstract.java:67)
    at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.command(ONetworkProtocolBinary.java:1323)
    at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:400)
    at com.orientechnologies.orient.server.network.protocol.binary.OBinaryNetworkProtocolAbstract.execute(OBinaryNetworkProtocolAbstract.java:223)
    at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77)

I tried to reduce the transaction sizes to an absolute minimum without luck. In this situation I even cannot shutdown the server cleanly but have to do a kill -9 and restart to re-enable binary clients to connect to the server.

This might relate to #4613

andrii0lomakin commented 8 years ago

Hi that is single stack trace, it does not show reason of deadlock. Could you provide full stack trace ?

andrii0lomakin commented 8 years ago

Hi, any updates ?

gerdhub commented 8 years ago

In my test case, I'm creating (and selecting) edges the following way:

public Edge addOrUpdateEdgeByParams(Vertex origin, String label, Vertex destination, Map<String, Object> params) {

        // select from HAS_TOKEN WHERE out = #16:0 and in = #15:0
        Iterable<Edge> result = new Query<Edge>(graph).SQLQuery("SELECT FROM " + label + " WHERE out = ? and in = ? AND " + getWhereClause(params), origin.getId(), destination.getId());
        Iterator iter = result.iterator();
        if (iter.hasNext()) {
            return (Edge) iter.next();
        } else {
            for (int i = 0; i < MAXTRIES; i++) {
                try {
                    Iterable<Edge> newvertexresult = (Iterable<Edge>) new Query<Edge>(graph).SQLCommand("CREATE EDGE " + label + " FROM ? TO ?", origin.getId(), destination.getId());
                    OrientEdge e = (OrientEdge) newvertexresult.iterator().next();
                    graph.commit();

                    setProperties(e, params);
                    return e;
                } catch (OConcurrentModificationException ex) {
                    reload(origin);
                    reload(destination);
                }
            }

        }

        throw new RuntimeException("Could not resolve OConcurrentModificationException");

    }

    public Iterable<T> SQLQuery(String sql_cmd, Object... params) {
        OSQLSynchQuery<T> qr = new OSQLSynchQuery<>(sql_cmd);
        Iterable<T> i = graph.command(qr).execute(params);
        return i;
    }

 public String getWhereClause(Map<String, Object> params) {
        List<String> where = new ArrayList<>();
        for (Map.Entry<String, Object> p : params.entrySet()) {
            String value = p.getValue().toString().replace("'", "\\'");
            where.add(p.getKey() + " = '" + value + "'");
        }

        return String.join(" AND ", where);
    }

public void setProperties(Element e, Map<String, Object> params) {
        OrientElement oe = (OrientElement) e;

        for (int i = 0; i < MAXTRIES; i++) {
            try {
                if (!oe.getProperties().equals(params)) {
                    oe.setProperties(params);
                }
                graph.commit();
                break;
            } catch (OConcurrentModificationException ex) {
                reload(oe);
            }
        }
    }

The edges in re have a unique index on in, out and the where params.

This function is executed multiple times by 8 concurrent threads.

After some time (about 10000 runs), the server hangs with the following thread dump: https://gist.github.com/gerdhub/7459ddda0565d2104892 (Server + Client)

As the server is running on a remote server, I cannot take a complete memory dump at the moment.

I meanwhile also updated to Version 2.1.13...

Does this help?

andrii0lomakin commented 8 years ago

Hi, yes. That is the same as https://github.com/orientechnologies/orientdb/issues/5875 . But I will not close it to double check. Also as with issue mentioned above we need the same data.

gerdhub commented 8 years ago

If the server keeps running in this state, I get the following errors after several hours - this happens after a "ordered" shutdown, however the server won't shut down and has to be killed -9 manually:

Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB WAL Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "OrientDB Write Cache Flush Task (huebscher)" java.lang.OutOfMemoryError: GC overhead limit exceeded

The db is about 260 GB with max. 6 GB heap memory.

andrii0lomakin commented 8 years ago

Not on this state. We need server logs before this state and memory dump if possible.

gerdhub commented 8 years ago

Here is another thread dump of the server: https://gist.github.com/gerdhub/0ca623a6a46bbdba06fd

How may I share the heap dump (jmap, binary)?

andrii0lomakin commented 8 years ago

Good, could you send us server logs, that is very important. About heap dump, if I get you correctly jmap will be good to go.

gerdhub commented 8 years ago

This is the server log: https://gist.github.com/gerdhub/cf77bf38b5fc214bfc11

The last line is a query that is executed during the start of the client application, there are no entries when the deadlock happens...

The heap dump is about 280 MB - I sent it via we transfer to your email.

andrii0lomakin commented 8 years ago

well if you do not mind google drive would be good one, if you do not have space may share it with you.

andrii0lomakin commented 8 years ago

Sorry about server log, is that content of files from server's log directory ?

gerdhub commented 8 years ago

Yes - but it is the same as the console output... By the way, I just added the corresponding client thread dump to the server thread dump gist.

taburet commented 8 years ago

@gerdhub, I inspected the server heap dump you have sent and was unable to find any deadlocked threads in it. Seem like it is a wrong heap dump or it was captured at a wrong time due to miscommunication. Andrey said "We need server logs before this state and memory dump if possible." meaning the heap dump should be captured after the deadlock happened.

lvca commented 8 years ago

Quick question: do you have a composite index on edges?

andrii0lomakin commented 8 years ago

@lvca if you look at thread dump , you will find that it has nothing to do with index key locks. One of the reason is that we do not acquire index key locks inside of transaction. It is not needed.

gerdhub commented 8 years ago

Yes, there is a composite index, one on in and out vertices, not unique and one on in, out + properties, which is unique. So there are several edges between two nodes.

gerdhub commented 8 years ago

@taburet: Maybe we should define the term "deadlock" - there is no deadlock in the "classical" sense, but all binary clients (BinaryClient Threads) of the Server are in java.lang.Thread.State: WAITING at com.orientechnologies.common.concur.lock.OReadersWriterSpinLock.acquireReadLock(OReadersWriterSpinLock.java:77); with Locked ownable synchronizers:

locked <6ebd4989> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
locked <771816d3> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

However you call it, the clients wait for a response of the server (which is not delivered), and the server does not allow any new (binary) client connections. Studio works as long as no queries are made.

The Server cannot be shut down cleanly in this state, as it will not react to the sigint.

That is all I can say...

gerdhub commented 8 years ago

Btw: I just noticed that the same happens also with only one binary client in one single thread.

andrii0lomakin commented 8 years ago

@gerdhub could you repeat the same for embedded database ? I think I am starting to guess what is the problem.

gerdhub commented 8 years ago

I will try and report later on.

taburet commented 8 years ago

@gerdhub Anyway, I can't find instances of the Thread class named like "BinaryClient" in the provided heap dump, so I can't inspect the state of the "locks" associated with them.

gerdhub commented 8 years ago

@taburet There are 2 thread dumps in gist https://gist.github.com/gerdhub/0ca623a6a46bbdba06fd, one for the client (first) and one for the server (second). In the second one, there are several

OrientDB - BinaryClient (/10.1.90.16:42874)" - Thread t@29
    java.lang.Thread.State: WAITING

taburet commented 8 years ago

@gerdhub Yep, that is the thread dump, but I'm talking about the heap dump :) All I can tell from the thread dumps is that for some reason one of the spin locks has some potentially invalid state so the busy waiting loop spins forever. It will be helpful to see the actual state of the spin lock from a heap dump, that may help to diagnose the problem.

gerdhub commented 8 years ago

@taburet Sorry, I got you wrong. I will try to get a new heap dump.

gerdhub commented 8 years ago

@laa Provided that one updates the jna reference to the newest version :) - embedded version seems to work like a charm until now, even with 8 concurrent threads... Do you have a clue? ;)

gerdhub commented 8 years ago

Sorry, but it is the same with embedded version, although it happens later. Please see the following thread dump: https://gist.github.com/gerdhub/fb5ecf21623bc28a0669845fb33bd398 I will send the heap dump via we transfer.

andrii0lomakin commented 8 years ago

@gerdhub we found reason of issue and started to fix it. Thank you very much for bug report , your help and patience.

gerdhub commented 8 years ago

You're welcome!

andrii0lomakin commented 8 years ago

@gerdhub @TeamXceleratorDev @ygoraly fixed. could you try ?

andrii0lomakin commented 8 years ago

On latest 2.1.x branch

RAndrews137 commented 8 years ago

Sure thing. Thanks :)

gerdhub commented 8 years ago

I cannot build 2.1.x branch to test on the db server:

[INFO] ------------------------------------------------------------------------
[INFO] Building OrientDB Server 2.1.15-SNAPSHOT
[INFO] ------------------------------------------------------------------------

...

2016-04-01 16:11:43:288 SEVER Internal server error:
java.lang.IllegalStateException: Pool factory is closed [ONetworkProtocolHttpDb]
TOTAL 30 - MAX 10 - GOOD 30 - BAD 0
2016-04-01 16:11:43:416 INFO  {db=httpclass} Creating database memory:httpclass [OServerCommandPostDatabase]
2016-04-01 16:11:43:518 WARNI ->127.0.0.1: Command not found: GET.commandNotfound [ONetworkProtocolHttpDb]
2016-04-01 16:11:43:523 WARNI ->127.0.0.1: Command not found: POST.listDatabases [ONetworkProtocolHttpDb]Tests run: 47, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 35.011 sec <<< FAILURE! - in TestSuite
batchUpdate(com.orientechnologies.orient.server.network.http.HttpBatchTest)  Time elapsed: 0.183 sec  <<< FAILURE!
java.lang.AssertionError: expected [200] but found [500]
    at com.orientechnologies.orient.server.network.http.HttpBatchTest.batchWithEmpty(HttpBatchTest.java:83)
    at com.orientechnologies.orient.server.network.http.HttpBatchTest.batchUpdate(HttpBatchTest.java:65)

testConnectAutoDisconnectKeepAlive(com.orientechnologies.orient.server.network.http.HttpConnectionTest)  Time elapsed: 14.71 sec  <<< FAILURE!
java.lang.AssertionError: expected [200] but found [500]
    at com.orientechnologies.orient.server.network.http.HttpConnectionTest.testConnectAutoDisconnect(HttpConnectionTest.java:83)
    at com.orientechnologies.orient.server.network.http.HttpConnectionTest.testConnectAutoDisconnectKeepAlive(HttpConnectionTest.java:57)

testConnectAutoDisconnectNoKeepAlive(com.orientechnologies.orient.server.network.http.HttpConnectionTest)  Time elapsed: 11.995 sec  <<< FAILURE!
java.lang.AssertionError: expected [200] but found [500]
    at com.orientechnologies.orient.server.network.http.HttpConnectionTest.testConnectAutoDisconnect(HttpConnectionTest.java:83)
    at com.orientechnologies.orient.server.network.http.HttpConnectionTest.testConnectAutoDisconnectNoKeepAlive(HttpConnectionTest.java:63)

Results :

Failed tests: 
  HttpBatchTest.batchUpdate:65->batchWithEmpty:83 expected [200] but found [500]
  HttpConnectionTest.testConnectAutoDisconnectKeepAlive:57->testConnectAutoDisconnect:83 expected [200] but found [500]
  HttpConnectionTest.testConnectAutoDisconnectNoKeepAlive:63->testConnectAutoDisconnect:83 expected [200] but found [500]

Tests run: 47, Failures: 3, Errors: 0, Skipped: 0

andrii0lomakin commented 8 years ago

hm that is latest 2.1.x build which is caused by this fix http://helios.orientdb.com/job/orientdb-2.1.x/164 .How did you build it ?

gerdhub commented 8 years ago

via

git clone https://github.com/orientechnologies/orientdb
cd orientdb/
git checkout 2.1.x
mvn clean install

andrii0lomakin commented 8 years ago

Well )) could you skip tests for this build we need only check specific fix. Any way issue in http client it will not affect us.

On Fri, Apr 1, 2016 at 6:06 PM Gerd Hübscher notifications@github.com wrote:

via

git clone https://github.com/orientechnologies/orientdb cd orientdb/ git checkout 2.1.x mvn clean install

— You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub https://github.com/orientechnologies/orientdb/issues/5821#issuecomment-204432061

Best regards, Andrey Lomakin, R&D lead. OrientDB Ltd

twitter:@Andrey_Lomakin linkedin:https://ua.linkedin.com/in/andreylomakin

gerdhub commented 8 years ago

Sure ;) I just wanted to make sure to have a valid build. I will give it a try and report later on.

gerdhub commented 8 years ago

Gentlemen, 2.1.x works with more than 14.000.000 edges and about 200.000 iterations over the above-mentioned code very well. Beforehand, the error occurred after about 20.000 iterations at the latest.

I think it is fixed. Thank you for your efforts!!

andrii0lomakin commented 8 years ago

I am really appreciate help of everyone who participated in fix of this issue. We will release new version today or tomorrow.

orientechnologies / orientdb

Server deadlock with multiple concurrent binary clients #5821