Can't acquire lock on requested resource: local:${ORIENTDB_HOME}/databases/demo [moved]

lvca commented 11 years ago

This is Issue 287 moved from a Google Code project. Added by 2011-03-31T22:07:06.000Z by Anton.A....@gmail.com. Please review that bug for more context and additional comments, but update this bug. Closed (Fixed).

Original labels: Type-Defect, Priority-Medium, v1.0rc4

Original description

Luca, while developing PHP extension I once again step on message:
 com.orientechnologies.common.concur.lock.OLockException: Can't acquire lock on requested resource: local:${ORIENTDB_HOME}/databases/demo

I can't say it appears too frequently, but I spotted it at r2560 r2517 and earlier for sure!
I know, that this is not perfect, but I can't provide more details to you. I try to reproduce it with 100 000 new connections, but had no success. 
Please, look one more time on sources..

<b>Please use labels and text to provide additional information.</b>
1.0-rc1 r2560 
Ubuntu 10.10 x86
$ uname -a
Linux aterekhov-VirtualBox 2.6.35-27-generic #48-Ubuntu SMP Tue Feb 22 20:25:29 UTC 2011 i686 GNU/Linux
B
$ java -version
java version &quot;1.6.0_20&quot;
OpenJDK Runtime Environment (IcedTea6 1.9.7) (6b20-1.9.7-0ubuntu1)
OpenJDK Client VM (build 19.0-b09, mixed mode, sharing)

pellyadolfo commented 11 years ago

Hi, I am having this issue for months. Right now I am running OrientDB 1.3 in CentOS server. When the issue appears, after a common request, with no concurrency, database never runs again (for that query?) and I have to clear and reinstall ALL the data. The problem is that I cannot move forward because the issue always reappears for next release and I have to work in circles, install and reinstall, upgrade, hack and reinstall. I was thinking to make some mechanism to redeploy the system automatically because is really time consuming.

It would be great getting rid of this issue but I do not know how to reproduce it.

Here is the server log:

com.orientechnologies.common.concur.lock.OLockException: Cannot acquire lock on requested resource: remote:localhost/mydb at com.orientechnologies.common.concur.resource.OResourcePool.getResource(OResourcePool.java:44) at com.orientechnologies.orient.core.db.ODatabasePoolAbstract.acquire(ODatabasePoolAbstract.java:65) at com.orientechnologies.orient.core.db.ODatabasePoolAbstract.acquire(ODatabasePoolAbstract.java:53) at

I acquire the database (trying to use resources at the minimum to minimize problems) as:

private OGraphDatabase getDatabase() {
    if ((database == null) || database.isClosed()) {
        OGlobalConfiguration.CACHE_LEVEL1_ENABLED.setValue(false);
        OGlobalConfiguration.CACHE_LEVEL2_ENABLED.setValue(false);
        OGlobalConfiguration.DB_MVCC.setValue(false);
        database = OGraphDatabasePool.global().acquire("remote:localhost/mydb", "admin", "admin");
        database.setMVCC(false);
        database.getLevel1Cache().setEnable(false);
        database.getLevel2Cache().setEnable(false);
    }
    return database;
}

and this is how I invoke to my query:

    List<ODocument> list = getDatabase().query(new OSQLSynchQuery<ODocument>(query));

Any idea? I am having the issue intermittently for months and does not seem to get solved. Is there any way to investigate the status of the pool/release the pool from outside? Maybe programmatically? Any option to add log levels on connection acquisition for future to get more debug info on the root cause? (because I cannot reproduce this error in my dev environment).

pellyadolfo commented 11 years ago

There is an strange symptom in my current deployment which is that, after the failure happened, some queries still work and some queries does not. So, it looks to be a coupling between the query string and the pool to get the connection. (Are connections cached per query?) But in this case, queries that work say: "WARNING: Connection re-acquired transparently after 1005ms and 2 retries: no errors will be thrown at application level".

Should I just redeploy till the next time? Should I upgrade latest sources to surely get the same?

pellyadolfo commented 11 years ago

It looks that killing and launching again tomcat, which interfaces to OrientDB, it works again. It like the pool gets restored. Happily this confirms that is a problem on client side pool, not with data. Any idea?

lvca commented 11 years ago

Have you seen this problem again recently?

pellyadolfo commented 11 years ago

this problem is a pain in version 1.3, at least for me. I have to restart my Tomcat instance everyday (which invokes OrientDB thru connection pool) because something in the client side of the connection pool gets stalled. Aprox. 20% of requests force me to restart Tomcat so I is really blocking me in 1.3 in a test-like environment.

However, I have not tested enough 1.4 to know if it happens. Actually in my test to verify the queries in 1.4, so far, it did not happen. As I do not see anyone else complaining about it I am assuming it is (hopefully) fixed because I also know you did quite a few work also in the connection pool.

So really looking forward 1.4 only because this issue.

lvca commented 11 years ago

So I'm closing it. In case reopen this one ;-)

pellyadolfo commented 11 years ago

yep, thanks

pellyadolfo commented 11 years ago

After deploying latest snapshot today I found again the same issue. com.orientechnologies.common.concur.lock.OLockException: Cannot acquire lock on requested resource: remote:localhost/mydb

So, I did a change on my code. Previously I was creating an instance of pool in different DAOs in my code. I moved these pool definitions up in the hierarchy of my DAOs and defined an unique static instance shared by all DAOs. Now it seems to work.

Therefore, the problem is that OrientDB is not able to deal with several pool instances at the same time, I think. But, by creating an unique static instance everything seems to be ok.

lvca commented 11 years ago

Have you set the client.channel.maxPool to a higher value? These are the TCP/IP sockets used by the client to connect to the server.

pellyadolfo commented 11 years ago

No, finally what I did was declaring an unique pool for all my apps (as an static instance) and leave connection per request (obviously). In this way it works. I think there is some issue if you define multiple pool instances but I have not tested as I made it work in that way.

Do not know if there is a wiki page about how is the connection pool is expected to be managed but I think would be helpful.

pellyadolfo commented 11 years ago

Anyway the issue is still happening to me. Not as often as before (making the pool static reduced the number of crashes) but still happening. The point is that it happens only on CentOS, not in Eclipse or Win7. Cannot reproduce the test case as it looks random. I do not discard to be doing something "wrong" with the connection pool, for example using a different pool for each Tomcat webapp (?).

I do not have any concurrency to justify moving the client.channel.maxPool margings.

The exception is always in this line and happens randomly:

        OGraphDatabase database = pool.acquire("remote:localhost/onto", "admin", "admin");

and looks like:

        com.orientechnologies.common.concur.lock.OLockException: Cannot acquire lock on requested resource:                 remote:localhost/onto
        at com.orientechnologies.common.concur.resource.OResourcePool.getResource(OResourcePool.java:44)
        at com.orientechnologies.orient.core.db.ODatabasePoolAbstract.acquire(ODatabasePoolAbstract.java:67)
        at com.orientechnologies.orient.core.db.ODatabasePoolAbstract.acquire(ODatabasePoolAbstract.java:54)
        at com.orientechnologies.orient.core.db.ODatabasePoolBase.acquire(ODatabasePoolBase.java:115)
        at ....

Is there any way to log the number of connections available/taken/idle..., status... from the pool? I would expect some method in OGraphDatabasePool. Just to add some logging before opening to understand.

Unfortunately there is not any additional log to know what is happening on CentOS.

lvca commented 11 years ago

In order to use better resources and reduce delays between concurrent requestes you've to increase the client.channel.maxPool that by default is 5. Then how the connection pool is sized?

pellyadolfo commented 11 years ago

As commented above there is NO concurrency in my system, just me, testing. Just in case I have added

        OGlobalConfiguration.CLIENT_CHANNEL_MAX_POOL.setValue(500);

to see if it changes but it does not make sense to me changing the pool size with no concurrency of requests.

The pool is sized by default.

private OGraphDatabase getDatabase() {
    if ((database == null) || database.isClosed()) {
        OGlobalConfiguration.CACHE_LEVEL1_ENABLED.setValue(false);
        OGlobalConfiguration.CACHE_LEVEL2_ENABLED.setValue(false);
        OGlobalConfiguration.DB_MVCC.setValue(false);
        OGlobalConfiguration.CLIENT_CHANNEL_MAX_POOL.setValue(500);
        if (pool == null)
            pool = OGraphDatabasePool.global();
        database = pool.acquire("remote:localhost/onto", "admin", "admin");
        database.setMVCC(false);
        database.getLevel1Cache().setEnable(false);
        database.getLevel2Cache().setEnable(false);
    }
    return database;
}

I would update here if adding CLIENT_CHANNEL_MAX_POOL to 500 changes anything.

Update: confirmed, same problem - Cannot acquire lock on requested resource: remote:localhost/onto at 5th request in 30 minutes after restarting tomcat.

pellyadolfo commented 11 years ago

I have also added:

        OGlobalConfiguration.CLIENT_CONNECT_POOL_WAIT_TIMEOUT.setValue(500);
        OGlobalConfiguration.CLIENT_DB_RELEASE_WAIT_TIMEOUT.setValue(500);

because in the code of getResource there is a iMaxWaitMillis parameter

public V getResource(K iKey, final long iMaxWaitMillis, Object... iAdditionalArgs) throws OLockException {
// First, get permission to take or create a resource
try {
  if (!sem.tryAcquire(iMaxWaitMillis, TimeUnit.MILLISECONDS))
    throw new OLockException("Cannot acquire lock on requested resource: " + iKey);
} catch (InterruptedException e) {
  Thread.currentThread().interrupt();
  throw new OLockException("Cannot acquire lock on requested resource: " + iKey, e);
}
.......
}

but not luck, same issue: Cannot acquire lock on requested resource: remote:localhost/onto

lvca commented 11 years ago

Seems the pool is exhausted in some way. Are you sure you ALWAYS close the database instance? Can you post the code?

pellyadolfo commented 11 years ago

Well, this why I was asking if there is any way to know how many databases are available in the pool before requesting a new one or verify database status after opening or some related check.

Code to close database is called in a polimorphism so cannot really paste the full flow:

protected void post() {
    if ((database != null) && !database.isClosed())
        database.close();
    //pool.close();
}

I commented out the line to close the pool so the full app uses the same pool but not change.

lvca commented 11 years ago

Are you sure the close is called in a try{}finally{} block?

pellyadolfo commented 11 years ago

yes, always in finally

pellyadolfo commented 11 years ago

Following your comment about exhausted connections, I did a change on some nested calls that I have an that, potentially could be leaving open connections, and for the moment it seems to work..... Maybe that was the problem, hopefully.

If so, maybe a error message as: "Not more connections available", where appropriated, could save days of headache.

Update: confirmed. My nested calls where not passing by the finally. It works now. What a relief!!!

Thanks

lvca commented 11 years ago

Just done:

"Not more resources available in pool. Requested resource: xxxxx"

That pool is generic and "resource" is more appropriated than connection.

pellyadolfo commented 11 years ago

What about notifying the number of resources (connections) available and also the number taken: something as

"10 of 10 resources has been taken. No more resources available in pool...".

it would allow distinguishing this error from another error sharing the same message. For example it is says

"9 of 10 resources has been taken. No more resources available in pool...".

Obviously there is something else.

Just an idea in case these 2 numbers are available at the point of raising the exception.

Well, just ignore it. I see the code (https://github.com/nuvolabase/orientdb/commit/4b0c4c623bdf02099ecced981bf95b7df5931519) and they are not available at the point of raising the exception. The risk is that this exception can be raised by more reasons inside tryAcquire() (e.g. a NPE), and message can be misleading.

orientechnologies / orientdb

Can't acquire lock on requested resource: local:${ORIENTDB_HOME}/databases/demo [moved] #287

Original description