orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0

Remote Orientdb reaching 100% Memory #4890

Closed sajid2045 closed 9 years ago

sajid2045 commented 9 years ago

Please find the latest code & heapdumps here:

https://dl.dropboxusercontent.com/u/5968302/orient-errors.zip

After running load tests for a long time, orient-server seems to reach 100% memory. I have included the heap dump and related files for you to check.

valeriotarenghi commented 9 years ago

We have picked this up and are working on it. We will provide feedback soon.

tglman commented 9 years ago

Hi @sajid2045,

I checked the dumps and I don't see any memory leak. Memory usage is high, but most of it could be garbage collected, and most of it was allocated by the profiler and the debug logging.

Did you get any OutOfMemoryError?

In any case, high memory usage after a load test is normal, especially with profiling and logging enabled.

sajid2045 commented 9 years ago

Hi,

Requests that normally take 30 ms were taking 6000 ms. All I did was run 10 threads for 1 hour, making about 18,000+ requests. The database had only 10,000 nodes, so it is not expected to behave that way at all. Even worse, the response time stayed at 4000+ ms even after I stopped the load test. I had to restart the DB, and it went back to 30 ms as before. So the memory was definitely not being collected.

As you can see, this is really unacceptable: we are very unlikely to be able to restart a production DB.

-Sajid.

tglman commented 9 years ago

Hi @sajid2045,

I'll try to run the test to reproduce the delay. From the dump, the strongly retained memory is 81 MB, which should not be the cause of your delay, but I will check it.

seeden commented 9 years ago

I can see this behavior in production. When I run the import script with 100,000 records, OrientDB gradually consumes more memory and the insert command gets slower and slower.

tglman commented 9 years ago

Hi,

Version 2.1.1 is out; it would be great if you could try that version as well.

Looking at the latest code, I saw that you now always use a non-transactional (NoTx) database. This is OK, but it is not recommended for creating edges between vertices. Is there any specific reason for that?

sajid2045 commented 9 years ago

I can see this is consistent behavior: I can run my load test for 2 hours and OrientDB will go to 100% memory and sometimes will not even respond. The server stays in the same state even after I stop the load test, and I have to restart it.

Also, after running overnight, I got this exception from the client. The interesting point is that /usr/local/graphdb/default/databases/subscription-service/ is located on the server, and I am definitely using 'remote' to connect from the client:

[2015-09-01 09:51:37,359] [get test] [ERROR] [org.mule.exception.AbstractExceptionListener:319] [logException] [serviceName:SubscriptionService]=>


Message : Failed to invoke au.com.foxsports.subscription.service.SubscriptionServiceImpl@6439a027. Message payload is of type: String
Type : org.mule.api.MessagingException
Code : MULE_ERROR-29999
Payload : test
JavaDoc : http://www.mulesoft.org/docs/site/current3/apidocs/org/mule/api/MessagingException.html


Exception stack is:

  1. File '/usr/local/graphdb/default/databases/subscription-service/database.ocf' is locked by another process, maybe the database is in use by another process. Use the remote mode with a OrientDB server to allow multiple access to the same database. (com.orientechnologies.common.concur.lock.OLockException) com.orientechnologies.orient.core.storage.fs.OFileClassic:713 (null)
  2. Cannot load database's configuration. The database seems to be corrupted. (com.orientechnologies.orient.core.exception.OSerializationException) com.orientechnologies.orient.core.storage.impl.local.OStorageConfigurationSegment:84 (null)
  3. Cannot open local storage '/usr/local/graphdb/default/databases/subscription-service' with mode=rw (com.orientechnologies.orient.core.exception.OStorageException) com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage:244 (null)
  4. Failed to invoke au.com.foxsports.subscription.service.SubscriptionServiceImpl@6439a027. Message payload is of type: String (org.mule.api.MessagingException) org.mule.processor.InvokerMessageProcessor:178 (http://www.mulesoft.org/docs/site/current3/apidocs/org/mule/api/MessagingException.html)

Root Exception stack trace:
com.orientechnologies.common.concur.lock.OLockException: File '/usr/local/graphdb/default/databases/subscription-service/database.ocf' is locked by another process, maybe the database is in use by another process. Use the remote mode with a OrientDB server to allow multiple access to the same database.
    at com.orientechnologies.orient.core.storage.fs.OFileClassic.lock(OFileClassic.java:713)
    at com.orientechnologies.orient.core.storage.fs.OFileClassic.openChannel(OFileClassic.java:770)
    at com.orientechnologies.orient.core.storage.fs.OFileClassic.open(OFileClassic.java:552)
    at com.orientechnologies.orient.core.storage.impl.local.OSingleFileSegment.open(OSingleFileSegment.java:51)
    at com.orientechnologies.orient.core.storage.impl.local.OStorageConfigurationSegment.load(OStorageConfigurationSegment.java:64)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.open(OAbstractPaginatedStorage.java:187)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.open(ODatabaseDocumentTx.java:249)
    at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:724)
    at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.openDatabase(ONetworkProtocolBinary.java:780)
    at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:289)
    at com.orientechnologies.orient.server.network.protocol.binary.OBinaryNetworkProtocolAbstract.execute(OBinaryNetworkProtocolAbstract.java:223)
    at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77)


sajid2045 commented 9 years ago

Also, the client does not recover from a database reset; I had to restart the clients too!

Caused by: com.orientechnologies.common.io.OIOException: Error on connecting to fsasydgrhdb01.foxsports.com.au:2424/subscription-service
    at com.orientechnologies.orient.client.remote.ORemoteConnectionManager.createNetworkConnection(ORemoteConnectionManager.java:246) ~[orientdb-client-2.1.0.jar:2.1.0]
    at com.orientechnologies.orient.client.remote.ORemoteConnectionManager$1.createNewResource(ORemoteConnectionManager.java:80) ~[orientdb-client-2.1.0.jar:2.1.0]
    at com.orientechnologies.orient.client.remote.ORemoteConnectionManager$1.createNewResource(ORemoteConnectionManager.java:77) ~[orientdb-client-2.1.0.jar:2.1.0]
    at com.orientechnologies.common.concur.resource.OResourcePool.getResource(OResourcePool.java:94) ~[orientdb-core-2.1.0.jar:2.1.0]
    at com.orientechnologies.orient.client.remote.ORemoteConnectionManager.acquire(ORemoteConnectionManager.java:101) ~[orientdb-client-2.1.0.jar:2.1.0]
    at com.orientechnologies.orient.client.remote.OStorageRemote.getAvailableNetwork(OStorageRemote.java:2103) ~[orientdb-client-2.1.0.jar:2.1.0]
    ... 178 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.7.0_60]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) ~[?:1.7.0_60]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) ~[?:1.7.0_60]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) ~[?:1.7.0_60]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.7.0_60]
    at java.net.Socket.connect(Socket.java:579) ~[?:1.7.0_60]
    at com.orientechnologies.orient.enterprise.channel.binary.OChannelBinaryAsynchClient.<init>(OChannelBinaryAsynchClient.java:83) ~[orientdb-enterprise-2.1.0.jar:2.1.0]
    at com.orientechnologies.orient.client.remote.ORemoteConnectionManager.createNetworkConnection(ORemoteConnectionManager.java:233) ~[orientdb-client-2.1.0.jar:2.1.0]
    at com.orientechnologies.orient.client.remote.ORemoteConnectionManager$1.createNewResource(ORemoteConnectionManager.java:80) ~[orientdb-client-2.1.0.jar:2.1.0]
    at com.orientechnologies.orient.client.remote.ORemoteConnectionManager$1.createNewResource(ORemoteConnectionManager.java:77) ~[orientdb-client-2.1.0.jar:2.1.0]
    at com.orientechnologies.common.concur.resource.OResourcePool.getResource(OResourcePool.java:94) ~[orientdb-core-2.1.0.jar:2.1.0]
    at com.orientechnologies.orient.client.remote.ORemoteConnectionManager.acquire(ORemoteConnectionManager.java:101) ~[orientdb-client-2.1.0.jar:2.1.0]
    at com.orientechnologies.orient.client.remote.OStorageRemote.getAvailableNetwork(OStorageRemote.java:2103) ~[orientdb-client-2.1.0.jar:2.1.0]

sajid2045 commented 9 years ago

@tglman Hi,

"The 2.1.1 is out, will be cool if you can try also with that version.

Checking the last code i saw that now you use every time no tx database, this is ok, but it's not suggested for creating edges between vertex, is it there any specific reason for that?"

We have Topic <----- Subscriber, and we are adding and removing subscribers from multiple threads. Using NoTx seems to avoid incrementing the version of Topic, which lets us avoid the concurrent modification error. If I bring back Tx, the version is incremented and we see many concurrent modification exceptions in load tests.

tglman commented 9 years ago

Hi @all,

Regarding performance, I'm running your tests. I ran SubscriptionServiceImplTest#testTopicSubscribers with the number of subscribers set to 100000, and then extracted these two queries from your code:

select in('subscribe').deviceId,in('ignore').out('user-device').deviceId from (select from Root where name = "cricket")

select in('subscribe').deviceId,in('subscribe').out('user-device').deviceId from (select from Root where name = "cricket")

I ran them against the server while the test was running, and the response time was around 0.05 s. Do you have any other query to test? Do I need to run any other population test to reproduce the problem?

Regarding the error "File '/usr/local/graphdb/default/databases/subscription-service/database.ocf' is locked by another process, maybe the database is in use by another process.": double-check that the server process has fully terminated before starting another server.

Using NoTx for graphs is OK for batch operations such as imports, but it is not recommended in a live application, because edge creation is a multi-record operation that needs a transaction to guarantee consistency. In case of a concurrent modification exception, your code should retry the operation.
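
The retry advice can be sketched as a small helper. This is only an illustration under stated assumptions: `ConflictException` is a hypothetical stand-in for OrientDB's concurrent modification exception, and `withRetry`, the attempt limit, and the backoff delay are names and values invented for this sketch, not part of the OrientDB API. In a real application the operation passed in would open the transaction, create the edge, and commit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryOnConflict {

    /** Hypothetical stand-in for the database's concurrent modification exception. */
    public static class ConflictException extends RuntimeException {
        public ConflictException(String message) {
            super(message);
        }
    }

    /**
     * Runs the operation and retries it when it fails with a conflict,
     * giving up and rethrowing the last conflict after maxAttempts tries.
     */
    public static <T> T withRetry(int maxAttempts, Callable<T> operation) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // In real code: begin transaction, modify records, commit.
                return operation.call();
            } catch (ConflictException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Short, growing pause so competing writers are less likely to collide again.
                    Thread.sleep(10L * attempt);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated operation: conflicts on the first two attempts, succeeds on the third.
        String result = withRetry(5, () -> {
            if (calls.incrementAndGet() < 3) {
                throw new ConflictException("record version mismatch");
            }
            return "edge created";
        });
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

Only the conflict exception triggers a retry; any other failure propagates immediately, so genuine errors are not masked by the loop.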

tglman commented 9 years ago

One important point: today the enterprise monitor shows the amount of heap allocated on the machine relative to the maximum allocatable heap (the one set with -Xmx), not the amount of heap actually in use. So after a load test the JVM has allocated 100% of the possible heap and the monitor shows that, but that heap may not actually be in use. We are working on showing the actual amount of used heap in the next release.
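
To see the difference between allocated and used heap on any JVM, the standard `java.lang.management` API reports both figures. The sketch below is a generic JVM illustration, not the enterprise monitor's code; the class name `HeapReport` and its helper methods are made up for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapReport {

    /** Returns the current heap figures reported by the JVM. */
    public static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    /** Percentage of whole taken by part, or -1 when whole is undefined (not positive). */
    public static double percent(long part, long whole) {
        return whole <= 0 ? -1.0 : 100.0 * part / whole;
    }

    public static void main(String[] args) {
        MemoryUsage heap = heap();
        long mb = 1024L * 1024L;
        long used = heap.getUsed();           // heap occupied by objects, including garbage not yet collected
        long committed = heap.getCommitted(); // heap the JVM has reserved from the operating system
        long max = heap.getMax();             // upper limit (-Xmx), or -1 if undefined

        System.out.printf("used:      %d MB%n", used / mb);
        System.out.printf("committed: %d MB%n", committed / mb);
        System.out.printf("max:       %d MB%n", max / mb);
        // A monitor that compares committed to max can show 100% while used stays much lower.
        System.out.printf("committed/max: %.1f%%%n", percent(committed, max));
        System.out.printf("used/max:      %.1f%%%n", percent(used, max));
    }
}
```

Even the used figure includes objects that are unreachable but not yet collected, so it is best read after a garbage collection when checking for a leak.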

That said, it is more important to find the cause of the slowdown after the load test.

machinelearner commented 9 years ago

@sajid2045 In general, we have been observing very similar issues, with nodes freezing and restarts needed on both the client and the remote server. This is useful, thanks.

On the other hand, we've used a different way of dealing with edge additions changing the vertex document version. There is a way to configure the conflict strategy:

ALTER DATABASE CONFLICTSTRATEGY content

We've followed this since the 1.7.* snapshots and in later versions, and it has been working fairly well.

You can read more about it here: http://orientdb.com/docs/last/SQL-Alter-Database.html

@tglman Please correct me if this is no longer applicable.

sajid2045 commented 9 years ago

Don't see this on 2.1.2 so far. Closing it.