Closed nadestin closed 8 years ago
@nadestin I looked at the test which you provided, and I think you misunderstand the meaning of ACID. ACID properties work if an implementation does not have errors. But what you do, you break implementation by bytecode manipulation and you claim that if the code in database implementation has errors this erroneous code should still support ACID properties does not matter how it changed. @lvca I think that test which breaks database code to simulate error can not be treated as a critical bug. Obviously, ACID properties are implemented by code and we can not guarantee that if I break this code we will still have valid ACID properties.
@nadestin you throw NPE in .OByteBufferPool.acquireDirect() do you have argumentation why this code should throw NPE ? let suppose that in any RDBMS code which is responsible for rollback of data intentionally changed does it mean that rollback still should work ? Obviously, if you violate any data constraints and a system will throw exception data will be restored to initial point of view. but it does not mean that if we break infrastructure code as we wish , code (and ACID properties are implemented by code) still should work.
@nadestin I propose you to, for example, throw NPE on a cluster or on index manipulation methods if you want to emulate such behavior but I am sure that infrastructure properties work if and only if infrastructure code works correctly.
@nadestin why a NullPointerException from the com.orientechnologies.common.directmemory.OByteBufferPool.acquireDirect() ? Is it what you had in the OrientDB logs?
I believe that there is a misunderstanding. I have filled this bug report not out of curiosity. The example program shows a Real problem and Real data loss. This sample program is a test case that I made to clearly and undoubtedly reproduce a real problem. I am trying to give you a good starting point for debugging Transaction behavior in the OrientDB. I am trying to help you, not attack you.
Here is extract from the real Stack Trace that we see (unfortunately) almost every day:
Caused by: java.lang.Exception: Failed to commit transaction at com.riag.db.dbengine.DBConnection.commitTransaction(DrTax3_map:241) ~[DrTax3.jar:3.9.50 2016-09-09 17:55:21+0200] at com.riag.db.dbengine.skin.orientdb.OrientDbSkin.saveItemContentToStorage(OrientDbSkin.java:4039) ~[na:na] ... 38 common frames omitted Caused by: com.orientechnologies.orient.core.exception.OStorageException: Error during transaction commit DB name="DrTaxDB" DB name="DrTaxDB" at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_101] at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) ~[na:1.8.0_101] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) ~[na:1.8.0_101] at java.lang.reflect.Constructor.newInstance(Unknown Source) ~[na:1.8.0_101] at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.throwSerializedException(OChannelBinaryAsynchClient.java:428) ~[na:na] at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:379) ~[na:na] at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:261) ~[na:na] at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:153) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote.beginResponse(OStorageRemote.java:2133) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote$28.execute(OStorageRemote.java:1316) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote$28.execute(OStorageRemote.java:1289) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote$2.execute(OStorageRemote.java:198) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:230) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperation(OStorageRemote.java:195) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote.commit(OStorageRemote.java:1289) ~[na:na] at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:560) ~[na:na] at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:106) ~[na:na] at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2733) ~[na:na] at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2703) ~[na:na] at com.riag.db.dbengine.skin.orientdb.OrientDbSkin.commitTransaction(OrientDbSkin.java:924) ~[na:na] at com.riag.db.dbengine.skin.orientdb.OrientDbSkin.commitTransaction(OrientDbSkin.java:113) ~[na:na] at com.riag.db.dbengine.DBConnection.commitTransaction(DrTax3_map:185) ~[DrTax3.jar:3.9.50 2016-09-09 17:55:21+0200] ... 39 common frames omitted Caused by: java.lang.NullPointerException: null at com.orientechnologies.common.directmemory.OByteBufferPool.acquireDirect(OByteBufferPool.java:300) ~[na:na] at com.orientechnologies.orient.core.storage.cache.local.OWOWCache.cacheFileContent(OWOWCache.java:1281) ~[na:na] at com.orientechnologies.orient.core.storage.cache.local.OWOWCache.load(OWOWCache.java:636) ~[na:na] at com.orientechnologies.orient.core.storage.cache.local.twoq.O2QCache.updateCache(O2QCache.java:1111) ~[na:na] at com.orientechnologies.orient.core.storage.cache.local.twoq.O2QCache.doLoad(O2QCache.java:353) ~[na:na] at com.orientechnologies.orient.core.storage.cache.local.twoq.O2QCache.allocateNewPage(O2QCache.java:403) ~[na:na] at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperation.commitChanges(OAtomicOperation.java:419) ~[na:na] at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.endAtomicOperation(OAtomicOperationsManager.java:418) ~[na:na] at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.endStorageTx(OAbstractPaginatedStorage.java:3078) ~[na:na] at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.commit(OAbstractPaginatedStorage.java:1486) ~[na:na] at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:566) ~[na:na] at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:106) ~[na:na] at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2733) ~[na:na] at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2703) ~[na:na] at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.commit(ONetworkProtocolBinary.java:1358) ~[na:na] at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:581) ~[na:na] at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.sessionRequest(ONetworkProtocolBinary.java:318) ~[na:na] at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.execute(ONetworkProtocolBinary.java:194) ~[na:na] at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77) ~[na:na]
It is a bit long, but it clearly shows that source of exception was Line 300 in the class OByteBufferPool., in method acuireDirect().
That line looks like this:
final ByteBuffer db = bfh.buffer.duplicate();
Object bfh at that moment is definitely not null, otherwise exception would have been thrown much earlier. So it can only be member bfh.buffer that was not assigned (most likely due to problem allocating DirectMemory).
That is why I made a test case that simulates NPE exactly in this method. The only tiny difference that simulated exception is raised a bit earlier. Manipulating byte code was just a very simple and convenient way to simulate the problem. In the real application there are nor byte code manipulations at all.
The important point here is not why NPE happend. There is no way to guarantee that this or another exception will never be raised. Exception simple happens, that's life. The point is: Transaction was not rolled back as it should in that case.
Here I need to disagree with @laa when he assumes that I don't understand ACID principle. Please re-read ACID definition at least here: https://en.wikipedia.org/wiki/ACID
"A" stands for "Atomicy" and it is defined: (Start of citate) Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, then the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen. (End of citate)
Sorry, but not atomic transaction cannot be called transaction at all.
So, please, consider this bug report as an attempt to help you to fix a problem and make OrientDB better. I am on your side, and trying to be constructive. I am sure that you know your code better and you can fix the problem faster.
Hi @nadestin I am really sorry for the misunderstanding. I suppose we may fix your issue without significant effort. Then we may continue to discuss atomicity concept if you find it reasonable. I see that you run a server, do you run embedded server or standalone one?
Our application can run standalone and in the client/server mode.
In standalone mode OrientDB is embedded in the client. In the Client/Server mode it is embedded in the server.
@nadestin so just to double check , this error happens in Client/Server mode when your application is embedded inside OrientDB server. Is it correct ? Could you send JVM settings which you use to run server /
No, This error happens in both modes. We saw it on standalone client and also in the client server mode. And the problem is not NPE Exception. Problem is that after this exception transaction is not properly rolled back what leads to the data loss, as shown in my test case. Rollback behavior should be fixed, NPE is not important here.
@nadestin As I wrote above this method is part of mechanics which is responsible for rollback to make rollback work we need to fix code which is responsible for rollback. Once again atomicity of database does not mean that it will immune to any error because if code responsible for correct error handling is broken it can not work correctly. Could you send me JVM settings which you use to run server ?
In that particular case Server was running on 32bit Java, with
-Xmx512m -XX:MaxDirectMemorySize=512m
On a 32 Bit Java in Windows that is a Maximum what we can get.
@nadestin I see, what is disk cache size or you use default settings ?
Default settings
@nadestin I see now. I need time to think how to fix that problem without applying of special settings but for your case, we may resolve an issue quickly. Could you use following additional settings -Dstorage.diskCache.bufferSize=448 -Dmemory.chunk.size=58720256
If it works, I provide a detailed explanation of reason of issue.
Ok, since it is only System properties, it can be done during program start before OrientDB is loaded. I will make these changes for all 32bit clients and servers. It will take some time before I get a feedback, but consequence of this error is data loss, and that still worries me.
@nadestin It worries me too that is why we are going to provide a solution which will not require manual configuration. But first of all, I need to now that I identified issue correctly.
So the problem seems to reside in the 32-bit JVM, right?
Null pointer Exception during commit was observed only in 32bit JVM (yet), but inconsistent transaction rollback can be reproduced on any JVM.
@nadestin as I mentioned above if code responsible for rollback is broken obviously rollback can not happen. So claim that npe from any part of code should be correctly handled is incorrect because of we need to make rollbacks without NPE to make them to work correctly.
@nadestin @lvca the problem as I can see it is following. The system has several GB of memory. OrientDB detects that and uses disk cache with few GB of memory. But it may address only hundred of MB of memory and as result if it try to acquire more for it needs system fails.
One of the approaches to resolving this situation and any other similar once is to allocate all memory required by disk cache during system start in such case if there will be any problems with system settings this will be known before head and data will not be broken. Sure not an ideal solution but it will prevent surprises in the middle of execution of the system. And of course we will try to detect 32 JVM inside our system and decrease disk cache size, but I am not sure now if that is possible.
AFAIK, there are no bullet proof way to figure out if JVM is 32 bit or 64, but following code is covering most real world cases. For some exotic JVMs another way should be found.
public static int getJavaBitWidth() { // Figure out whether bit width of running JVM // Most of JREs support property "sun.arch.data.model" which is exactly what we need here String dataModel = System.getProperty("sun.arch.data.model", "32"); // By default assume 32bit int size = 32; try { size = Integer.parseInt(dataModel); } catch(Throwable t) { // Ignore } return size; }
By the way to @laa. The proposed parameters for 32bit JVM seem to work. I have performed tests with big databases that had crashed on 32bit before. With your settings for storage.diskCache.bufferSize and memory.chunk.size I was able to complete tests without errors. We are still waiting for response from the field, but that workaround looks promising.
Why don't you publish methods of calculating values in the Manual. When I install Application I usually know startup parameters, so I could properly initialize these properties without hacking around. Could you share how did you calculate values 448 and 58720256?
Hi @nadestin . A calculation is simple. The maximum amount of direct memory which you may allocate is 512MB I put 64MB for service needs which are more than enough and also split 448 MBs into 8 chunks. So I get 56 MB for each chunk or 58720256 bytes.
@nadestin once you confirm that issue is fixed I will provide fix.
Hello @laa. I confirm, with adjusted parameters we have no more cases of NPE during Direct Memory allocations on 32bit JVMs. (On 64bit JVMS it is working without parameter adjustments).
Your workaround had fixed the problem.
Thank you for your help.
Hi @nadestin , OK we will provide complete fix during next week. Thank you very much for your report and your time.
@nadestin could you build latest 2.2.x version and run it on the same case without any additional parameters ?
Hi @nadestin , I close this issue, please reopen issue if you see it again.
Sorry, I was on vacation last week, will try it and report status.
OrientDB Version, operating system, or hardware.
Operating System
Expected behavior and actual behavior
Assume Transaction that include changes in the separate blob cluster with ORecordBytes records. and changes in another ODocument record that has a field es linkedlist to these blob records. In that transaction two existing blobs were deleted and new one has been inserted. linked list field has been updated to point to the new blob and then transaction commit is called. During commit an NullPointer exception has been raised due to (whatever reason, it doesn't matter actually). Expected behavior would be that new blob record is not inserted, two old blob records are not deleted and ODocument record still points to the old two records.
Actual behavior: Old 2 Blob records are deleted. New blob record is (half inserted). You can see it if you ask "clusters" in the console but you don't see it if you run statement "select from cluster:blob_cluster". ODocument record still points to a non-existsing Blob records if you issue statement "select from #. In Java code attempt to get filed value returns a List with 2 null elements.
As a result we have a full loss of blob data instead of original values.
Steps to reproduce the problem
I have compiled a very simple test program that easily reproduces that problem.
If you run this program it will create new Database, create one class "TestClass" with 2 fields: itemId and blob. Field blob is a linked list. Additionally it creates new cluster called "TestClass_blob" for storing blobs in it.
method writeDataAndCheck() will add one record with itemId = "1" and blob that refers to a chunk1 (512Kb) and Chunk2 (256Kb). Transaction is committed and reloaded to test that all data were saved correctly.
The fun starts in the simulateErrorAndCheck() method.
Here it will load that testclass record into ODocument and then start transaction that replaces chunk1 and chunk2 with new chunk3 (1MB). In Line 264 you can control whether exception in the commit should be simulated or not. If exception is simulated it will printed to stderr and then database will be closed. and storage will be closed to release all in memory caches.
Then program tries to reopen database and read contents. You can see that if exception is happening during commit, contents of chunk1 and chunk 2 is lost. Chunk3 is actually stored in the cluster but it cannot be accessed anymore.
To me this is a showstopper bug. Blobs are lost without recovery. OrientTest.zip