orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0
4.74k stars 870 forks source link

Transaction rollback is inconsistent if exception raised during commit #6702

Closed nadestin closed 8 years ago

nadestin commented 8 years ago

OrientDB Version, operating system, or hardware.

Assume Transaction that include changes in the separate blob cluster with ORecordBytes records. and changes in another ODocument record that has a field es linkedlist to these blob records. In that transaction two existing blobs were deleted and new one has been inserted. linked list field has been updated to point to the new blob and then transaction commit is called. During commit an NullPointer exception has been raised due to (whatever reason, it doesn't matter actually). Expected behavior would be that new blob record is not inserted, two old blob records are not deleted and ODocument record still points to the old two records.

Actual behavior: Old 2 Blob records are deleted. New blob record is (half inserted). You can see it if you ask "clusters" in the console but you don't see it if you run statement "select from cluster:blob_cluster". ODocument record still points to a non-existsing Blob records if you issue statement "select from #. In Java code attempt to get filed value returns a List with 2 null elements.

As a result we have a full loss of blob data instead of original values.

Steps to reproduce the problem

I have compiled a very simple test program that easily reproduces that problem.

  1. Extract attached zip into a empty directory. (it contains 2 Eclipse Maven projects (NPESimulatorAgent and OOMTest)
  2. Import bothe projects into empty eclipse workspace.
  3. Run Maven "package" on the NPESimulatorAgent project. This will generate Java agent.jar that can throw NullPointerException from the com.orientechnologies.common.directmemory.OByteBufferPool.acquireDirect() method. NPESimulatorAgent.jar should appear in the NPESimulatorAgent directory.
  4. Now open class OutOufMemoryTest in the project OOMTest. This class has a main method and it can be launched using LaunchConfiguration OutOfMemoryTest.launch.

If you run this program it will create new Database, create one class "TestClass" with 2 fields: itemId and blob. Field blob is a linked list. Additionally it creates new cluster called "TestClass_blob" for storing blobs in it.

method writeDataAndCheck() will add one record with itemId = "1" and blob that refers to a chunk1 (512Kb) and Chunk2 (256Kb). Transaction is committed and reloaded to test that all data were saved correctly.

The fun starts in the simulateErrorAndCheck() method.

Here it will load that testclass record into ODocument and then start transaction that replaces chunk1 and chunk2 with new chunk3 (1MB). In Line 264 you can control whether exception in the commit should be simulated or not. If exception is simulated it will printed to stderr and then database will be closed. and storage will be closed to release all in memory caches.

Then program tries to reopen database and read contents. You can see that if exception is happening during commit, contents of chunk1 and chunk 2 is lost. Chunk3 is actually stored in the cluster but it cannot be accessed anymore.

To me this is a showstopper bug. Blobs are lost without recovery. OrientTest.zip

andrii0lomakin commented 8 years ago

@nadestin I looked at the test which you provided, and I think you misunderstand the meaning of ACID. ACID properties work if an implementation does not have errors. But what you do, you break implementation by bytecode manipulation and you claim that if the code in database implementation has errors this erroneous code should still support ACID properties does not matter how it changed. @lvca I think that test which breaks database code to simulate error can not be treated as a critical bug. Obviously, ACID properties are implemented by code and we can not guarantee that if I break this code we will still have valid ACID ​properties.

andrii0lomakin commented 8 years ago

@nadestin you throw NPE in .OByteBufferPool.acquireDirect() do you have argumentation why this code should throw NPE ? let suppose that in any RDBMS code which is responsible for rollback of data intentionally changed does it mean that rollback still should work ? Obviously, if you violate any data constraints and a system will throw exception data will be restored to initial point of view. but it does not mean that if we break infrastructure code as we wish , code (and ACID properties are implemented by code) still should work.

andrii0lomakin commented 8 years ago

@nadestin I propose you to, for example, throw NPE on a cluster or on index manipulation methods if you want to emulate such behavior but I am sure that infrastructure properties work if and only if infrastructure code works correctly.

lvca commented 8 years ago

@nadestin why a NullPointerException from the com.orientechnologies.common.directmemory.OByteBufferPool.acquireDirect() ? Is it what you had in the OrientDB logs?

nadestin commented 8 years ago

I believe that there is a misunderstanding. I have filled this bug report not out of curiosity. The example program shows a Real problem and Real data loss. This sample program is a test case that I made to clearly and undoubtedly reproduce a real problem. I am trying to give you a good starting point for debugging Transaction behavior in the OrientDB. I am trying to help you, not attack you.

Here is extract from the real Stack Trace that we see (unfortunately) almost every day:

Caused by: java.lang.Exception: Failed to commit transaction at com.riag.db.dbengine.DBConnection.commitTransaction(DrTax3_map:241) ~[DrTax3.jar:3.9.50 2016-09-09 17:55:21+0200] at com.riag.db.dbengine.skin.orientdb.OrientDbSkin.saveItemContentToStorage(OrientDbSkin.java:4039) ~[na:na] ... 38 common frames omitted Caused by: com.orientechnologies.orient.core.exception.OStorageException: Error during transaction commit DB name="DrTaxDB" DB name="DrTaxDB" at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_101] at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) ~[na:1.8.0_101] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) ~[na:1.8.0_101] at java.lang.reflect.Constructor.newInstance(Unknown Source) ~[na:1.8.0_101] at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.throwSerializedException(OChannelBinaryAsynchClient.java:428) ~[na:na] at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:379) ~[na:na] at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:261) ~[na:na] at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:153) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote.beginResponse(OStorageRemote.java:2133) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote$28.execute(OStorageRemote.java:1316) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote$28.execute(OStorageRemote.java:1289) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote$2.execute(OStorageRemote.java:198) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:230) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperation(OStorageRemote.java:195) ~[na:na] at com.orientechnologies.orient.client.remote.OStorageRemote.commit(OStorageRemote.java:1289) ~[na:na] at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:560) ~[na:na] at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:106) ~[na:na] at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2733) ~[na:na] at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2703) ~[na:na] at com.riag.db.dbengine.skin.orientdb.OrientDbSkin.commitTransaction(OrientDbSkin.java:924) ~[na:na] at com.riag.db.dbengine.skin.orientdb.OrientDbSkin.commitTransaction(OrientDbSkin.java:113) ~[na:na] at com.riag.db.dbengine.DBConnection.commitTransaction(DrTax3_map:185) ~[DrTax3.jar:3.9.50 2016-09-09 17:55:21+0200] ... 39 common frames omitted Caused by: java.lang.NullPointerException: null at com.orientechnologies.common.directmemory.OByteBufferPool.acquireDirect(OByteBufferPool.java:300) ~[na:na] at com.orientechnologies.orient.core.storage.cache.local.OWOWCache.cacheFileContent(OWOWCache.java:1281) ~[na:na] at com.orientechnologies.orient.core.storage.cache.local.OWOWCache.load(OWOWCache.java:636) ~[na:na] at com.orientechnologies.orient.core.storage.cache.local.twoq.O2QCache.updateCache(O2QCache.java:1111) ~[na:na] at com.orientechnologies.orient.core.storage.cache.local.twoq.O2QCache.doLoad(O2QCache.java:353) ~[na:na] at com.orientechnologies.orient.core.storage.cache.local.twoq.O2QCache.allocateNewPage(O2QCache.java:403) ~[na:na] at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperation.commitChanges(OAtomicOperation.java:419) ~[na:na] at com.orientechnologies.orient.core.storage.impl.local.paginated.atomicoperations.OAtomicOperationsManager.endAtomicOperation(OAtomicOperationsManager.java:418) ~[na:na] at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.endStorageTx(OAbstractPaginatedStorage.java:3078) ~[na:na] at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.commit(OAbstractPaginatedStorage.java:1486) ~[na:na] at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:566) ~[na:na] at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:106) ~[na:na] at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2733) ~[na:na] at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2703) ~[na:na] at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.commit(ONetworkProtocolBinary.java:1358) ~[na:na] at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:581) ~[na:na] at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.sessionRequest(ONetworkProtocolBinary.java:318) ~[na:na] at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.execute(ONetworkProtocolBinary.java:194) ~[na:na] at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:77) ~[na:na]

It is a bit long, but it clearly shows that source of exception was Line 300 in the class OByteBufferPool., in method acuireDirect().

That line looks like this:

final ByteBuffer db = bfh.buffer.duplicate();

Object bfh at that moment is definitely not null, otherwise exception would have been thrown much earlier. So it can only be member bfh.buffer that was not assigned (most likely due to problem allocating DirectMemory).

That is why I made a test case that simulates NPE exactly in this method. The only tiny difference that simulated exception is raised a bit earlier. Manipulating byte code was just a very simple and convenient way to simulate the problem. In the real application there are nor byte code manipulations at all.

The important point here is not why NPE happend. There is no way to guarantee that this or another exception will never be raised. Exception simple happens, that's life. The point is: Transaction was not rolled back as it should in that case.

Here I need to disagree with @laa when he assumes that I don't understand ACID principle. Please re-read ACID definition at least here: https://en.wikipedia.org/wiki/ACID

"A" stands for "Atomicy" and it is defined: (Start of citate) Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, then the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen. (End of citate)

Sorry, but not atomic transaction cannot be called transaction at all.

So, please, consider this bug report as an attempt to help you to fix a problem and make OrientDB better. I am on your side, and trying to be constructive. I am sure that you know your code better and you can fix the problem faster.

andrii0lomakin commented 8 years ago

Hi @nadestin I am really sorry for the misunderstanding. I suppose we may fix your issue without significant effort. Then we may continue to discuss atomicity concept if you find it reasonable. I see that you run a server, do you run embedded server or standalone one?

nadestin commented 8 years ago

Our application can run standalone and in the client/server mode.

In standalone mode OrientDB is embedded in the client. In the Client/Server mode it is embedded in the server.

andrii0lomakin commented 8 years ago

@nadestin so just to double check , this error happens in Client/Server mode when your application is embedded inside OrientDB server. Is it correct ? Could you send JVM settings which you use to run server /

nadestin commented 8 years ago

No, This error happens in both modes. We saw it on standalone client and also in the client server mode. And the problem is not NPE Exception. Problem is that after this exception transaction is not properly rolled back what leads to the data loss, as shown in my test case. Rollback behavior should be fixed, NPE is not important here.

andrii0lomakin commented 8 years ago

@nadestin As I wrote above this method is part of mechanics which is responsible for rollback to make rollback work we need to fix code which is responsible for rollback. Once again atomicity of database does not mean that it will immune to any error because if code responsible for correct error handling is broken it can not work correctly. Could you send me JVM settings which you use to run server ?

nadestin commented 8 years ago

In that particular case Server was running on 32bit Java, with

-Xmx512m -XX:MaxDirectMemorySize=512m

On a 32 Bit Java in Windows that is a Maximum what we can get.

andrii0lomakin commented 8 years ago

@nadestin I see, what is disk cache size or you use default settings ?

nadestin commented 8 years ago

Default settings

andrii0lomakin commented 8 years ago

@nadestin I see now. I need time to think how to fix that problem without applying of special settings but for your case, we may resolve an issue quickly. Could you use following additional settings -Dstorage.diskCache.bufferSize=448 -Dmemory.chunk.size=58720256 If it works, I provide a detailed explanation of reason of issue.

nadestin commented 8 years ago

Ok, since it is only System properties, it can be done during program start before OrientDB is loaded. I will make these changes for all 32bit clients and servers. It will take some time before I get a feedback, but consequence of this error is data loss, and that still worries me.

andrii0lomakin commented 8 years ago

@nadestin It worries me too that is why we are going to provide a solution which will not require manual configuration. But first of all, I need to now that I identified issue correctly.

lvca commented 8 years ago

So the problem seems to reside in the 32-bit JVM, right?

nadestin commented 8 years ago

Null pointer Exception during commit was observed only in 32bit JVM (yet), but inconsistent transaction rollback can be reproduced on any JVM.

andrii0lomakin commented 8 years ago

@nadestin as I mentioned above if code responsible for rollback is broken obviously rollback can not happen. So claim that npe from any part of code should be correctly handled is incorrect because of we need to make rollbacks without NPE to make them to work correctly.

andrii0lomakin commented 8 years ago

@nadestin @lvca the problem as I can see it is following. The system has several GB of memory. OrientDB detects that and uses disk cache with few GB of memory. But it may address only hundred of MB of memory and as result if it try to acquire more for it needs system fails.

One of the approaches to resolving this situation and any other similar once is to allocate all memory required by disk cache during system start in such case if there will be any problems with system settings this will be known before head and data will not be broken. Sure not an ideal solution but it will prevent surprises in the middle of execution of the system. And of course we will try to detect 32 JVM inside our system and decrease disk cache size, but I am not sure now if that is possible.

nadestin commented 8 years ago

AFAIK, there are no bullet proof way to figure out if JVM is 32 bit or 64, but following code is covering most real world cases. For some exotic JVMs another way should be found.

public static int getJavaBitWidth() { // Figure out whether bit width of running JVM // Most of JREs support property "sun.arch.data.model" which is exactly what we need here String dataModel = System.getProperty("sun.arch.data.model", "32"); // By default assume 32bit int size = 32; try { size = Integer.parseInt(dataModel); } catch(Throwable t) { // Ignore } return size; }

By the way to @laa. The proposed parameters for 32bit JVM seem to work. I have performed tests with big databases that had crashed on 32bit before. With your settings for storage.diskCache.bufferSize and memory.chunk.size I was able to complete tests without errors. We are still waiting for response from the field, but that workaround looks promising.

Why don't you publish methods of calculating values in the Manual. When I install Application I usually know startup parameters, so I could properly initialize these properties without hacking around. Could you share how did you calculate values 448 and 58720256?

andrii0lomakin commented 8 years ago

Hi @nadestin . A calculation is simple. The maximum amount of direct memory which you may allocate is 512MB I put 64MB for service needs which are more than enough and also split 448 MBs into 8 chunks. So I get 56 MB for each chunk or 58720256 bytes.

andrii0lomakin commented 8 years ago

@nadestin once you confirm that issue is fixed I will provide fix.

nadestin commented 8 years ago

Hello @laa. I confirm, with adjusted parameters we have no more cases of NPE during Direct Memory allocations on 32bit JVMs. (On 64bit JVMS it is working without parameter adjustments).

Your workaround had fixed the problem.

Thank you for your help.

andrii0lomakin commented 8 years ago

Hi @nadestin , OK we will provide complete fix during next week. Thank you very much for your report and your time.

andrii0lomakin commented 8 years ago

@nadestin could you build latest 2.2.x version and run it on the same case without any additional parameters ?

andrii0lomakin commented 8 years ago

Hi @nadestin , I close this issue, please reopen issue if you see it again.

nadestin commented 8 years ago

Sorry, I was on vacation last week, will try it and report status.