Open timw opened 2 months ago
This appears to do what I want:
--- forkSrcPrefix/core/src/main/java/com/orientechnologies/orient/core/tx/OTransactionRealAbstract.java
+++ forkDstPrefix/core/src/main/java/com/orientechnologies/orient/core/tx/OTransactionRealAbstract.java
@@ -624,10 +624,16 @@ public abstract class OTransactionRealAbstract extends OTransactionAbstract
public void resetAllocatedIds() {
for (Map.Entry<ORID, ORecordOperation> op : allEntries.entrySet()) {
if (op.getValue().type == ORecordOperation.CREATED) {
+ ORID current = op.getValue().getRID().copy();
ORecordId oldNew =
new ORecordId(op.getKey().getClusterId(), op.getKey().getClusterPosition());
updateIdentityAfterCommit(op.getValue().getRID(), oldNew);
- updatedRids.remove(op.getValue().getRID());
+ // Clear previous update RID chain, which will include mapping created by updateIdentityAfterCommit
+ while (current != null) {
+ current = updatedRids.remove(current);
+ }
+ database.assignAndCheckCluster(op.getValue().getRecord(), null);
+ updatedRids.put(op.getValue().getRID().copy(), oldNew);
}
}
}
I think the same patch would need to be applied to OMicroTransaction
?
Hi @timw,
For the stalled transaction sequential it should be there a reset when the transaction fail, that free the sequential position, if you want to track it are the call to context.releasePromises();
that will finally call OTransactionSequenceManager.notifyFailure
, will double check that is is missing somewhere .
For the problem with the ids reset I do remember I've fixed a similar problem in 3.2.x series a while ago, but the only commit I can find are https://github.com/orientechnologies/orientdb/commit/4ef8c449627cc92e5afb5a8f546bb90f39348cab and b1375cb81a8a3b0fdc9ff0df700995c63eb90347 that seems to be already part of your changes.
Removing all the mapping in updatedRids
it may be a bit dangerous, because in the records there may still be ids pointing to something like 10:-10
that when serialized will not find the mapping anymore to the real id and fail/store a stale pointer.
The process that happen on commit will be transforming an id from 10:-5
or -1:-10
to something like 10:20
( often the process is multiple steps: -1:-5
-> 10:-5
-> 10:20
and all the steps are in the updatedRids
mapping ).
When reverting this id assignment because of concurrent assignation of clusterPostion
(the second part of the id so in 10:20
is 20
) the correct way would be reset the id to something like 10:-x
to trigger the next id association, and make sure that all the ids mapping (updatedRids
) now have as end point the resetted id 10:-x
, making sure that the old assigned id is now mapping to the new temporary.
So if before resetting in the updatedRids
was there:
-1:-10 -> 10:-10 //First temporary transition
10:-10 -> 10:50 //Final Id assignation that failed to allocate
after reset the map should look something like
-1:-10 -> 10:-10 //First temporary transition
10:50 -> 10:-10 //Previous Id assignation now point to the temporary
so need to remove the 10:-10 -> 10:50
and add 10:50 -> 10:-10
.
the id reset should not reset the clusterId ( int the 10:20
is the 10
part ) if is doing that probably there is something wrong.
Yes OMicroTransaction
should have the exact same logic (in 3.2.x it merged back in the normal transaction)
(P.S. I know that many thing I wrote here you already know, making it clear to everyone )
Thanks @tglman - your explanation was very helpful in understanding the intent here (although the updated rids map works in the opposite direction to your examples, tracking back references, so it's a bit confusing).
Having another look at 3.2/develop, it looks like you already fixed this in a667b24, which I should have noticed earlier, so this looks like it's only a 3.1.x bug. That patch appears to work for me.
With the patch applied, the net effect is to set the id to current_cluster:-x
, and the updatedRids
map is unchanged (since that id is already present in the map, and the mapping from it to the initial id is restored after the call to updateIdentityAfterCommit
.
OrientDB Version: 3.1.21 (also present in 3.2)
Java Version: 1.8
OS: Linux/Mac
We have experienced production outages where all commit activity in a distributed database stalls, which by analysing heap dumps we have isolated to all of the
promisedSequential
slots inOTransactionSequenceManager
being occupied by transactions that have long since completed.We've reproduced the effect locally by running two client threads that rapidly (i.e. with no pauses) query and update a shared record, and create a separate record at the same time.
By instrumenting with logging we isolated the slot leak behaviour to when the transaction fails with a
OClusterDoesNotExistException: Cluster id '-1' is outside the of range of configured
.When this error occurs, the transaction in the sequence manager is never cleared (either by
notifySuccess
ornotifyFailure
).The root cause of the
OClusterDoesNotExistException
appears to be:ONewDistributedTransactionManager.retriedCommit
with aODistributedRecordLockedException
OTransactionRealAbstract.resetAllocatedIds
, which resets the RID for the object creation (to-1,-2
) and invokesupdateIdentityAfterCommit
.ONewDistributedTransactionManager.getInvolvedClusters
fails, because one of the Record has a-1
cluster id.It looks like a call to
ODatabaseDocumentAbstract.assignAndCheckCluster
is missing during the reset (which is what happens in the original assignment inOTransactionOptimisticServer.assignClusters
, and introducing that call does appear to fix the problem.What I'm not clear on is whether the
ODatabaseDocumentAbstract.assignAndCheckCluster
call should be added toOTransactionRealAbstract.updateIdentityAfterCommit
or just toOTransactionRealAbstract.resetAllocatedIds
, and how exactlyupdatedRids
should be corrected when that happens.