microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

Lots of InvalidOperationException: Transaction is committing or rolling back #895

Open nrandell opened 7 years ago

nrandell commented 7 years ago

This is related to this Stack Overflow question: https://stackoverflow.com/questions/45960452/how-to-find-out-why-debug-shows-many-exceptions-in-service-fabric

Using the latest libraries and local dev cluster 5.7.198.9494, in both 1-node and 5-node configurations, I'm seeing an InvalidOperationException with the message: "Transaction NNNNNNN is committing or rolling back or has already committed or rolled back".

This doesn't propagate back up to my code, but it makes for a noisy Output window and makes me wonder what is going on. I guess it is perfectly safe, but it would be nice to understand what is happening and why, and hopefully to prevent it!

Update: I started tracking FirstChanceExceptions and found that this also happens in production.
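The comment doesn't say how the first-chance exceptions were tracked. One standard .NET option is the AppDomain.FirstChanceException event, which fires for every thrown exception before any catch block runs, including exceptions that library code catches internally (as the transaction's Dispose does here). A minimal sketch, assuming per-type counts are enough; FirstChanceExceptionCounter is a hypothetical helper, not part of Service Fabric:

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.ExceptionServices;

// Hypothetical helper: counts first-chance exceptions per exception type, process-wide.
public sealed class FirstChanceExceptionCounter : IDisposable
{
    private readonly ConcurrentDictionary<Type, int> _counts = new ConcurrentDictionary<Type, int>();

    public FirstChanceExceptionCounter()
    {
        // Raised for every thrown exception, before any catch block runs,
        // including exceptions that library code catches and swallows itself.
        AppDomain.CurrentDomain.FirstChanceException += OnFirstChance;
    }

    private void OnFirstChance(object sender, FirstChanceExceptionEventArgs e)
    {
        // Keep this handler cheap and non-throwing: an exception thrown here
        // would itself raise another FirstChanceException.
        _counts.AddOrUpdate(e.Exception.GetType(), 1, (_, n) => n + 1);
    }

    // Number of first-chance exceptions of exactly type T seen so far.
    public int CountOf<T>() where T : Exception =>
        _counts.TryGetValue(typeof(T), out var n) ? n : 0;

    // Unsubscribe so the counter stops observing exceptions.
    public void Dispose() => AppDomain.CurrentDomain.FirstChanceException -= OnFirstChance;
}
```

Creating one instance at service startup and periodically logging CountOf<InvalidOperationException>() would show how often the exception fires in production without attaching a debugger.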

Another update: this is easy to reproduce. Create a new Service Fabric application with a stateful service using the latest VS2017 and SDK, start debugging, and break on InvalidOperationException: one occurs every time a transaction is disposed.

xaostation commented 7 years ago

I'm getting this as well, and I have a ton of code if you need a reproduction scenario. From what I can tell, it happens on Dispose of the transaction after a CommitAsync or Abort.

nrandell commented 7 years ago

Hey @navyahmed, what's happening with this? I can see a number of people have been assigned to look at this, but the silence is a bit worrying. Any chance of a quick update?

navyahmed commented 7 years ago

Yep, will follow up tomorrow morning (PST).

sumukhs commented 7 years ago

Hi @nrandell, thanks for reporting the issue. This is a first-chance exception that is thrown and caught inside our transaction class on the happy path when transactions are disposed.

It was unintentionally introduced in the latest version of the runtime, and we have logged a bug for it. We will fix it in the upcoming release, after which you should stop seeing these in the debugger.

sumukhs commented 7 years ago

Hi @nrandell, to provide an update: we have fixed this in runtime version "6.0.219.9494", which should be available soon.

aarms commented 6 years ago

The comment from Oct 25th says this should have been resolved in the "6.0.219.9494" runtime; however, the ticket tracking the issue is still in the open/fix-coming state.

My team is still seeing the problem on 6.0.219.9494.

Any updates?

sumukhs commented 6 years ago

@aarms, this issue tracks the fix for first-chance exceptions thrown by the transaction, which show up in Visual Studio or any other debugger attached to the process. Are you referring to the same issue or something else?

MattDarg commented 5 years ago

Hi,

I'm getting this as an unhandled exception. The code is fairly straightforward, although I'm using the same transaction across two reliable dictionaries. I'm also doing this in parallel for multiple keys/values:

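            // One transaction spans both dictionaries, so the two writes commit or roll back together.
            using (var tx = stateManager.CreateTransaction())
            {
                await cacheStorage.SetAsync(tx, key, value, WriteOperationTimeout, CancellationToken.None);
                await cacheMetaStorage.SetAsync(tx, key, new CacheMetadata(), WriteOperationTimeout, CancellationToken.None);
                // Commit, then let the using block dispose the transaction.
                await tx.CommitAsync();
            }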

Interestingly, in one case I just checked, the key/value does appear to be in the collection even though the exception was thrown.

            System.InvalidOperationException: Transaction 131939263794928189 is committing or rolling back or has already committed or rolled back
               at System.Fabric.Store.TStore`5.<AddOrUpdateAsync>d__227.MoveNext()
            --- End of stack trace from previous location where exception was thrown ---
               at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
               at Microsoft.ServiceFabric.Data.Collections.DistributedDictionary`2.d__98.MoveNext()

rahku commented 5 years ago

According to the guidelines, you should handle InvalidOperationException.

MattDarg commented 5 years ago

Thanks, yes, I've added some code to retry on the server, although System.InvalidOperationException is quite a generic exception to handle. I should probably wrap it and throw a more specific exception to the client in cases where it doesn't make sense for the server to retry (e.g. a role change).

The other concern is that it's not clear whether the transaction completed or not. It doesn't matter for my current use case, but suppose I were incrementing some kind of global reference or resource-lock count with AddOrUpdate: if I get this exception and retry, is it guaranteed that the count won't be incremented twice?

MattDarg commented 5 years ago

Looking at the CommitAsync docs, they suggest that InvalidOperationException indicates a user error, whereas TransactionFaultedException indicates a system-caused fault that you should retry?
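A sketch of the server-side retry pattern discussed above, assuming transient failures are identified by a caller-supplied predicate (for example TimeoutException or System.Fabric.TransactionFaultedException; whether InvalidOperationException belongs there is your call). TransactionRetry and RunAsync are hypothetical names, not Service Fabric APIs. The key design choice is that each attempt creates and commits its own transaction instead of reusing one that has already failed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not a Service Fabric API.
public static class TransactionRetry
{
    // Runs `attempt` up to maxAttempts times. An exception is retried only if
    // isTransient(ex) returns true and attempts remain; otherwise it propagates.
    // Each invocation of `attempt` is expected to create, use, and commit its
    // own transaction, so a retry never reuses a transaction that already failed.
    public static async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> attempt,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan delay,
        CancellationToken cancellationToken = default)
    {
        if (attempt == null) throw new ArgumentNullException(nameof(attempt));
        if (isTransient == null) throw new ArgumentNullException(nameof(isTransient));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int i = 1; ; i++)
        {
            try
            {
                return await attempt(cancellationToken).ConfigureAwait(false);
            }
            catch (Exception ex) when (i < maxAttempts && isTransient(ex))
            {
                // Linear backoff: wait delay, 2*delay, 3*delay, ... between attempts.
                await Task.Delay(TimeSpan.FromTicks(delay.Ticks * i), cancellationToken)
                          .ConfigureAwait(false);
            }
        }
    }
}

// Hypothetical usage with Reliable Collections (requires the Service Fabric SDK;
// stateManager, counts and key are assumed to exist in your service):
//
// long updated = await TransactionRetry.RunAsync(async ct =>
// {
//     using (var tx = stateManager.CreateTransaction())
//     {
//         long v = await counts.AddOrUpdateAsync(tx, key, 1, (k, old) => old + 1);
//         await tx.CommitAsync();
//         return v;
//     }
// },
// ex => ex is TimeoutException || ex is System.Fabric.TransactionFaultedException,
// maxAttempts: 5,
// delay: TimeSpan.FromMilliseconds(100));
```

This does not settle the idempotency question raised earlier: if an attempt might have committed before the exception surfaced, the retried operation (like the increment in the usage comment) has to be safe to apply twice.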

PTC-JoshuaMatthews commented 4 years ago

I get this error long before committing the transaction, while I am still populating my collections. Retrying doesn't help, as it happens consistently. It seems that attempting to add too much data to too many collections in a single transaction causes this. The same code runs fine if I wrap a separate transaction around the data load into each collection.

The doc says this error most likely indicates an issue with my code, but smaller datasets work fine; it only seems to be an issue with the large values.
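A sketch of the workaround described above, generalized from "one transaction per collection" to "one transaction per fixed-size batch". This is a variation on what the commenter did, not their code; BatchedLoader and LoadAsync are hypothetical names, and the delegate is where the Reliable Collections calls would go. The trade-off is atomicity: each batch commits independently, so a failure partway through leaves the earlier batches committed.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not a Service Fabric API.
public static class BatchedLoader
{
    // Splits `items` into batches of at most batchSize and awaits commitBatch
    // once per batch, in order. Returns the number of batches committed.
    // In a Reliable Collections load, commitBatch would open one transaction,
    // write the batch, and commit it, keeping each transaction small.
    public static async Task<int> LoadAsync<T>(
        IEnumerable<T> items,
        int batchSize,
        Func<IReadOnlyList<T>, CancellationToken, Task> commitBatch,
        CancellationToken cancellationToken = default)
    {
        if (items == null) throw new ArgumentNullException(nameof(items));
        if (commitBatch == null) throw new ArgumentNullException(nameof(commitBatch));
        if (batchSize < 1) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<T>(batchSize);
        int batches = 0;
        foreach (var item in items)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                await commitBatch(batch, cancellationToken).ConfigureAwait(false);
                batches++;
                batch = new List<T>(batchSize); // fresh list; the callee may keep a reference
            }
        }
        if (batch.Count > 0)
        {
            // Commit the final, partially filled batch.
            await commitBatch(batch, cancellationToken).ConfigureAwait(false);
            batches++;
        }
        return batches;
    }
}

// Hypothetical usage (requires the Service Fabric SDK; stateManager, dict and
// pairs are assumed; the batch size of 1000 is arbitrary):
//
// await BatchedLoader.LoadAsync(pairs, 1000, async (batch, ct) =>
// {
//     using (var tx = stateManager.CreateTransaction())
//     {
//         foreach (var kv in batch)
//             await dict.SetAsync(tx, kv.Key, kv.Value);
//         await tx.CommitAsync();
//     }
// });
```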

PTC-JoshuaMatthews commented 4 years ago

For anyone seeing the same issue with large datasets: bumping MinLogSizeInMB up to 4096 completely resolved my errors. Apparently, just as the docs say, if the transaction log decides to truncate in the middle of your transaction, it will roll back the transaction, and if the operation is so large that it causes a truncation every time, it will never succeed. Raising MinLogSizeInMB forces the log to grow larger before truncating, so it can accommodate all of the operations in my transaction.