quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.57k stars 2.63k forks source link

Narayana JTA: expose property to set xaAssumeRecoveryComplete #35806

Open turing85 opened 1 year ago

turing85 commented 1 year ago

Description

There is a gap in the XA restore process. If all resources have acknowledged the transaction, narayana deletes the transaction from its transaction store. If the application dies right in this moment and restarts, narayana still thinks that the XA transaction is not finished, but the transactions on the resources were already committed, and the resources do not have any memory about those transactions. This, in return, leads the a log message similar to:

... ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord ...

This message will appear periodically with log-level WARN, until the correspoding transaction log is removed.

This link (access.redhat.com) suggests setting system-property com.arjuna.ats.jta.xaAssumeRecoveryComplete to true.

While we can start the application with java ... -Dcom.arjuna.ats.jta.xaAssumeRecoveryComplete=true ..., it would be convenient to have a configuration property like quarkus.transaction-manager.xa-assume-recovery-complete that can be set in application.properties/application.yml.

Implementation ideas

No response

quarkus-bot[bot] commented 1 year ago

/cc @mmusgrov (narayana)

turing85 commented 1 year ago

@gsmet If the corresponding PR gets merged, I'd like to have a ~downstream~ backport to 2.13 and 2.16. Would this be possible? ~Only change required is adding import java.util.List; in file NarayanaJtaProcessor.java on both branches.~

mmusgrov commented 1 year ago

Doing this is risky because you may end up deleting logs for in doubt transactions if the assumption is wrong which could result in inconsistent data.

We added the interface org.jboss.tm.XAResourceWrapper to mitigate against this and it should be implemented by the resources that get passed to the transaction extension.

turing85 commented 1 year ago

Okay... then I am out. I took a quick look at the interface and I have no idea how to implement this and what resource have to implement this interface. I am going to close my PR. @mmusgrov will you take over?

mmusgrov commented 1 year ago

@turing85 Not really. We do not know which resources might have known about the transaction unless our transaction record can identify the resource. If the resource that's registered with us implements org.jboss.tm.XAResourceWrapper then we know its JNDI name and if we know that we have asked the resource, corresponding to the JNDI name, for its in doubt Xids and if it did not know about this transaction then we may infer that it must have completed.

But we do not control the resource registrations, that is the job of the Agroal subsystem so only Agroal can help in this situation. But Agroal does not know either because Quarkus does not initialise the integration code with the datasource name. @barreiro is that a fair description of the situation and if so can anyone suggest what it would take to ensure that Agroal can be supplied with that name.

turing85 commented 1 year ago

@mmusgrov I am very sure that the resource causing the issue is the JMS connection, so no Agroal is involved.

What does this exactly mean? Should I re-open the PR?

mmusgrov commented 1 year ago

The issue needs to be fixed by whatever component is supplying us with the resource.

turing85 commented 1 year ago

That would be quarkus-artemis. But quarkus-artemis supplies a ActiveMQConnectionFactory extends JNDIStorable implements ConnectionFactoryOptions, Externalizable, ConnectionFactory, XAConnectionFactory, AutoCloseable. How would I implement that interface then? I am a little bit at a loss here.

mmusgrov commented 1 year ago

Maybe the thing that implements the XASession methods can check. You'd need to do it just for the resources that you register with us since end user apps would not want the wrapper. I assumed that the route from ActiveMQConnectionFactory to the call to register the resource with us is not obvious so I did not look.

turing85 commented 1 year ago

It is obvious. CF -> XASession -> XAResource. Do you have an example of, for example, an agroal datasource implementing the interface?

mmusgrov commented 1 year ago

Agroal does not implement it, I was hoping that this problem would be a trigger for Luis and the quarkus team to support the interface because it's a problem. But IronJacamar does and I believe there is some work in Quarkus to support that.

turing85 commented 1 year ago

Well in this case, I propose we merge my original PR until this issue is resolved for good. Having these logs popping up over and over again is noise, and setting the property does not work in native mode.

mmusgrov commented 1 year ago

The window where this crash can happen is very small:

  1. commit resource.
  2. application crashes.
  3. delete log

So I'd be surprised to see the problem popping up over and over again, perhaps there is another root cause analysis or the reproducer is contrived.

The consequences of leaving the TM log lying around is rare noise. The consequence of using xaAssumeRecoveryComplete and getting the assumption being wrong is data integrity.

I'd definitely want to avoid the latter so I'd resist approving the PR.

turing85 commented 1 year ago

Multiply the window times 100. We have multiple services producing those logs.

mmusgrov commented 1 year ago

I'd still prefer noise over loss of data integrity so lets see if we can up the priority of implementing XAResourceWrapper.

mmusgrov commented 1 year ago

For completeness, as an aside, the name of the datasource doesn't have to be JNDI, which I'm not sure if Quarkus supports, it just needs to be unique.

turing85 commented 1 year ago

Nobody is forced to set this option. We are right now forced to have this noise. And setting said property produces a very clear log message, stating that the transaction will be assumed to be completed.

mmusgrov commented 1 year ago

I'm not convinced every developer knows the value of maintaining ACID semantics.

mmusgrov commented 1 year ago

I'd also check your environment to determine why so many applications are crashing or are being scaled down with pending work.

mmusgrov commented 2 months ago

@turing85 As mentioned, even though the property xaAssumeRecoveryComplete requires that wrapped resources have JNDI names any unique value (amongst all resources) will do. So I agree that we could add support for XAResourceWrapper to the jta extension now so that if, in the future, Agroal is able to wrap its resources and provide such a unique value then the issue would not be blocked waiting for the jta extension part of the feature.