spring-projects / spring-framework

Spring Framework
https://spring.io/projects/spring-framework
Apache License 2.0

DefaultMessageListenerContainer: more than 9 concurrent consumers with ActiveMQ Artemis leads to non-working XA transactions #32191

Closed (turing85 closed this issue 7 months ago)

turing85 commented 7 months ago

Description

The problem occurs when more than 9 concurrent consumers are used in conjunction with the default task executor. See: https://github.com/spring-projects/spring-framework/blob/ea52ecc5e08a14869ec351af38481c4008c8bcd8/spring-jms/src/main/java/org/springframework/jms/listener/DefaultMessageListenerContainer.java#L764-L768

In this case, we lose some XA properties: when the transaction is rolled back, the message has nevertheless been consumed and is no longer on the queue.

On branch main, narayana is used as transaction manager; on branch atomikos, atomikos is used as transaction manager.
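For context, a minimal sketch of a listener configuration along these lines (bean names, the injected JtaTransactionManager, and the XA-capable connection factory are assumptions for illustration, not taken verbatim from the reproducer):

```java
import jakarta.jms.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.config.DefaultJmsListenerContainerFactory;
import org.springframework.transaction.jta.JtaTransactionManager;

@Configuration
class JmsListenerConfig {

  @Bean
  DefaultJmsListenerContainerFactory jmsListenerContainerFactory(
      ConnectionFactory xaConnectionFactory, // XA-capable, pooled connection factory
      JtaTransactionManager transactionManager) {
    DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
    factory.setConnectionFactory(xaConnectionFactory);
    factory.setTransactionManager(transactionManager);
    // with "9-9" the rollback test passes, with "10-10" the message is lost
    factory.setConcurrency("10-10");
    // no task executor is set, so DefaultMessageListenerContainer falls back
    // to a SimpleAsyncTaskExecutor (see the code linked above)
    return factory;
  }
}
```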

Expected behaviour

The message should be present in the JMS queue after the XA rollback

Actual behaviour

The message is no longer in the JMS queue

Reproducer

NOTE: Running the reproducer requires docker (or a docker-compatible container runtime environment) and a working testcontainers setup.

The same behaviour can be observed with atomikos. The reproducer provides an atomikos-based implementation on branch atomikos. We can run this reproducer as follows:
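For instance (this is the Maven invocation used for the test runs later in this thread; checking out the respective branch first is assumed):

```shell
./mvnw clean test
```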

Environment

Spring Boot version: 3.2.2
Java version: OpenJDK Runtime Environment Temurin-17.0.8+7 (build 17.0.8+7)

jhoeller commented 7 months ago

From a quick look, I suspect that the observed behavior has to do with the thread pool executor being limited to 4 threads. A thread pool with 10 threads should behave similarly to the default executor with 10 consumers. I don't see any potential for the thread pool setup derailing transactions, just for a different concurrency level leading to different transaction runtime behavior.
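For comparison, the thread pool variant would look roughly like the following sketch (illustrative only; the reproducer's actual wiring may differ):

```java
import org.springframework.jms.listener.DefaultMessageListenerContainer;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

class ThreadPoolVariant {

  // A bounded pool of 10 threads instead of the default SimpleAsyncTaskExecutor,
  // so the effective concurrency level matches the 10 concurrent consumers.
  static void configure(DefaultMessageListenerContainer container) {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);
    executor.setMaxPoolSize(10);
    executor.initialize();

    container.setTaskExecutor(executor);
    container.setConcurrentConsumers(10);
  }
}
```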

As for the effect itself, is the JTA setup actually complete? Which transaction manager is in use there, is the XA pool interacting with that same transaction manager, and is the message broker properly configured for XA as well? If message consumption and the transaction rollback are not in sync, it looks like the XA transaction arrangement is not actually working in your setup, rather coincidentally rolling back two unrelated transactions at the same time... which potentially deviates at concurrency level 10.

turing85 commented 7 months ago

@jhoeller thank you for your response.

Yes, I can confirm: setting the thread pool size to 10 also derails the rollback test when using the thread pool with 10 consumers.

In the reproducer, narayana is used, provided by dev.snowdrop:narayana-spring-boot-starter, see here (github.com).

I am quite sure the setup is complete. Running the tests creates a corresponding directory transaction-logs. If we set a breakpoint e.g. here (github.com), we also see that some files are created in the transaction-log directory.

For the broker: the tests use docker.io/apache/activemq-artemis:2.32.0-alpine. The broker should support transactions in its default configuration.

I am not entirely sure I fully understand what you are saying about the deviation. The same (pooling) connection factory is used, and this connection factory uses a single instance of a TransactionManager, see here (github.com) (the TransactionManager is provided by dev.snowdrop:narayana-spring-boot-starter).

The issue originated in camel-jms and was traced by Zheng Feng to the location I mentioned. Here is the relevant discussion on camel.zulipchat.com, and here (github.com) is a reproducer with Camel and Quarkus (which also uses spring-jms under the hood).

jhoeller commented 7 months ago

I'm not sure why or how it would start deviating from cross-resource transaction behavior at concurrency level 10 either :-) Just observing that it seems to be the effective concurrency level that makes the difference - and with the default executor, there is no limit other than the concurrentConsumers setting itself, whereas with a thread pool there is the pool limit as well.

It can be tricky to test actual XA processing. Some setups can fall back to sequential processing of the resources, just triggering the commit/rollback on every resource but not actually being able to restore consistent state in every case. This might also be caused by a participating XA resource not fully implementing the XA contract. That's why I suspected that some such effect from an incomplete XA setup might possibly only show at a higher concurrency level. Other than that, I have no idea why a concurrency level of 9 would still work but a level of 10 would break. Is there special significance of the level 10 in any involved component?

To be clear, I have no experience with Narayana or ActiveMQ XA setups specifically. This is just educated guessing based on my JTA/XA experience from more than a decade ago.

In any case, given that a thread pool setup with the same effective concurrency level exposes the same behavior, I don't see what the Spring JMS components could do differently here.

turing85 commented 7 months ago

I played around with atomikos as transaction manager. It seems that atomikos "works": I get a bunch of exceptions (most of them telling me that the XID cannot be used) and the tests are slow, but they are green. This points to narayana being the culprit.

I would like to have a setup with a transaction manager that does not throw any exceptions.

jhoeller commented 7 months ago

Let's close this from the Spring side then, to be reopened if anything comes up that we could do in spring-jms.

turing85 commented 7 months ago

@jhoeller I was able to reproduce the issue 1:1 with atomikos. I have updated the reproducer. So if atomikos and narayana behave identically, it seems that the issue is on the Spring side. Can we re-open the issue?

martku commented 7 months ago

We see the same behavior as in the reproducer in our project. Please re-open.

jhoeller commented 7 months ago

We can reopen but it's unclear what we could change here. Please advise on where you actually suspect the problem on Spring's side since the transaction and resource interaction via the JTA and JMS APIs is really pretty standard. Neither the transaction coordinator nor the message broker nor the XA pool are part of Spring, so I don't see the primary responsibility on our side.

It does not seem to be the use of a SimpleAsyncTaskExecutor since it also happens with an equally sized thread pool, so it seems to be the actual concurrency level going beyond 9. Is that still a correct observation? Either way, it is unclear how that is supposed to influence transaction reliability. What is the significance of concurrency level 10, which component reacts differently here?

This might require some further analysis on the side of Atomikos/Narayana, ruling out that there is some remaining configuration issue in the resource setup. However, if both JTA providers are reacting equally at the exact same concurrency level 10 vs 9, I suspect that the significance of that level comes out of the ActiveMQ setup, so I recommend further specific investigation there.

snicoll commented 7 months ago

@martku are you also using ActiveMQ Artemis? Is it the same version as the one in the sample? If that's the case, could any of you try with a different broker (ActiveMQ classic)?

I can't see a reason why this is failing this way, and narrowing it down to Spring JMS just because two transaction managers behave this way is a bit of a stretch IMO. Let me know if you have the cycles to give it a try. Thanks!

turing85 commented 7 months ago

@snicoll Can you link me to a corresponding docker image, and - if needed - the necessary driver dependency for ActiveMQ classic? Both are easy to swap in the current setup; I should be able to verify this relatively quickly.

snicoll commented 7 months ago

If you create an app on start.spring.io with ActiveMQ classic and docker compose support, you should get all of that.

martku commented 7 months ago

> @martku are you also using ActiveMQ Artemis? Is it the same version as the one in the sample? If that's the case, could any of you try with a different broker (ActiveMQ classic)?
>
> I can't see a reason why this is failing this way, and narrowing it down to Spring JMS just because two transaction managers behave this way is a bit of a stretch IMO. Let me know if you have the cycles to give it a try. Thanks!

Yes. Thanks @turing85 for trying

turing85 commented 7 months ago

> @martku are you also using ActiveMQ Artemis? Is it the same version as the one in the sample? If that's the case, could any of you try with a different broker (ActiveMQ classic)?
>
> I can't see a reason why this is failing this way, and narrowing it down to Spring JMS just because two transaction managers behave this way is a bit of a stretch IMO. Let me know if you have the cycles to give it a try. Thanks!

For ActiveMQ 5 ("ActiveMQ classic"), the breaking point is different, but the behaviour is the same. Instead of 9/10 connections, the breaking point is 6/7 connections. The corresponding reproducers can be found at https://github.com/turing85/spring-jms-xa, on branches activemq-classic-narayana and activemq-classic-atomikos.

Test results:

I tested the classic variants with both quay.io/artemiscloud/activemq-artemis-broker:artemis.2.32.0 and docker.io/symptoma/activemq:5.18.3 as broker; no difference. The ActiveMQ Artemis variants (on branches main and atomikos) can only be run with the artemis image, obviously.

jhoeller commented 7 months ago

This is obviously a consequence of the actual concurrency level if we see equivalent behavior for similarly configured task executors and for both message broker variants - even if the breaking point is lower with ActiveMQ classic. The effect happening from 6->7 with ActiveMQ classic but from 9->10 with Artemis is interesting. Could be some overall resource limit hit earlier with ActiveMQ classic, or some classic-only ActiveMQ background threads competing in terms of the effective concurrency level.

Any chance to gain some insight into the runtime difference when the 7th/10th thread fails? Could some log from ActiveMQ and Atomikos/Narayana reveal any clues here? The key question is why the message gets acknowledged despite an XA rollback. From a Spring perspective, we simply call MessageConsumer.receive and eventually UserTransaction.rollback; the underlying interaction between the message broker connection and the transaction coordinator is up to your resource setup.

turing85 commented 7 months ago

I am sure that if one increases the log level (e.g. for the narayana component), one would see the relevant log output. The reproducer is there; feel free to increase the log levels of whichever categories are needed.

jhoeller commented 7 months ago

I've updated the title of this issue to reflect our understanding of the matter. However, since the current insight does not suggest that this is caused by Spring-specific behavior but rather entirely by the interaction between external components behind standard JMS/JTA API usage, please be aware that we are outside of our core expertise here in the Spring Framework team.

Based on our current understanding, a plain reproducer without any Spring involved should exhibit the same behavior: e.g. with a ThreadPoolExecutor that spins up 10 threads with JMS MessageConsumer receive calls wrapped inside a JTA UserTransaction. If such a reproducer works fine with an effective concurrency level of 10 but a Spring DefaultMessageListenerContainer does not, that would change the situation in that it suggests some Spring-specific behavior being involved in the effect here.
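A rough sketch of such a plain reproducer, using the JTA TransactionManager for explicit XAResource enlistment (connection factory, transaction manager lookup, and queue name are placeholders that would need to be wired to the Narayana/Atomikos and broker setup at hand):

```java
import jakarta.jms.Message;
import jakarta.jms.MessageConsumer;
import jakarta.jms.Queue;
import jakarta.jms.XAConnection;
import jakarta.jms.XAConnectionFactory;
import jakarta.jms.XASession;
import jakarta.transaction.TransactionManager;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PlainXaReproducer {

  // Placeholders: obtain these from the Narayana/Atomikos and broker setup under test.
  static XAConnectionFactory connectionFactory;
  static TransactionManager transactionManager;

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(10);
    for (int i = 0; i < 10; i++) {
      pool.submit(PlainXaReproducer::receiveAndRollback);
    }
    pool.shutdown();
  }

  static void receiveAndRollback() {
    try (XAConnection connection = connectionFactory.createXAConnection();
         XASession session = connection.createXASession()) {
      connection.start();
      transactionManager.begin();
      // enlist the JMS session's XAResource with the JTA transaction
      transactionManager.getTransaction().enlistResource(session.getXAResource());

      Queue queue = session.createQueue("queue");
      try (MessageConsumer consumer = session.createConsumer(queue)) {
        Message message = consumer.receive(1000);
        // simulate a failing listener: roll back the XA transaction;
        // the message is expected to remain on (or return to) the queue
      }
      transactionManager.rollback();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
```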

turing85 commented 7 months ago

> This is obviously a consequence of the actual concurrency level if we see equivalent behavior for similarly configured task executors and for both message broker variants - even if the breaking point is lower with ActiveMQ classic. The effect happening from 6->7 with ActiveMQ classic but from 9->10 with Artemis is interesting. Could be some overall resource limit hit earlier with ActiveMQ classic, or some classic-only ActiveMQ background threads competing in terms of the effective concurrency level.
>
> Any chance to gain some insight into the runtime difference when the 7th/10th thread fails? Could some log from ActiveMQ and Atomikos/Narayana reveal any clues here? The key question is why the message gets acknowledged despite an XA rollback. From a Spring perspective, we simply call MessageConsumer.receive and eventually UserTransaction.rollback; the underlying interaction between the message broker connection and the transaction coordinator is up to your resource setup.

I re-ran the tests with the transaction manager's log level set to TRACE.

I ran ./mvnw clean test | tee <branch-name>.log for all four branches. The logs are attached.

main.log atomikos.log activemq-classic-narayana.log activemq-classic-atomikos.log

jhoeller commented 7 months ago

Thanks, those logs are definitely worth digging into a bit deeper. Some initial observations:

The main log indicates strange behavior from the Narayana JMS pool where it tries to register a JTA synchronization on Session.close and even Connection.close but fails there when the transaction is aborted already (see "ARJUNA016083"). It generally uses JTA synchronizations for resource management according to the log, most of the time successfully. This could interfere with some assumptions in the underlying connection pool and/or the provider's ConnectionFactory, with resources being closed later than expected.

The Atomikos log shows that Artemis propagates some connection failures through a JMS ExceptionListener, which in turn shows up in CachingConnectionFactory log messages where such exceptions trigger a reconnect. That's not too concerning on its own for regular CachingConnectionFactory usage as alternative to a full-fledged connection pool. However, why is CachingConnectionFactory involved here in the first place? This has no place in an XA setup, all that's needed there is a native XA-aware ConnectionFactory. From a quick look, I could not see where CachingConnectionFactory is actually being set up. Note that CachingConnectionFactory uses a shared Connection which might interfere with XA connection management.

Spring's local resource synchronization seems to be involved as well, possibly making things more complicated than necessary. You could switch off transaction synchronization on the JtaTransactionManager bean through setTransactionSynchronization(SYNCHRONIZATION_NEVER) and see whether that makes any difference, relying solely on the XA connection pool for resource management.
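A sketch of that change, assuming the JtaTransactionManager bean is declared manually rather than provided by the Narayana/Atomikos starter:

```java
import jakarta.transaction.TransactionManager;
import jakarta.transaction.UserTransaction;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.jta.JtaTransactionManager;
import org.springframework.transaction.support.AbstractPlatformTransactionManager;

@Configuration
class JtaConfig {

  @Bean
  JtaTransactionManager transactionManager(UserTransaction userTransaction,
      TransactionManager jtaTransactionManager) {
    JtaTransactionManager tm = new JtaTransactionManager(userTransaction, jtaTransactionManager);
    // rely solely on the XA connection pool for resource management and skip
    // Spring's own transaction synchronization callbacks
    tm.setTransactionSynchronization(AbstractPlatformTransactionManager.SYNCHRONIZATION_NEVER);
    return tm;
  }
}
```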

Last but not least, you seem to perform some manual stopping of the JmsListenerEndpointRegistry when your message listener throws an exception to trigger a rollback. This is generally not necessary and might interfere with ongoing message processing. Also, you hand off that manual stop call to a separate thread via a TaskExecutor; even for well-intended stop attempts, this should not be necessary either.
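For reference, the pattern referred to here presumably looks roughly like the following sketch (the registry and executor wiring are assumptions, not taken from the reproducer):

```java
import org.springframework.core.task.TaskExecutor;
import org.springframework.jms.config.JmsListenerEndpointRegistry;

class StopOnFailure {

  private final JmsListenerEndpointRegistry registry;
  private final TaskExecutor taskExecutor;

  StopOnFailure(JmsListenerEndpointRegistry registry, TaskExecutor taskExecutor) {
    this.registry = registry;
    this.taskExecutor = taskExecutor;
  }

  // called from the message listener after it throws to trigger the rollback:
  // hand the stop call to a separate thread so no further messages are consumed
  void stopConsumers() {
    taskExecutor.execute(registry::stop);
  }
}
```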

turing85 commented 7 months ago

> Thanks, those logs are definitely worth digging into a bit deeper. Some initial observations:
>
> The main log indicates strange behavior from the Narayana JMS pool where it tries to register a JTA synchronization on Session.close and even Connection.close but fails there when the transaction is aborted already (see "ARJUNA016083"). It generally uses JTA synchronizations for resource management according to the log, most of the time successfully. This could interfere with some assumptions in the underlying connection pool and/or the provider's ConnectionFactory, with resources being closed later than expected.
>
> The Atomikos log shows that Artemis propagates some connection failures through a JMS ExceptionListener, which in turn shows up in CachingConnectionFactory log messages where such exceptions trigger a reconnect. That's not too concerning on its own for regular CachingConnectionFactory usage as alternative to a full-fledged connection pool. However, why is CachingConnectionFactory involved here in the first place? This has no place in an XA setup, all that's needed there is a native XA-aware ConnectionFactory. From a quick look, I could not see where CachingConnectionFactory is actually being set up. Note that CachingConnectionFactory uses a shared Connection which might interfere with XA connection management.
>
> Spring's local resource synchronization seems to be involved as well, possibly making things more complicated than necessary. You could switch off transaction synchronization on the JtaTransactionManager bean through setTransactionSynchronization(SYNCHRONIZATION_NEVER) and see whether that makes any difference, relying solely on the XA connection pool for resource management.
>
> Last but not least, you seem to perform some manual stopping of the JmsListenerEndpointRegistry when your message listener throws an exception to trigger a rollback. This is generally not necessary and might interfere with ongoing message processing. Also, you hand off that manual stop call to a separate thread via a TaskExecutor; even for well-intended stop attempts, this should not be necessary either.

Thank you @jhoeller for taking a look. I will try setTransactionSynchronization(SYNCHRONIZATION_NEVER) and report back.

For the stopping: this is mostly done so that the application stops consuming messages and I can assert in the test that the message is still present. If I did not stop the consumer, the application would try to read the message over and over again. I could set up a dead letter queue and check that queue in the tests instead. This, however, would complicate the test setup. Plus, I know how to set up such a queue in Artemis, but I do not know how to set it up in ActiveMQ 5.

turing85 commented 7 months ago

Here are the runs with setTransactionSynchronization(SYNCHRONIZATION_NEVER):

main.log atomikos.log activemq-classic-narayana.log activemq-classic-atomikos.log

turing85 commented 7 months ago

@snicoll what feedback exactly are you waiting for?

jhoeller commented 7 months ago

Ok, so Spring's local resource synchronization is not causing any harm here: Those logged synchronization exceptions with Narayana are the same, just caused by a different call stack. Thanks for trying that variant. (As a side note: It's nevertheless odd that the Narayana JMS pool fails to close its own resources due to internal use of JTA synchronizations in an impossible state.)

I've just noticed that Artemis actually seems to have a dead letter queue installed (by default?). Right before the test fails, there is the following log entry:

2024-02-08T19:05:45.810+01:00 INFO 160985 --- [ream-1207638882] 🐳 : 2024-02-08 18:05:45,633 WARN [org.apache.activemq.artemis.core.server] AMQ222149: Sending message Reference[1056]:RELIABLE:CoreMessage[messageID=1056, durable=true, userID=ae46e43d-c6ac-11ee-b939-28dfeb2beb98, priority=4, timestamp=Thu Feb 08 18:05:45 UTC 2024, expiration=0, durable=true, address=queue, size=220, properties=TypedProperties[__AMQ_CID=ae46961a-c6ac-11ee-b939-28dfeb2beb98, _AMQ_ROUTING_TYPE=1]]@1066580182 to Dead Letter Address DLQ from queue

Is the assertion in the test actually reliable? If the broker autonomously moves the message to the DLQ once it failed too often, it would be gone from the original queue out of that reason... even if the XA rollback worked fine. Could it be that 10 concurrent workers are reaching the limit for a DLQ move within the time of the test, whereas 9 workers are not quite reaching it yet?

P.S.: Still wondering where the CachingConnectionFactory comes from. It causes a lot of disconnect stack traces in the ActiveMQ classic case, apparently after a hard shutdown of the socket connection to the broker. It might sit around for unrelated reasons, e.g. for JmsTemplate usage, and not be in actual action here... so maybe we can safely ignore it?

turing85 commented 7 months ago

Welp, mystery solved. I've set a breakpoint after the service went down, but before the test asserts that the message is on the queue... Thanks a lot @jhoeller

jhoeller commented 7 months ago

No worries, glad that I could help, and I learned a few things along the way :-)

turing85 commented 7 months ago

Just in case anyone is interested: I documented the solution in the reproducer, and reconfigured the docker image (had to switch from artemis-broker to artemis-broker-kubernetes) so that messages are never sent to the DLQ.
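For anyone who prefers reconfiguring a self-managed broker rather than switching images: in Artemis, dead-lettering can be disabled per address by setting max-delivery-attempts to -1 in broker.xml (a sketch under that assumption, not the exact configuration used in the reproducer):

```xml
<address-settings>
  <!-- -1 means unlimited delivery attempts: after a rollback the message is
       redelivered instead of being moved to the dead letter address -->
  <address-setting match="#">
    <max-delivery-attempts>-1</max-delivery-attempts>
  </address-setting>
</address-settings>
```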