Thanks @graben, I don't think it's that big an issue, however.
There are `resources` and `recoverables` maps in the 3 TM wrappers, but these are touched only during registration of a connection manager for a given XA resource - and that is usually done by Blueprint, SCR, or when you simply register an XADataSource/XAConnectionFactory.
There is also the `transactions` map in AbstractTransactionManagerWrapper, but this map is keyed by javax.transaction.Transaction obtained directly from the underlying TM, which should take care of synchronization - each thread gets a different transaction anyway.
@gnodet what do you think - maybe I missed something?
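For context, here is a simplified sketch of the getTransaction() path being discussed. This is not the actual pax-transx source; the field names and the use of WeakHashMap/computeIfAbsent are inferred from the map descriptions above and the thread dump quoted below. The point is that although each thread passes its own Transaction key, every call still mutates the same shared, unsynchronized map.

```java
import java.util.Map;
import java.util.WeakHashMap;
import javax.transaction.Transaction;
import javax.transaction.TransactionManager;

// Simplified sketch, NOT the actual pax-transx source: names and types are
// inferred from the discussion and the thread dump in this issue.
public abstract class AbstractTransactionManagerWrapperSketch {

    protected final TransactionManager tm;

    // Keyed by the Transaction obtained from the underlying TM; weak keys let
    // entries disappear once the TM no longer references the transaction.
    // A plain WeakHashMap is NOT thread-safe, yet many pool threads call
    // getTransaction() concurrently, so puts and resizes race on this one instance.
    protected final Map<Transaction, TransactionWrapper> transactions = new WeakHashMap<>();

    protected AbstractTransactionManagerWrapperSketch(TransactionManager tm) {
        this.tm = tm;
    }

    public TransactionWrapper getTransaction() throws Exception {
        Transaction tx = tm.getTransaction();   // thread-bound, different per thread
        if (tx == null) {
            return null;
        }
        // Each thread uses a different key, but computeIfAbsent still mutates the
        // shared bucket table (put + resize), which is where the corruption happens.
        return transactions.computeIfAbsent(tx, TransactionWrapper::new);
    }

    protected static class TransactionWrapper {
        final Transaction transaction;
        TransactionWrapper(Transaction transaction) {
            this.transaction = transaction;
        }
    }
}
```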
This is not a VIRTUAL or THEORETICAL issue. Actually, we have a real production thread stuck on the transactions Map (see the original pull request for the stack trace). Reproducible 100%. The transactions map changes without any synchronization - on a VM with only 2 cores and a very commodity load.
@grgrzybek resources and recoverables are accessed from many different threads. This leads to strange, hardly reproducible bugs. HOWEVER, synchronization for these collections does not have any measurable performance impact, because it is used only on resource registration/unregistration.
Wouldn't it be better to add explicit synchronization instead of relying on a synchronized map? I find synchronized maps less helpful because they only address the problem of HashMap corruption but do not solve the problem of data consistency. @vladimirfx If you say it is 100% reproducible, can you point out the code path from the thread dump that causes the problem? @grgrzybek I think transactions are a critical part of the system and we should address any issue (small or big ;-) )
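To illustrate the difference being pointed at here (a generic example, not pax-transx code): Collections.synchronizedMap makes each individual call atomic, which is enough to prevent structural corruption of the map, but a compound check-then-act sequence still needs an explicit lock around the whole sequence if data consistency matters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SynchronizedMapVsExplicitLock {

    private final Map<String, Integer> counters =
            Collections.synchronizedMap(new HashMap<>());

    // Each call below is individually atomic, so the map structure stays intact...
    void unsafeIncrement(String key) {
        Integer old = counters.get(key);               // step 1
        counters.put(key, old == null ? 1 : old + 1);  // step 2 - another thread may
    }                                                  // have updated the key in between

    // ...but keeping the data consistent requires an explicit lock spanning the
    // whole read-modify-write, which is what explicit synchronization buys you.
    void safeIncrement(String key) {
        synchronized (counters) {
            Integer old = counters.get(key);
            counters.put(key, old == null ? 1 : old + 1);
        }
    }
}
```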
"schedulerSyncChanges-1" Id=162 in RUNNABLE at java.util.WeakHashMap.transfer(WeakHashMap.java:528) at java.util.WeakHashMap.resize(WeakHashMap.java:493) at java.util.WeakHashMap.put(WeakHashMap.java:466) at java.util.Map.computeIfAbsent(Map.java:958) at org.ops4j.pax.transx.tm.impl.AbstractTransactionManagerWrapper.getTransaction(AbstractTransactionManagerWrapper.java:49) at org.ops4j.pax.transx.connector.impl.GenericConnectionManager.getMci(GenericConnectionManager.java:186) at org.ops4j.pax.transx.connector.impl.GenericConnectionManager.allocateConnection(GenericConnectionManager.java:171) at org.ops4j.pax.transx.connector.impl.GenericConnectionManager.allocateConnection(GenericConnectionManager.java:167) at org.ops4j.pax.transx.jdbc.impl.TransxDataSource.getConnection(TransxDataSource.java:68) at org.ops4j.pax.transx.jdbc.impl.TransxDataSource.getConnection(TransxDataSource.java:59) at sun.reflect.GeneratedMethodAccessor328.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498)
For the transactions map, synchronization on some particular mutex is not needed because a transaction is thread-bound. There is no real concurrency - just HashMap corruption. But for resource registration - I agree, synchronizing the reg/unreg methods on some mutex is reasonable. I can update my pull request.
Another question that was never asked: what are the maps in the wrappers for? It seems they are not used anywhere in the code, or am I missing something?
Maybe we should remove them (the resource/recover maps) altogether and only synchronize the getTransaction() method in AbstractTransactionManagerWrapper.
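A minimal sketch of the two options on the table, assuming the map should keep weak keys so entries can still be collected (this is a suggestion only, not the merged change; the wrapper value type is a stand-in):

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import javax.transaction.Transaction;

// Sketch of the two options discussed above, not the actual patch.
class TransactionsMapOptions {

    // Option 1: keep weak keys, delegate locking to the synchronized wrapper.
    private final Map<Transaction, Object> transactions =
            Collections.synchronizedMap(new WeakHashMap<>());

    Object getTransactionViaWrapper(Transaction tx) {
        // computeIfAbsent on a synchronizedMap holds the map's mutex for the whole call.
        return transactions.computeIfAbsent(tx, t -> new Object());
    }

    // Option 2: plain WeakHashMap plus an explicitly synchronized accessor,
    // i.e. "only synchronize getTransaction()".
    private final Map<Transaction, Object> plainTransactions = new WeakHashMap<>();

    synchronized Object getTransactionSynchronized(Transaction tx) {
        return plainTransactions.computeIfAbsent(tx, t -> new Object());
    }
}
```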
Thanks for the feedback. I did some detailed race/lock analysis in other projects (Karaf, Aries) before, but here I haven't yet had the chance. I believe this may be the issue, and I can merge this PR without problems.
If you have more issues, I'll be happy to review them - however, I'll be back from holidays in ~2 weeks. Today is my last day ;)
Well, it would be nice to have feedback from @gnodet about my last statement since I actually see no usage/need for most of the Maps. Let us talk about that in two weeks after your holiday. Maybe we can optimize that before releasing 0.4. :)
Maybe prepare the 0.4.0 release and optimize in 0.5? JB can then include this fix and #5 in Karaf 4.2.1.
@gnodet @jbonofre may I ask you to review this as well? Thanks!
@grgrzybek: Please have a look, because this looks like a valid bug under heavy concurrent use.