Closed: schlosna closed this 3 months ago
Released 0.1071.0
FWIW, I missed that https://bugs.openjdk.org/browse/JDK-8266310 includes the fix https://github.com/openjdk/jdk/commit/e47803a84feb6d831c6c6158708d29b4fffc99c9 in JDK 18+. That fix was not linked from https://bugs.openjdk.org/browse/JDK-8266350, but JDK-8266350 is mentioned in https://github.com/openjdk/jdk/pull/3976#issuecomment-840485444
General
Before this PR:
A service recently encountered a class loader deadlock due to the use of eclipse-collections in combination with a JNI native library. This is likely related to OpenJDK bug JDK-8266350 (Deadlock with signed jar and loadLibrary) and is similar to https://github.com/netty/netty/issues/11209, specifically https://github.com/netty/netty/issues/11209#issuecomment-829468638, as the eclipse-collections JAR is signed and the eclipse-collections API factory's ServiceLoader traversal increases the probability of deadlocking with the JNI class loader, especially when multiple threads are loading classes at service startup.
While the service that originally encountered this issue was not using AtlasDB directly in process, we can reduce the probability of encountering this issue in AtlasDB clients by avoiding the eclipse-collections API factory and directly using the impl factory, which does not traverse class loaders.
org.eclipse.collections.api.factory.primitive.LongLists
vs.
org.eclipse.collections.impl.factory.primitive.LongLists
org.eclipse.collections.impl.list.immutable.primitive.ImmutableLongListFactoryImpl
org.eclipse.collections.impl.list.mutable.primitive.MutableLongListFactoryImpl
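The difference between the two factory styles can be sketched with the JDK alone. This is a rough illustration, not the eclipse-collections source: `LongListFactory`, `LongListFactoryImpl`, `viaServiceLoader`, and `direct` are hypothetical stand-ins invented for this example. The API-style lookup asks `ServiceLoader` to discover a provider through the class loader, while the impl-style lookup is a plain static reference.

```java
import java.util.Iterator;
import java.util.ServiceLoader;

public class FactoryLookupSketch {
    /** Hypothetical stand-in for a factory interface; not the eclipse-collections type. */
    public interface LongListFactory {
        long[] of(long... values);
    }

    /** Hypothetical stand-in for an implementation singleton. */
    public enum LongListFactoryImpl implements LongListFactory {
        INSTANCE;

        @Override
        public long[] of(long... values) {
            return values.clone();
        }
    }

    /** API-style lookup: ServiceLoader searches the class loader for registered providers. */
    public static LongListFactory viaServiceLoader() {
        Iterator<LongListFactory> providers =
                ServiceLoader.load(LongListFactory.class).iterator();
        // No provider is registered in this sketch, so fall back to the singleton.
        return providers.hasNext() ? providers.next() : LongListFactoryImpl.INSTANCE;
    }

    /** Impl-style lookup: a direct static reference with no provider discovery. */
    public static LongListFactory direct() {
        return LongListFactoryImpl.INSTANCE;
    }

    public static void main(String[] args) {
        System.out.println(viaServiceLoader().of(1L, 2L, 3L).length);
        System.out.println(direct().of(1L, 2L, 3L).length);
    }
}
```

Both paths end up with the same factory here. The point is only that the second path never consults `ServiceLoader`, which is the step this PR tries to avoid by using the impl factory classes listed above.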
After this PR:
==COMMIT_MSG==
Avoid expensive ServiceLoader class loader traversal lookup and attempt to work around possible JDK-8266350 Deadlock with signed jar and loadLibrary.
See:
==COMMIT_MSG==
Priority:
Concerns / possible downsides (what feedback would you like?): slightly more coupled to eclipse-collections implementation
Is documentation needed?:
Compatibility
Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?:
Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?:
The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.):
Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?:
Does this PR need a schema migration?
Testing and Correctness
What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:
What was existing testing like? What have you done to improve it?:
If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.:
If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?:
Execution
How would I tell this PR works in production? (Metrics, logs, etc.):
Has the safety of all log arguments been decided correctly?:
Will this change significantly affect our spending on metrics or logs?:
How would I tell that this PR does not work in production? (monitors, etc.):
If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:
If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):
Scale
Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:
Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:
Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:
Development Process
Where should we start reviewing?:
If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:
Please tag any other people who should be aware of this PR: @jeremyk-91 @sverma30 @raiju