jakartaee / validation

Jakarta Validation
http://beanvalidation.org
Apache License 2.0
127 stars 54 forks source link

Bug#180_Stuck_threads_due_to_WeakHashMap #181

Open lananda opened 2 years ago

lananda commented 2 years ago

I'm not sure how this is supposed to solve the problem exactly.

As far as I can see, the code that accesses this map is already synchronized:

      private synchronized List<ValidationProvider<?>> getCachedValidationProviders(ClassLoader classLoader) {
          SoftReference<List<ValidationProvider<?>>> ref = providersPerClassloader.get( classLoader );
          return ref != null ? ref.get() : null;
      }
      private synchronized void cacheValidationProviders(ClassLoader classLoader, List<ValidationProvider<?>> providers) {
          providersPerClassloader.put( classLoader, new SoftReference<>( providers ) );
      }

So, in theory, adding one more level of synchronization wouldn't hurt (except performance-wise), but I don't see how it could help?

Also... I'd really like more info about #180. It mentions "stuck threads", but if you're talking about a deadlock, there should be multiple threads involved, and I only see one in the issue description... ?

Here (getCachedValidationProviders) the synchronization is on class monitor - this is not going to help this situation, the HashMap itself is not synchronized & that is leading to the issue. this has been reviewed by JDK team too. they suggest the Validation.java WeakHashMap not being synchronized is the root cause.

lananda commented 2 years ago

I have updated the bug with stack of other deadlocked thread

yrodiere commented 2 years ago

Here (getCachedValidationProviders) the synchronization is on class monitor - this is not going to help this situation, the HashMap itself is not synchronized & that is leading to the issue.

Those two methods are not synchronizing on the class, but on the GetValidationProviderListAction instance (they're instance methods). What you synchronize on should not matter, as long as all threads accessing the same resource use the same lock, and that's the case for getCachedValidationProviders and cacheValidationProviders.

That being said, I see there's a static clearCache method which bypasses the instance synchronization and indeed only synchronizes on the class. Is it possible that your problem is with that method?

this has been reviewed by JDK team too. they suggest the Validation.java WeakHashMap not being synchronized is the root cause.

I'm not disputing the fact that WeakHashMap is not thread-safe, which is what your leaks seem to point at. WeakHashMap is not thread-safe, that's a fact.

What I'm arguing is that we already synchronize access to that map, so adding Collections.synchronizedMap should not be necessary.

But as I said above, the clearCache method seems problematic, so if you can confirm that method caused your problem, fixing synchronization in that method should fix your problem as well.

gsmet commented 2 years ago

Apart from @yrodiere 's comment, I would repeat what I initially said: these particular methods should generally be called once in the lifetime of an application. I wonder if you are not creating EntityManagerFactory and ValidatorFactory far more often than you should be. I already said that several times but you're ignoring this point and I think it's important.

Not saying that we shouldn't fix a synchronization issue if we end up finding one - for now, it's not clear if there is one. But my point is that even if we had one, you shouldn't experience it, except if you're doing very weird things that I wouldn't recommend if you want your application to perform well.

lananda commented 2 years ago

Apart from @yrodiere 's comment, I would repeat what I initially said: these particular methods should generally be called once in the lifetime of an application. I wonder if you are not creating EntityManagerFactory and ValidatorFactory far more often than you should be. I already said that several times but you're ignoring this point and I think it's important.

Not saying that we shouldn't fix a synchronization issue if we end up finding one - for now, it's not clear if there is one. But my point is that even if we had one, you shouldn't experience it, except if you're doing very weird things that I wouldn't recommend if you want your application to perform well.

I have requested client to confirm on this API usage. I will check that.

yrodiere commented 2 years ago

That being said, I see there's a static clearCache method which bypasses the instance synchronization and indeed only synchronizes on the class.

I opened #182 to fix that. But again, I'm not sure that's your problem, since your thread dump in #180 was incomplete :)

lananda commented 2 years ago

Here (getCachedValidationProviders) the synchronization is on class monitor - this is not going to help this situation, the HashMap itself is not synchronized & that is leading to the issue.

Those two methods are not synchronizing on the class, but on the GetValidationProviderListAction instance (they're instance methods). What you synchronize on should not matter, as long as all threads accessing the same resource use the same lock, and that's the case for getCachedValidationProviders and cacheValidationProviders.

That being said, I see there's a static clearCache method which bypasses the instance synchronization and indeed only synchronizes on the class. Is it possible that your problem is with that method?

this has been reviewed by JDK team too. they suggest the Validation.java WeakHashMap not being synchronized is the root cause.

I'm not disputing the fact that WeakHashMap is not thread-safe, which is what your leaks seem to point at. WeakHashMap is not thread-safe, that's a fact.

What I'm arguing is that we already synchronize access to that map, so adding Collections.synchronizedMap should not be necessary.

But as I said above, the clearCache method seems problematic, so if you can confirm that method caused your problem, fixing synchronization in that method should fix your problem as well.

Thanks, I have updated the stack, the issue is not happening during clearCache.

lananda commented 2 years ago

FullThreadDump_Validtion_Hashmap.txt

gsmet yrodiere I have now attached the full thread dump for this issue. could you please review once. I think the fix in clearCache is unrelated to the current issue, as this clearcache invocation is nowhere in the call stack & this issue is happening during initialization. i have checked the code EntityManagerFactory and ValidatorFactory are called only once in application lifecycle

yrodiere commented 2 years ago

Thanks.

To clarify: the only reason I'm reluctant to merge this patch is that it doesn't seem to fix anything. I'd like to understand how the patch can fix anything, and ideally also the problem itself.

As you can clearly see from your own thread dump, access to the hash map is already protected by a lock:

    at java.util.WeakHashMap.get(WeakHashMap.java:403)
    at javax.validation.Validation$GetValidationProviderListAction.getCachedValidationProviders(Validation.java:368)
    - locked <0x00000005f44e3e10> (a javax.validation.Validation$GetValidationProviderListAction)

And after #182, that lock is being used by every method that could possibly access that map. So wrapping the map with Collections.synchronizedMap is unlikely to change anything to the situation; you'll just have two locks instead of one.

I think the fix in clearCache is unrelated to the current issue, as this clearcache invocation is nowhere in the call stack

As far as I understand, clearCache doesn't need to be visible in your thread dump; it's enough that someone called clearCache concurrently to your stuck thread at some point. I.e. the thread that called clearCache might have moved on, it's not the one being stuck.

i have checked the code EntityManagerFactory and ValidatorFactory are called only once in application lifecycle

That's good.

Did you also check that clearCache is never called until shutdown?

Be aware that this code will be invoked reflectively, see https://github.com/jakartaee/validation/blob/1aac3390f131665dd46fadc4a0b13849d0fa0afe/src/main/java/jakarta/validation/Validation.java#L157-L167

Also, considering the history behind clearCache, it's entirely possible that WebLogic clears the cache by directly retrieving the map through reflection, thereby bypassing the locks. If that's the case, I would suggest reporting a bug to WebLogic. If they go that far, they might as well call clearDefaultValidationProviderResolverCache reflectively, like other application servers such as WildFly do.

Finally... did you try to run your application with an updated bean-validation JAR that includes my patch (#182), to see if the problem still occurs?

lananda commented 2 years ago

this is a known issue about WeakHashMap, Here are few issues where WeakHashMap is not synchronized is reported https://github.com/eclipse-ee4j/metro-jax-ws/issues/61 https://github.com/spring-projects/spring-loaded/issues/194 https://www.adam-bien.com/roller/abien/entry/endless_loops_in_unsychronized_weakhashmap

Here is the problem as it showed up in Nashorn. https://bugs.openjdk.java.net/browse/JDK-8075006 fixed Nashorn. http://hg.openjdk.java.net/jdk9/jdk9/nashorn/rev/78f82d897305

The Java SE API publicizes that WeakHashMap is not synchronized. In this case, it is being called by javax.validation.Validation$GetValidationProviderListAction.

Hence the fix is suggested at Validation. This been suggested after review from JDK developers.

Weblogic is not calling clearCache using reflection, We are unable to reproduce the issue at will, hence unable confirm the fix of #182

yrodiere commented 2 years ago

this is a known issue about WeakHashMap

Yes, I've already followed your links and I understand the problem. The question is whether your patch would solve it, considering we already have an equivalent solution (synchronized keywords on methods accessing the map) in place.

fixed Nashorn. http://hg.openjdk.java.net/jdk9/jdk9/nashorn/rev/78f82d897305

Great minds think alike, then, because they just synchronized all methods accessing the WeakHashMap. Which is what Bean Validation does, too, in javax.validation.Validation$GetValidationProviderListAction.

The Java SE API publicizes that WeakHashMap is not synchronized.

Indeed. We do synchronize our own calls to WeakHashMap, though.

In this case, it is being called by javax.validation.Validation$GetValidationProviderListAction.

... whose methods accessing the WeakHashMap are all synchronized.

Weblogic is not calling clearCache using reflection

Calling clearCache would be fine, at least after my patch. I was saying that if WebLogic clears the cache by directly retrieving the map through reflection, then that is a problem, and WebLogic shouldn't do that. I can't check the source code of WebLogic, though, so I don't know what's going on.

We are unable to reproduce the issue at will, hence unable confirm the fix of https://github.com/jakartaee/validation/pull/182

Ok, so as I understand it, you can't confirm my patch solves your problem. I infer you cannot confirm that your patch solves the problem either.

That would be fine if I could understand how on earth your patch could solve the problem, but I don't. Let's hope someone else does and will be able to merge your patch.