jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License
10.87k stars 2.17k forks source link

ConstrainableInputStream pins virtual threads (Java-21) #2054

Closed malkusch closed 9 months ago

malkusch commented 9 months ago

Sample to reproduce:

InputStream in = load();
Jsoup.parse(in, "UTF-8", "http://example.org/")

Running that with -Djdk.tracePinnedThreads=full in Java-21(on a virtual thread) will give the following print:

Thread[#35,ForkJoinPool-1-worker-4,5,CarrierThreads]
  …
  java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:405) <== monitors:1
  org.jsoup.internal.ConstrainableInputStream.read(ConstrainableInputStream.java:64)
  …

BufferedInputStream was refactored to be virtual thread friendly, however with a caveat: It becomes unfriendly when inherited (which ConstrainableInputStream does):

    public BufferedInputStream(InputStream in, int size) {
        …
        if (getClass() == BufferedInputStream.class) { // false in case of ConstrainableInputStream
            lock = InternalLock.newLockOrNull();
            …
        } else {
            lock = null;
            …
        }
    }
…
    public int read(byte[] b, int off, int len) throws IOException {
        if (lock != null) {
            … // Virtual Thread friendly code
        } else {
            synchronized (this) {
                return implRead(b, off, len); // Line 405
            }
        }
    }

With Loom gaining popularity with the latest LTS release, do you want to reconsider ConstrainableInputStream inheriting from BufferedInputStream? A possible mitigation might be to use composition instead, i.e. inherit from FilterInputStream and use a new instance of BufferedInputStream as source:

class ConstrainableInputStream extends FilterInputStream {
…
  static ConstrainableInputStream wrap(InputStream in,…) {
    var buffered = new BufferedInputStream(in);
    return new ConstrainableInputStream(buffered);
  }
}
jhy commented 9 months ago

Thanks for raising this.

After first review, a few notes:

It's kind of annoying that we can't just supply a lock object in the Buffered constructor, and keep the rest as-is. We could use the multi-jar support to do that if it were a new constructor option.

jhy commented 9 months ago

OK, fixed!

It would be great if you could validate this by running a test on 1.17.1-SNAPSHOT. (git pull; mvn install).

I verified that the pin warnings from Jsoup.connect.get() are gone when hitting a remote URL.

It would be good to identify a way to have a integration test for this.

malkusch commented 9 months ago

Thanks a ton for addressing this so timely. I will validate the patch this weekend and come back here.

A local file, or a local URL, don't hit it,

I think that might be related to another limitation in Loom. Some FS operations might not unload a blocking call. I don't remember where I read that, but a keyword was the missing io_uring support in the JDK. Here's an article hinting that (under "Blocking and unmounting").

It would be good to identify a way to have a integration test for this.

I had also that thought. There are JFR events to help that, but it would be nice (if possible at all) if a testing framework would provide such assertions. I will give it a thought and maybe raise an issue at another place, so that everybody can benefit from that.

Thanks again for the fix.

malkusch commented 9 months ago

Here's my promised updated: I can confirm that 1.17.1-SNAPSHOT fixed the issue. No more thread pinning events. Thanks a ton.

jhy commented 9 months ago

That's great, thanks for confirming!

jhy commented 1 month ago

(Ah the way that BufferedInputStream is effectively locked for extension past Java 21 is bugging me. As part of #2186 I want to be able to replace the byte[] buffer with a reused buffer, vs creating a new one for each new InputStream read. Which would be easy enough as it's a protected field, designed to be accessible. But because of the instance check in the constructor, it's no longer viable. It would be great if there was a default non-synchronized BufferedInputStream we could re-use, which didn't include such extension restrictions. As it stands I guess we need to implement one directly which unfortunately adds a bit more to the jar size. Other ideas are of course welcomed!)