Build custom buffered input stream that can reuse tmp read buffers across indexing ops

franz1981 commented 1 year ago

Deloying a wildfly instance with many JPA models shows that buffered files reads allocate, by default, a 8K byte[] buffer per each indexed file, causing many out of TLAB allocations.

Ideally we could reuse such buffer across different indexed ops, saving zeroing and TLAB bandwidth that could be used elsewhere.

I'm assuming 2 things:

the same Indexer to be reused
Indexer is a single-threaded class that can just borrow/release the same buffer again and again without being bothered by concurrency issues

@Ladicek these are valid assumptions?

franz1981 commented 1 year ago

I'll add more commits addressing other allocation(s) hot paths, but probably these were happening because of this big byte[] allocations, leaving scarse bandwith for further allocations, ending up allocating in a new TLAB (or out of the TLAB).

I'll provided a JMH bench and/or an end 2 end results if required/possible

franz1981 commented 1 year ago

The relevant stack traces I've found are:

https://github.com/wildfly/wildfly-core/blob/main/server/src/main/java/org/jboss/as/server/deployment/annotation/ResourceRootIndexer.java#L84

that seems to match the assumption of the same Indexer reused over and over, but sadly...

https://github.com/hibernate/hibernate-orm/blob/6cf9d2d4801962b659eb4fb936a58abd461fbfc1/hibernate-core/src/main/java/org/hibernate/boot/archive/scan/spi/ClassFileArchiveEntryHandler.java#L64

nope. I have to think if it worth to change how the Indexer is used on Hibernate (making it to be reused/shared somehow, if possible) and/or change the optimization by providing an indexer configuration with the initial buffer size (or the buffer itself, provided externally). @Sanne any suggestion?

scottmarlow commented 1 year ago

I'd like to build jandex locally with this change to see if I can see improvements on deployment time of https://github.com/scottmarlow/tribe-krd-quarkus/tree/wildfly. I tried building with different JDK versions (jdk 1.6/1.8/11 but still get failure like:

Failed to execute goal net.revelc.code:impsort-maven-plugin:1.6.2:sort (sort-imports) on project jandex: Execution sort-imports of goal net.revelc.code:impsort-maven-plugin:1.6.2:sort failed: A required class was missing while executing net.revelc.code:impsort-maven-plugin:1.6.2:sort: org/codehaus/plexus/util/DirectoryScanner [ERROR] ----------------------------------------------------- [ERROR] realm = plugin>net.revelc.code:impsort-maven-plugin:1.6.2--1736414650 [ERROR] strategy = org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy [ERROR] urls[0] = file:/home/smarlow/.m2/repository/net/revelc/code/impsort-maven-plugin/1.6.2/impsort-maven-plugin-1.6.2.jar [ERROR] urls[1] = file:/home/smarlow/.m2/repository/com/github/javaparser/javaparser-core/3.22.1/javaparser-core-3.22.1.jar [ERROR] urls[2] = file:/home/smarlow/.m2/repository/com/google/guava/guava/30.1.1-jre/guava-30.1.1-jre.jar [ERROR] urls[3] = file:/home/smarlow/.m2/repository/com/google/guava/failureaccess/1.0.1/failureaccess-1.0.1.jar [ERROR] urls[4] = file:/home/smarlow/.m2/repository/com/google/guava/listenablefuture/9999.0-empty-to-avoid-conflict-with-guava/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar [ERROR] urls[5] = file:/home/smarlow/.m2/repository/com/google/code/findbugs/jsr305/3.0.2/jsr305-3.0.2.jar [ERROR] urls[6] = file:/home/smarlow/.m2/repository/org/checkerframework/checker-qual/3.8.0/checker-qual-3.8.0.jar [ERROR] urls[7] = file:/home/smarlow/.m2/repository/com/google/errorprone/error_prone_annotations/2.5.1/error_prone_annotations-2.5.1.jar [ERROR] urls[8] = file:/home/smarlow/.m2/repository/com/google/j2objc/j2objc-annotations/1.3/j2objc-annotations-1.3.jar [ERROR] Number of foreign imports: 1 [ERROR] import: Entry[import from realm ClassRealm[project>io.smallrye:jandex:3.0.6-SNAPSHOT, parent: ClassRealm[maven.api, parent: null]]] [ERROR]

Any suggestions on how to avoid ^ failure? Thanks!

scottmarlow commented 1 year ago

Thanks @jamezp for pointing out that Maven 3.9 has ^ problem (3.8.1 works fine now). Also thanks for suggesting other solution of mvn clean install -Dversion.impsort.plugin=1.8.0

Ladicek commented 1 year ago

Yeah, I need to attend to #298 some day, this is pretty unfortunate :-/

When it comes to Indexer assumptions, those are correct: it is never supposed to be used concurrently, and it is supposed to be reused, though it doesn't have to. It is relatively common for a single Indexer to be used to index just one class. With Jandex 2.x, an Indexer returned a ClassInfo after indexing each class, but Jandex 3.x does some postprocessing during Indexer.complete(), so it doesn't return a ClassInfo for the just-indexed class anymore.

We could have an Indexer take an instance of some class that would manage those buffers I guess?

franz1981 commented 1 year ago

We could have an Indexer take an instance of some class that would manage those buffers I guess?

Today I am on PTO but yes, that would be the perfect deal - and would help the hibernate case as well. I see there are others allocations but if we solve the bigger one, probably it won't be necessary to fix them too

Ladicek commented 1 year ago

Enjoy your PTO! :-) I'm on vacation most of next week (except Mon), but feel free to ping me the week after that.

And thanks for looking into this -- I didn't have much time (and I lack the experience :laughing:), so the only thing I noticed is that a lot of time is spent on java.io.DataInputStream parsing the modified UTF-8 encoding (which likely only exists because the JVM is written in a language with null-terminated strings...). I didn't even look at allocations.

franz1981 commented 1 year ago

@Ladicek Not a big fan of what's I've done here, but let me provide some context

74f33e9bb5dbc35b438b5f9230956506002ad1fa allows users of Indexer to recycle the underlying read stream without being aware of the existing APIs and us to not rewrite a BufferedInputStream impl, although not complex, will still force us to maintain more code (adding more tests etc etc).

With this solution:

if Indexer is reused by user code (the wildfly use case): it will benefit from reusing buffered input streams (and the underlying buffer)
if Indexer is not reused (hibernate): it will be a duty of user code to snapshot it's memento, storing it somewhere, and providing it to any subsequent Indexer instance.

The reason why it will use Object is both to save users to mess up with it and to allow us, in near future (maybe in this PR) to expose a whole set of maps/byte[] in order to be reused again and again (saving more and more memory), wdyt?

Ladicek commented 1 year ago

Okay, this is pretty nice actually!

On the API front, I think we can hide the "memento object" by providing an IndexerFactory for the non-reused use case. It should be pretty easy for Hibernate to reuse the IndexerFactory, and then it doesn't matter that they don't reuse the Indexer.

If you're fine with the improvements now, I can take it from here and massage the API a little bit per above.

franz1981 commented 1 year ago

I would like to run few benchs and see if together with a modified version of hibernate I'm getting the expected speedup. If yes, we can move this forward and addressing later other (minors) byte[] allocation(s) into the same memento (and decodeUTF8, that doesn't appear in the test with wildfly run by @scottmarlow but maybe you got more data points why you saw that, for quarkus, likely).

Ladicek commented 1 year ago

Okay, just let me know when you think this is ready. For my testing, I assembled a somewhat bigger JAR (160K classes, 250 MB) and indexing that shows decodeUtf8Entry as taking quite a lot of time. But I'm hardly an expert, I might have measured it wrong.

franz1981 commented 1 year ago

and indexing that shows decodeUtf8Entry as taking quite a lot of time

I believe the reason for that is due to the many intermediate obj (with their buffers) created to decode in what (very likely) is just a latin String: I think I'll create a fast path for this to cover the most common case, but in a different commit/pr, likely

franz1981 commented 1 year ago

I've further extended the pool concept, but is very tied to the way we use the tmp buffers for indexing, and some are not as safe as I wish i.e. very tied to how processConstantPool fill data while in others I won't reset the tmp buffers content, meaning that a misbehaving class could potentially cause read into something that would have thrown an out of bounds exception with non pooled buffers, but now can read "junk" data coming from a previous recycling.

The easier solution is to cleanup everything when borrowed again, but it will cause some perf hit (to be measured). I'm paused working on other stuff, but feel free to add commits to make this one better from an API perspective, or let me know if there are other parts you want me to improve.

Ladicek commented 1 year ago

Thank you! As mentioned above, I'm on PTO starting tomorrow until end of week, so I'll get to this next week. I'd also like to learn more about performance work, so I'll be happy to help with this (once I'm back).

franz1981 commented 1 year ago

@scottmarlow I should add few tests but this could be used already as it is, improving the wildfly use case at https://github.com/wildfly/wildfly-core/blob/main/server/src/main/java/org/jboss/as/server/deployment/annotation/ResourceRootIndexer.java#L84

while

https://github.com/hibernate/hibernate-orm/blob/6cf9d2d4801962b659eb4fb936a58abd461fbfc1/hibernate-core/src/main/java/org/hibernate/boot/archive/scan/spi/ClassFileArchiveEntryHandler.java#L61-71

should have a new field Indexer.Factory indexerFactory in ClassFileArchiveEntryHandler that should perform

        private ClassDescriptor toClassDescriptor(ArchiveEntry entry) {
        try (InputStream inputStream = entry.getStreamAccess().accessInputStream()) {
                        // DELETED -> Indexer indexer = new Indexer();
                        // REPLACED WITH AN INDEXER REUSING BUFFERS
                        Indexer indexer = indexerFactory.newIndexer();
            ClassSummary classSummary = indexer.indexWithSummary( inputStream );
            Index index = indexer.complete();
            return toClassDescriptor( classSummary, index, entry );
        }
        catch (IOException e) {
            throw new ArchiveException( "Could not build ClassInfo", e );
        }
    }

@Ladicek can confirm that's the right way to do it or not...

@scottmarlow please remember that a IndexerFactory cannot be used in parallel by different threads i.e. is not thread-safe!

franz1981 commented 1 year ago

I've sadly added 2 Arrays::fill while borrowing 2 recycled arrays, but I'm sure this could be avoided by lazy checking if there's no usage of the borrowed array instances, saving precious cpu cycles.

In some subsequent commits I would like to remove all the field loads for the fields (by passing them as method parameters):

    private byte[] constantPool;
    private int[] constantPoolOffsets;
    private byte[] constantPoolAnnoAttrributes;

in many tiny methods, because it:

is super slow while interpreted because of field load
is super slow due to additional bound checks happening very time
it likely nullify the benefit of inlining these small methods using such fields

franz1981 commented 1 year ago

Good news! The allocation profiling data for this PR are very promising: the last addressable thing is

that I was hoping was addressed by reusing the same constantPool buffer again and again, and hoping the heuristic that enlarge it 20 times the required size to be enough were enough, but I was wrong. Maybe worth thinking to a different heuristic here

franz1981 commented 1 year ago

The last versions of the changes (that include a different growing strategy that mimic what array list does) reports, for wildfly startup in a dummy project an allocation reductions of 1.2 GB from 3.9 GB i.e. new allocations are 2.68 GB and almost all the out of TLAB allocations are gone as well.

CPU-wise I think that if @Ladicek got something that can be turned into a micro/smoke perf test for this that heavily make use of decodeUtf8 we will see some gain there as well

franz1981 commented 1 year ago

I've prepared a microbenchmark that load a big jar (~25 MB) and place it into a tmpfs folder to avoid I/O to be a relevant factor here and I see that I'm getting a "small" improvement (mem-wise, huge, in loading speed, not "wow") i.e.

with the changes in the PR:

Benchmark                          (count)                                  (jarFile)  Mode  Cnt      Score   Error  Units
IndexerBenchmark.indexJars               1  /tmp/quarkus-tribe-krd-1.0.0-SNAPSHOT.jar    ss    2  21174.930          ms/op

before the changes in the PR:

Benchmark                          (count)                                  (jarFile)  Mode  Cnt      Score   Error  Units
IndexerBenchmark.indexJars               1  /tmp/quarkus-tribe-krd-1.0.0-SNAPSHOT.jar    ss    2  23488.840          ms/op

The microbenchmark is using SingleShot benchmark run, loading that single jar file, to mimic real world scenario (no hot code compiled at last tiered comp level etc etc) and still, appear disappointing.

What I've discovered so far is that:

beside the small load speed improvement (%10%), the actual cpu usage improvement is significant: we now use 87% of the cpu cycles while loading it faster
the test I've build is using Index of(File... files) that's not likely the best case to test the imrprovements made, given that it perform its own class file unzipping and file walking - meaning that it won't use the optimized/pooled buffered input stream provided on the shared Indexer
org/jboss/jandex/StrongInternPool is performing a chain of instanceof that breaking the profiler I'm using, likely due to https://github.com/async-profiler/async-profiler/issues/673

Below the flamegraph of a benchmark run:

it shows the StrongInternPool usage as a top cpu cycles consumer, and some broken stack frames due to instanceof abuse in the left bottom

The suggestion would be to have separate factory method for intern pools hiding an enum field that can be used as n hint to always apply (per-type) a specific equals/hash method expecting the right type(s) upfront, hence saving type checks to discover it each time.

Ladicek commented 1 year ago

I'm closing this, because I just submitted #303, which uses most of the commits from here, except:

~~the commit "Modify sizeToFit by reusing the same enlarging strategy of ArrayList" is dropped as it doesn't bring measurable improvement, sometimes it even leads to slightly worse performance;~~
the Indexer.Factory is dropped, because it isn't needed; after calling Indexer.complete(), the Indexer instance is fresh as new and can be used to build a new, independent index -- so the thing that should be shared is not a factory, it's the indexer directly.

I also added a few commits of my own, created with a lot of @franz1981's help -- thanks!

Together, I've observed roughly 25% speedup on indexing a 256 MB JAR containing more than 160K classes. Decent improvement for now, and I need to focus on other things as well :-)

smallrye / jandex

Build custom buffered input stream that can reuse tmp read buffers across indexing ops #302