Open swirlds-automation opened 1 year ago
There are two issues with FileChannel read performance:
- File access time (atime) updates on every read (use `mount -o remount,noatime` to avoid)
- FileChannelImpl.readInternal creates contention in NativeThreadSet.add/remove, which use synchronized

author:OlegMazurov, createdAt:2021-11-10T16:50:35Z, updatedAt=2021-11-10T16:50:35Z

Low-hanging fruit is to remove the call that opens the file channel in the constructor. Quick debugging and stats collection show that a non-trivial number of DataFileReader instances are constructed and never used for reads, but the size call is used; shortly after, the instance is closed.
Thus, an underlying code path like the following is seen (note: these calls are distributed between the constructor and the close call).
FileChannel channel = FileChannel.open(path);
long size = channel.size();
channel.close();
One slight optimization would be to avoid opening/closing the file channel and just directly use:
Files.size(path)
to fetch the file size.
Profiling shows the following performance characteristics for the various code:
Files.size(path)
// Profiled: 644336.744 ± 4939.246 ops/s
FileChannel channel = FileChannel.open(path);
long size = channel.size();
channel.close();
// Profiled: 28231.221 ± 557.278 ops/s
And just simply opening/closing the file:
FileChannel channel = FileChannel.open(path);
channel.close();
// Profiled: 28575.806 ± 659.412 ops/s
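The optimization above can be sketched as a small self-contained program. This is a minimal demo (class name `FileSizeDemo` and the temp-file setup are illustrative, not from the original code) showing that `Files.size(path)` returns the same answer as the open/size/close dance, without the channel lifecycle overhead:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileSizeDemo {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("datafile", ".bin");
        Files.write(path, new byte[4096]);

        // Slow path: open a channel just to query the size, then close it.
        long viaChannel;
        try (FileChannel channel = FileChannel.open(path)) {
            viaChannel = channel.size();
        }

        // Fast path: a single stat-style call, no channel open/close.
        long viaFiles = Files.size(path);

        System.out.println(viaChannel == viaFiles);
        Files.delete(path);
    }
}
```

Both calls report 4096 bytes here; only the channel variant pays for native file-handle allocation and release.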
author:rjuang, createdAt:2022-06-24T18:59:33Z, updatedAt=2022-06-24T18:59:33Z
Ray(TODO):
I created a benchmark to check how DataFileReader behaves on concurrent load (multiple files, multiple threads) using different approaches: FileChannel (current implementation), MappedByteBuffer, and MemorySegment (incubator level feature in Java 17).
Here is what the benchmark does: it reads FS bytes in total, spread across F files, from T threads, in blocks of K bytes.
Results are:
| File Size FS | Files F | Threads T | Block Size K | FileChannel s/op | MappedByteBuffer s/op | MemorySegment s/op |
|---|---|---|---|---|---|---|
| 1Gb | 1 | 10 | 4K | 3.1 | 0.9 | 1 |
| | | | 16K | 1.3 | 0.8 | 0.8 |
| | | | 256K | 0.9 | 0.8 | 0.8 |
| | 10 | 10 | 4K | 1.9 | 1.3 | 1.4 |
| | | | 16K | 1.2 | 1.2 | 1.1 |
| | | | 256K | 1 | 1 | 1 |
| | 100 | 10 | 4K | 21 | 26 | 25 |
| | | | 16K | 7.1 | 11 | 11 |
| | | | 256K | 1.7 | 5 | 5 |
| | 200 | 10 | 256K | 2 | 6.5 | 6.8 |
FileChannel is slow when the number of files is low. This may be an artifact of the benchmark logic: it reads FS bytes in total, so if they are all in a single file, it may be slower than when the load is spread across multiple files. With a high number of files, FileChannel performs the best. MappedByteBuffer and MemorySegment perform roughly equally: they work very well with a single file, but not so well with many files.
Without chunkify, I assume our typical read pattern is closer to 4K bytes than to 256K. With block sizes like this and a medium to high number of files, the performance of all three methods is roughly the same.
@OlegMazurov @jasperpotts @rjuang @deepak-swirlds FYI. author:artemananiev, createdAt:2022-09-26T15:02:05Z, updatedAt=2022-09-26T15:08:58Z
One more thing to consider. MappedByteBuffer and MemorySegment have a fundamental API restriction that all indices are integers, so they can't be used for 2Gb+ files. A workaround is to split a large file into 2Gb chunks and use them as if they were separate files. This is why I tested 100 and even 200 files with the benchmarks. FileChannel is different: it can be used with huge files. So I tested one more mode: instead of 100 x 1Gb files, I checked how FileChannel behaves with 1 x 100Gb file. Surprisingly, there is no difference (at least, on my Mac). author:artemananiev, createdAt:2022-09-26T15:35:53Z, updatedAt=2022-09-26T15:35:53Z
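The chunking workaround described above can be sketched roughly as follows. This is a hypothetical illustration (class `ChunkedMapping`, helper `readAt`, and the 1 MiB demo chunk size are all made up for the example; a real implementation would use chunks up to Integer.MAX_VALUE bytes): a file larger than the int index range is mapped as an array of MappedByteBuffer chunks, and a long offset is split into a chunk index and an intra-chunk int offset.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedMapping {
    // Demo chunk size; real code would use something close to Integer.MAX_VALUE.
    static final long CHUNK = 1 << 20; // 1 MiB

    // Map the whole file as a sequence of read-only chunks.
    static MappedByteBuffer[] mapChunks(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            int n = (int) ((size + CHUNK - 1) / CHUNK);
            MappedByteBuffer[] chunks = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long pos = i * CHUNK;
                long len = Math.min(CHUNK, size - pos);
                chunks[i] = ch.map(MapMode.READ_ONLY, pos, len);
            }
            return chunks; // mappings stay valid after the channel is closed
        }
    }

    // Long-indexed read: split the offset into (chunk, intra-chunk) parts.
    static byte readAt(MappedByteBuffer[] chunks, long offset) {
        return chunks[(int) (offset / CHUNK)].get((int) (offset % CHUNK));
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("chunked", ".bin");
        byte[] data = new byte[3 * (1 << 20) + 123]; // just over 3 chunks
        data[data.length - 1] = 42;
        Files.write(p, data);

        MappedByteBuffer[] chunks = mapChunks(p);
        System.out.println(chunks.length);
        System.out.println(readAt(chunks, data.length - 1));
        Files.delete(p);
    }
}
```

The demo file spans four chunks, and the last byte (value 42) is read back through the long-indexed helper.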
author:artemananiev, createdAt:2022-10-17T16:54:53Z, updatedAt=2022-10-17T16:54:53Z
There appears to be some confusion here as to what needs to be done, and where the source of slowness is. I suspect there are some improvements to how the Channel is being used that can yield benefits, however. Memory Mapped files carry quite a bit of overhead in Java, so that may not be the best option, particularly if we want to be deployable on any O/S other than Linux (memory mapped files have wildly varying performance characteristics across other O/S's, in my experience).
I would like to do some research into FileChannel and see if there are any options to use Channels for access (so we don't have the risks of external memory, or the complexity of segmenting files), but still avoid synchronized blocks and thread contention. We should be able to use a single read-only channel from a couple hundred virtual threads (or equivalent) without blocking any of them. Specific items to look into (mostly summarized from above):
Doing this properly will likely take quite some time, however, so we'll need to discuss how much time is available, and whether we make some small improvements immediately and work on the larger improvements later, or try to achieve consistent high performance up front. author:jsync-swirlds, createdAt:2023-01-19T18:45:35Z, updatedAt=2023-01-20T15:11:17Z
Additional items to investigate:
Previous Thoughts:
deserialize: this parameter is questionable and probably can be removed. readDataItem() has two primary use cases. First, it reads real data, e.g. when the data source needs a leaf or internal node from disk. Second, it's used to pre-load data (into the OS file cache) just before it will be read. Please check VirtualRootNode.warm() for details.
author:artemananiev, createdAt:2023-01-20T18:57:23Z, updatedAt=2023-01-20T18:57:23Z
migrated from:
url=https://github.com/swirlds/swirlds-platform/issues/4291
author:jasperpotts, #:4291, createdAt:2021-11-09T01:51:40Z, updatedAt=2023-02-22T02:49:40Z
labels=P2,Performance,Improvement,Platform Data Structures,Migration:Data
The Linux kernel 5.1 and newer includes the io_uring API, which is a much faster way of doing async network and disk IO.
Video from creator explaining in great detail https://youtu.be/-5T4Cjw46ys - direct to point of performance results https://youtu.be/-5T4Cjw46ys?t=1496
Slide deck on why URING https://kernel-recipes.org/en/2019/talks/faster-io-through-io_uring/
Article on why URING is so important https://dzone.com/articles/the-backend-revolution-or-why-io-uring-is-so-impor
White Paper https://kernel.dk/io_uring.pdf
Documentation https://unixism.net/loti/ref-liburing/submission.html#c.io_uring_prep_readv
Missing Manuals - io_uring worker pool https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/
This is a C convenience API over io_uring to make it easier to use, created by the io_uring author and the recommended way to use the io_uring API. https://github.com/axboe/liburing
So far I have found one commercial use of Uring from Java, in QuestDB. It is also OpenSource and Apache licensed so a good place to learn from. https://questdb.io/blog/2022/09/12/importing-300k-rows-with-io-uring/ https://github.com/questdb/questdb/blob/fa23bf9503e97090f4f6cb5122029f4e75ea4c1f/core/src/main/java/io/questdb/std/IOURingImpl.java#L32
This looks like a complete implementation of Uring using Project Panama in Java, biggest issue is code and docs are all in Chinese. https://github-com.translate.goog/dreamlike-ocean/PanamaUring?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp
Looks very interesting, maybe has some tricks to avoid JNI https://github.com/armanbilge/fs2-io_uring
A Java lib around liburing. It is a one-person code base but seems clean and a good starting point. https://github.com/bbeaupain/nio_uring
Netty has an incubator implementation that is quite advanced but does not have support for files. We could propose extending it, maybe? https://netty.io/wiki/google-summer-of-code-ideas-2020.html
LibUV is a generic async IO API that is what NodeJS uses. It has had an io_uring backend in progress since 2018, but the patch has not been approved. If it were approved, this would be a nice way to have fast, maintained, cross-platform async IO support.
https://github.com/libuv/libuv
There is a Java wrapper for libuv
https://github.com/webfolderio/libuv-java
They are working on support in LibUV for io_uring but struggling with API style compatibility (is a function blocking, polling, does it have a timeout, etc.).
https://github.com/libuv/libuv/pull/2322
This is the async kernel API for IO. It has shown in benchmarks to be much faster in many cases, especially for cloud servers. It allows a bunch of IO requests to be queued up and answered in any order with async responses. This lets the OS, disk driver, and firmware stack optimize those requests and handle them in parallel. Benchmarks showed maybe 10x gains for Google GCP instances with local RAID SSDs.
There are a couple libaio libraries we could investigate:
https://activemq.apache.org/components/artemis/documentation/1.0.0/libaio.html
Java AIO lib by IBM, outdated but might be handy as reference or starting point. https://github.com/zrlio/jaio
All the Java APIs for disk access use the older synchronous APIs for disk IO. We know from testing there is a 2-10x gain in IOPS if we move to an async API like AIO or io_uring. Neither of these is available in Java out of the box, so we will need to develop a custom native solution. Maybe an alternative implementation of AsynchronousFileChannel might work as an API (via a new FileSystem implementation through java.nio.file.spi).
We don't know when we are going to need that performance boost but this is a good candidate to get a large boost.
We have tested all the available file APIs in the JDK and not found anything faster than what we are doing with FileChannels. The next step will be to look at using a better low-level async OS API, such as io_uring or AIO, connected to Java with JNI or Project Panama.
Oleg had a good suggestion for a phased plan:
Phase 1 - Add Async Base API
Design and create a new async API that can be used by the lowest level of the database. It can be implemented either using the Java AsynchronousFileChannel or just a thread pool over file channels, but it should be designed so that it can be implemented with the LibAIO or IO_Uring native API. Update the base parts of the DB/VM, like compaction and hashing, to use the new APIs. It might make sense for this API to be in PBJ IO so that it can be fully integrated and work natively with PBJ types like Bytes.
Phase 2 - Prototype implementation using IO_Uring and/or LibAIO
Pick one or more native AIO libraries and implement prototype implementations; maybe try multiple and benchmark on a Linux server with main net spec.
Phase 3 - Extend Async API usage up to App Layer
Expose the async API up to the VirtualMap layer so it can be used by the app if we think there is a performance benefit.
Phase 4 - Final implementation of native code for IO_Uring etc.
Make the chosen native implementation production ready.
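A Phase 1 fallback implementation over the JDK's AsynchronousFileChannel might look something like this. This is only a sketch of a possible API shape, not the planned design: the class name `AsyncReadSketch`, the `readAsync` signature, and the CompletableFuture-based result are all assumptions for illustration. A native io_uring or LibAIO backend would implement the same shape behind the scenes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

public class AsyncReadSketch {

    // Hypothetical base-API shape: read `length` bytes at `offset`,
    // completing a future when the OS delivers the data.
    static CompletableFuture<ByteBuffer> readAsync(
            AsynchronousFileChannel ch, long offset, int length) {
        ByteBuffer buf = ByteBuffer.allocate(length);
        CompletableFuture<ByteBuffer> result = new CompletableFuture<>();
        ch.read(buf, offset, buf, new CompletionHandler<Integer, ByteBuffer>() {
            @Override public void completed(Integer bytesRead, ByteBuffer b) {
                b.flip(); // make the read bytes available for consumption
                result.complete(b);
            }
            @Override public void failed(Throwable t, ByteBuffer b) {
                result.completeExceptionally(t);
            }
        });
        return result;
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("async", ".bin");
        Files.write(p, "hello async io".getBytes());
        try (AsynchronousFileChannel ch =
                AsynchronousFileChannel.open(p, StandardOpenOption.READ)) {
            ByteBuffer b = readAsync(ch, 6, 5).join(); // bytes 6..10 = "async"
            byte[] out = new byte[b.remaining()];
            b.get(out);
            System.out.println(new String(out));
        }
        Files.delete(p);
    }
}
```

Because each read completes independently, many such reads can be in flight at once from a single channel, which is the property the native io_uring/AIO backends would then make cheap.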
Formally bug was: