Open swirlds-automation opened 1 year ago
There are two issues with FileChannel read performance:
- File access time (atime) updates on every read (use `mount -o remount,noatime` to avoid)
- FileChannelImpl.readInternal creates contention in NativeThreadSet.add/remove, which use synchronized

author:OlegMazurov, createdAt:2021-11-10T16:50:35Z, updatedAt=2021-11-10T16:50:35Z

Low-hanging fruit is to remove the call that opens the file channel in the constructor. Quick debugging and stats collection show that a non-trivial number of DataFileReader instances are constructed and never used for reads, but the size call is used; shortly after, the instance is closed.
Thus, an underlying code path like the following is seen (note: these calls are distributed between the constructor and the close call).
FileChannel channel = FileChannel.open(path);
long size = channel.size();
channel.close();
One slight optimization would be to avoid opening/closing the file channel and just directly use:
Files.size(path)
to fetch the file size.
Profiling shows the following performance characteristics for the various code:
Files.size(path)
// Profiled: 644336.744 ± 4939.246 ops/s
FileChannel channel = FileChannel.open(path);
long size = channel.size();
channel.close();
// Profiled: 28231.221 ± 557.278 ops/s
And just simply opening/closing the file:
FileChannel channel = FileChannel.open(path);
channel.close();
// Profiled: 28575.806 ± 659.412 ops/s
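The optimization above can be sketched as a small self-contained program. This is a minimal demo (class name `FileSizeDemo` and the temp-file setup are illustrative, not from the original code) showing that `Files.size(path)` returns the same answer as the open/size/close dance, without the channel lifecycle overhead:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileSizeDemo {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("datafile", ".bin");
        Files.write(path, new byte[4096]);

        // Slow path: open a channel just to query the size, then close it.
        long viaChannel;
        try (FileChannel channel = FileChannel.open(path)) {
            viaChannel = channel.size();
        }

        // Fast path: a single stat-style call, no channel open/close.
        long viaFiles = Files.size(path);

        System.out.println(viaChannel == viaFiles);
        Files.delete(path);
    }
}
```

Both calls report 4096 bytes here; only the channel variant pays for native file-handle allocation and release.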
author:rjuang, createdAt:2022-06-24T18:59:33Z, updatedAt=2022-06-24T18:59:33Z
Ray(TODO):
I created a benchmark to check how DataFileReader behaves on concurrent load (multiple files, multiple threads) using different approaches: FileChannel (current implementation), MappedByteBuffer, and MemorySegment (incubator level feature in Java 17).
Here is what the benchmark does: it reads FS bytes in total, spread across F files, from T threads, in blocks of K bytes.
Results are:
| File Size FS | Files F | Threads T | Block Size K | FileChannel s/op | MappedByteBuffer s/op | MemorySegment s/op |
|---|---|---|---|---|---|---|
| 1Gb | 1 | 10 | 4K | 3.1 | 0.9 | 1 |
| | | | 16K | 1.3 | 0.8 | 0.8 |
| | | | 256K | 0.9 | 0.8 | 0.8 |
| | 10 | 10 | 4K | 1.9 | 1.3 | 1.4 |
| | | | 16K | 1.2 | 1.2 | 1.1 |
| | | | 256K | 1 | 1 | 1 |
| | 100 | 10 | 4K | 21 | 26 | 25 |
| | | | 16K | 7.1 | 11 | 11 |
| | | | 256K | 1.7 | 5 | 5 |
| | 200 | 10 | 256K | 2 | 6.5 | 6.8 |
FileChannel is slow when the number of files is low. This may be an artifact of the benchmark logic: it reads FS bytes in total, so if they are all in a single file, it may be slower than when the load is spread across multiple files. With a high number of files, FileChannel performs the best. MappedByteBuffer and MemorySegment perform roughly equally: they work very well with a single file, but not so well with many files.
Without chunkify, I assume our typical read pattern is closer to 4K bytes than to 256K. With block sizes like this and a medium to high number of files, the performance of all three methods is roughly the same.
@OlegMazurov @jasperpotts @rjuang @deepak-swirlds FYI. author:artemananiev, createdAt:2022-09-26T15:02:05Z, updatedAt=2022-09-26T15:08:58Z
One more thing to consider. MappedByteBuffer and MemorySegment have a fundamental API restriction that all indices are integers, so they can't be used for 2Gb+ files. A workaround is to split a large file into 2Gb chunks and use them as if they were separate files. This is why I tested 100 and even 200 files with the benchmarks. FileChannel is different: it can be used with huge files. So I tested one more mode: instead of 100 x 1Gb files, I checked how FileChannel behaves with 1 x 100Gb file. Surprisingly, there is no difference (at least, on my Mac). author:artemananiev, createdAt:2022-09-26T15:35:53Z, updatedAt=2022-09-26T15:35:53Z
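The chunking workaround described above can be sketched roughly as follows. This is a hypothetical illustration (class `ChunkedMapping`, helper `readAt`, and the 1 MiB demo chunk size are all made up for the example; a real implementation would use chunks up to Integer.MAX_VALUE bytes): a file larger than the int index range is mapped as an array of MappedByteBuffer chunks, and a long offset is split into a chunk index and an intra-chunk int offset.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedMapping {
    // Demo chunk size; real code would use something close to Integer.MAX_VALUE.
    static final long CHUNK = 1 << 20; // 1 MiB

    // Map the whole file as a sequence of read-only chunks.
    static MappedByteBuffer[] mapChunks(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            int n = (int) ((size + CHUNK - 1) / CHUNK);
            MappedByteBuffer[] chunks = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long pos = i * CHUNK;
                long len = Math.min(CHUNK, size - pos);
                chunks[i] = ch.map(MapMode.READ_ONLY, pos, len);
            }
            return chunks; // mappings stay valid after the channel is closed
        }
    }

    // Long-indexed read: split the offset into (chunk, intra-chunk) parts.
    static byte readAt(MappedByteBuffer[] chunks, long offset) {
        return chunks[(int) (offset / CHUNK)].get((int) (offset % CHUNK));
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("chunked", ".bin");
        byte[] data = new byte[3 * (1 << 20) + 123]; // just over 3 chunks
        data[data.length - 1] = 42;
        Files.write(p, data);

        MappedByteBuffer[] chunks = mapChunks(p);
        System.out.println(chunks.length);
        System.out.println(readAt(chunks, data.length - 1));
        Files.delete(p);
    }
}
```

The demo file spans four chunks, and the last byte (value 42) is read back through the long-indexed helper.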
author:artemananiev, createdAt:2022-10-17T16:54:53Z, updatedAt=2022-10-17T16:54:53Z
There appears to be some confusion here as to what needs to be done, and where the source of slowness is. I suspect there are some improvements to how the Channel is being used that can yield benefits, however. Memory Mapped files carry quite a bit of overhead in Java, so that may not be the best option, particularly if we want to be deployable on any O/S other than Linux (memory mapped files have wildly varying performance characteristics across other O/S's, in my experience).
I would like to do some research into FileChannel and see if there are any options to use Channels for access (so we don't have the risks of external memory, or the complexity of segmenting files), but still avoid synchronized blocks and thread contention. We should be able to use a single read-only channel from a couple hundred virtual threads (or equivalent) without blocking any of them. Specific items to look into (mostly summarized from above):
Doing this properly will likely take quite some time, however, so we'll need to discuss how much time is available, and whether we make some small improvements immediately and work on the larger improvements later, or try to achieve consistent high performance up front. author:jsync-swirlds, createdAt:2023-01-19T18:45:35Z, updatedAt=2023-01-20T15:11:17Z
Additional items to investigate:
Previous Thoughts:
deserialize: this parameter is questionable and probably can be removed. readDataItem() has two primary use cases. First, it reads real data, e.g. when the data source needs a leaf or internal node from disk. Second, it's used to pre-load data (into the OS file cache) just before it will be read. Please check VirtualRootNode.warm() for details.
author:artemananiev, createdAt:2023-01-20T18:57:23Z, updatedAt=2023-01-20T18:57:23Z
migrated from:
url=https://github.com/swirlds/swirlds-platform/issues/4291
author:jasperpotts, #:4291, createdAt:2021-11-09T01:51:40Z, updatedAt=2023-02-22T02:49:40Z
labels=P2,Performance,Improvement,Platform Data Structures,Migration:Data
The Linux kernel 5.1 and newer includes the io_uring API, which is a much faster way of doing async network and disk IO.
Video from creator explaining in great detail https://youtu.be/-5T4Cjw46ys - direct to point of performance results https://youtu.be/-5T4Cjw46ys?t=1496
Slide deck on why URING https://kernel-recipes.org/en/2019/talks/faster-io-through-io_uring/
Article on why URING is so important https://dzone.com/articles/the-backend-revolution-or-why-io-uring-is-so-impor
White Paper https://kernel.dk/io_uring.pdf
Documentation https://unixism.net/loti/ref-liburing/submission.html#c.io_uring_prep_readv
Missing Manuals - io_uring worker pool https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/
This is a C convenience API over io_uring to make it easier to use, created by the io_uring author and the recommended way to use the io_uring API. https://github.com/axboe/liburing
So far I have found one commercial use of Uring from Java, in QuestDB. It is also OpenSource and Apache licensed so a good place to learn from. https://questdb.io/blog/2022/09/12/importing-300k-rows-with-io-uring/ https://github.com/questdb/questdb/blob/fa23bf9503e97090f4f6cb5122029f4e75ea4c1f/core/src/main/java/io/questdb/std/IOURingImpl.java#L32
This looks like a complete implementation of Uring using Project Panama in Java, biggest issue is code and docs are all in Chinese. https://github-com.translate.goog/dreamlike-ocean/PanamaUring?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp
Looks very interesting, maybe has some tricks to avoid JNI https://github.com/armanbilge/fs2-io_uring
A Java lib around liburing. It is a one-person code base but seems clean and a good starting point. https://github.com/bbeaupain/nio_uring
Netty has an incubator implementation that is quite advanced but does not have support for files. We could propose extending it, maybe? https://netty.io/wiki/google-summer-of-code-ideas-2020.html
LibUV is a generic async IO API that is what NodeJS uses. It has had an io_uring backend in progress since 2018, but the patch has not been approved. If it were approved, this would be a nice way to have fast, maintained, cross-platform async IO support.
https://github.com/libuv/libuv
There is a Java wrapper for libuv
https://github.com/webfolderio/libuv-java
They are working on support in LibUV for io_uring but struggling with API style compatibility (is a function blocking, polling, does it have a timeout, etc.).
https://github.com/libuv/libuv/pull/2322
This is the async kernel API for IO. It has shown in benchmarks to be much faster in many cases, especially for cloud servers. It allows a bunch of IO requests to be queued up and answered in any order with async responses. This lets the OS, disk driver, and firmware stack optimize those requests and handle them in parallel. Benchmarks showed maybe 10x gains for Google GCP instances with local RAID SSDs.
There are a couple libaio libraries we could investigate:
https://activemq.apache.org/components/artemis/documentation/1.0.0/libaio.html
Java AIO lib by IBM, outdated but might be handy as reference or starting point. https://github.com/zrlio/jaio
All the Java APIs for disk access use the older synchronous APIs for disk IO. We know from testing there is a 2-10x gain in IOPS if we move to an async API like AIO or io_uring. Neither of these is available in Java out of the box, so we will need to develop a custom native solution. Maybe an alternative implementation of AsynchronousFileChannel might work as an API (via a new FileSystem implementation through java.nio.file.spi).
We don't know when we are going to need that performance boost but this is a good candidate to get a large boost.
We have tested all the available file APIs in the JDK and not found anything faster than what we are doing with FileChannels. The next step will be to look at using a better low-level async OS API, such as io_uring or AIO, connected to Java with JNI or Project Panama.
Oleg had a good suggestion for a phased plan:
Phase 1 - Add Async Base API
Design and create a new async API that can be used by the lowest level of the database. It can be implemented either using the Java AsynchronousFileChannel or just a thread pool over file channels, but it should be designed so that it can be implemented with the LibAIO or IO_Uring native API. Update the base parts of the DB/VM, like compaction and hashing, to use the new APIs. It might make sense for this API to be in PBJ IO so that it can be fully integrated and work natively with PBJ types like Bytes.
Phase 2 - Prototype implementation using IO_Uring and/or LibAIO
Pick one or more native AIO libraries and implement prototype implementations; maybe try multiple and benchmark on a Linux server with main net spec.
Phase 3 - Extend Async API usage up to App Layer
Expose the async API up to the VirtualMap layer so it can be used by the app if we think there is a performance benefit.
Phase 4 - Final implementation of native code for IO_Uring etc.
Make the chosen native implementation production ready.
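A Phase 1 fallback implementation over the JDK's AsynchronousFileChannel might look something like this. This is only a sketch of a possible API shape, not the planned design: the class name `AsyncReadSketch`, the `readAsync` signature, and the CompletableFuture-based result are all assumptions for illustration. A native io_uring or LibAIO backend would implement the same shape behind the scenes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

public class AsyncReadSketch {

    // Hypothetical base-API shape: read `length` bytes at `offset`,
    // completing a future when the OS delivers the data.
    static CompletableFuture<ByteBuffer> readAsync(
            AsynchronousFileChannel ch, long offset, int length) {
        ByteBuffer buf = ByteBuffer.allocate(length);
        CompletableFuture<ByteBuffer> result = new CompletableFuture<>();
        ch.read(buf, offset, buf, new CompletionHandler<Integer, ByteBuffer>() {
            @Override public void completed(Integer bytesRead, ByteBuffer b) {
                b.flip(); // make the read bytes available for consumption
                result.complete(b);
            }
            @Override public void failed(Throwable t, ByteBuffer b) {
                result.completeExceptionally(t);
            }
        });
        return result;
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("async", ".bin");
        Files.write(p, "hello async io".getBytes());
        try (AsynchronousFileChannel ch =
                AsynchronousFileChannel.open(p, StandardOpenOption.READ)) {
            ByteBuffer b = readAsync(ch, 6, 5).join(); // bytes 6..10 = "async"
            byte[] out = new byte[b.remaining()];
            b.get(out);
            System.out.println(new String(out));
        }
        Files.delete(p);
    }
}
```

Because each read completes independently, many such reads can be in flight at once from a single channel, which is the property the native io_uring/AIO backends would then make cheap.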
Formally bug was: