hashgraph / hedera-services

Crypto, token, consensus, file, and smart contract services for the Hedera public ledger
Apache License 2.0
283 stars 124 forks

Database should use Async file APIs #5201

Open swirlds-automation opened 1 year ago

swirlds-automation commented 1 year ago

All the Java APIs for disk access use the older synchronous APIs for disk IO. We know from testing that there is a 2-10x gain in IOPS if we move to an async API like AIO or io_uring. Neither of these is available in Java out of the box, so we will need to develop a custom native solution. Maybe an alternative implementation of AsynchronousFileChannel might work as an API (via a new FileSystem implementation through java.nio.file.spi).

We don't know when we will need that performance boost, but this is a good candidate for a large gain.

We have tested all the available file APIs in the JDK and have not found anything faster than what we are doing with FileChannels. The next step will be to look at using a better low-level async OS API, such as io_uring or AIO, connected to Java with JNI or Project Panama.

Oleg had a good suggestion for a phased plan:

Phase 1 - Add Async Base API

Design and create a new async API that can be used by the lowest level of the database. It can be implemented either using Java's AsynchronousFileChannel or just a thread pool over file channels, but it should be designed so that it can later be implemented with the libaio or io_uring native APIs. Update the base parts of the DB/VM, such as compaction and hashing, to use the new API. It might make sense for this API to live in PBJ IO so that it can be fully integrated and work natively with PBJ types like Bytes.
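As a rough illustration of the Phase 1 idea, a thread-pool-over-FileChannel implementation behind an async interface might look like the sketch below. All names here (AsyncFileReader, PooledAsyncFileReader) are hypothetical, not existing Swirlds or PBJ types; the point is that the interface could later be re-implemented over libaio or io_uring without changing callers. Positional reads avoid touching the channel's shared position.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;
import java.util.concurrent.*;

// Hypothetical Phase 1 async base API: a future-returning positional read.
interface AsyncFileReader extends AutoCloseable {
    CompletableFuture<ByteBuffer> readAt(long offset, int length);
    void close() throws IOException;
}

// Simple backing implementation: a thread pool over one shared read-only
// FileChannel. A native io_uring/libaio implementation could replace this
// class while keeping the interface above unchanged.
final class PooledAsyncFileReader implements AsyncFileReader {
    private final FileChannel channel;
    private final ExecutorService pool;

    PooledAsyncFileReader(Path path, int threads) throws IOException {
        this.channel = FileChannel.open(path, StandardOpenOption.READ);
        this.pool = Executors.newFixedThreadPool(threads);
    }

    @Override
    public CompletableFuture<ByteBuffer> readAt(long offset, int length) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                ByteBuffer buf = ByteBuffer.allocate(length);
                // Positional read: does not use the channel's shared position
                while (buf.hasRemaining()) {
                    if (channel.read(buf, offset + buf.position()) < 0) break;
                }
                buf.flip();
                return buf;
            } catch (IOException e) {
                throw new CompletionException(e);
            }
        }, pool);
    }

    @Override
    public void close() throws IOException {
        pool.shutdown();
        channel.close();
    }
}

public class AsyncReadDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("async-demo", ".bin");
        Files.write(tmp, "hello async disk io".getBytes());
        try (PooledAsyncFileReader reader = new PooledAsyncFileReader(tmp, 4)) {
            ByteBuffer buf = reader.readAt(6, 5).join();
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            System.out.println(new String(out)); // prints "async"
        } finally {
            Files.delete(tmp);
        }
    }
}
```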

Phase 2 - Prototype implementation using IO_Uring and/or LibAIO

Pick one or more native AIO libraries and build prototype implementations; possibly try several and benchmark them on a Linux server with mainnet spec.

Phase 3 - Extend Async API usage up to App Layer

Expose the async API up to the VirtualMap layer so it can be used by the app if we think there is a performance benefit.

Phase 4 - Final implementation of native code for IO_Uring etc.

Make the chosen native implementation production ready.

Formerly the bug was:

DataFileReader is used from a large number of threads concurrently:

  • 20+ Hashing Threads
  • Pre-handle get for modify threads
  • Merging Threads
  • Reconnect Threads
  • Pipeline thread

etc. It used to handle this well before we fixed an issue found by Sonar with ThreadLocal handling. We used to create a new FileChannel for each thread-and-file pair. This was great for performance but would leak file handles over time as files came and went. With thread-local file channels there is no nice way to find every thread that has ever read from a given file when the file is deleted, so that we can clean up and close those file channels.

So the way we do it today is to share one FileChannel per file between all threads that read from that file. On the surface that looked like a great solution, as FileChannel and the underlying pread() Unix system call it uses support multiple concurrent reads. The problem is that FileChannel uses a synchronized method, which means all the reading threads fight over that lock, causing huge contention and slowing things down by at least half.

So the challenge is how to support multiple unknown threads reading from a file concurrently. The nice solution seems to be to memory-map the files, which allows multiple readers and offers other performance benefits. The problem is that the Java memory-map API has limitations, the biggest of which is that there is no API for unmapping a file. There is a workaround using sun.misc.Unsafe, which might work and is often used, but it needs VERY careful handling or it can cause seg-faults. Even if we work around that problem, we hit the problem that a ByteBuffer can only address 2GB of a file at a time because its indexes are integers. This can be worked around by limiting the max file size to 2GB or by using multiple ByteBuffer mappings covering 2GB regions of the file.
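A minimal sketch of the multi-ByteBuffer workaround described above, assuming a hypothetical ChunkedMappedFile wrapper. A tiny chunk size is used so the demo runs on a small file (in practice the chunk would be 2GB); reads spanning a chunk boundary and Unsafe-based unmapping are deliberately left out.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// Sketch: map a file as an array of fixed-size MappedByteBuffer regions and
// translate a 64-bit file offset into (chunk index, 32-bit buffer index).
// Limitations intentionally not handled here: reads that span a chunk
// boundary, and explicit unmapping (which needs sun.misc.Unsafe).
final class ChunkedMappedFile {
    private final MappedByteBuffer[] chunks;
    private final long chunkSize;

    ChunkedMappedFile(Path path, long chunkSize) throws IOException {
        this.chunkSize = chunkSize;
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            int n = (int) ((size + chunkSize - 1) / chunkSize);
            chunks = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long start = i * chunkSize;
                chunks[i] = ch.map(FileChannel.MapMode.READ_ONLY, start,
                        Math.min(chunkSize, size - start));
            }
        }
    }

    byte get(long pos) {
        // Absolute gets don't mutate buffer position, so concurrent readers are safe
        return chunks[(int) (pos / chunkSize)].get((int) (pos % chunkSize));
    }
}

public class ChunkedMapDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("chunked", ".bin");
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(tmp, data);
        // 16-byte chunks for the demo; 2GB in a real implementation
        ChunkedMappedFile f = new ChunkedMappedFile(tmp, 16);
        System.out.println(f.get(42)); // prints 42
        Files.delete(tmp);
    }
}
```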

The nicer solution seems to be the new Java 17 Foreign-Memory Access API (JEP 393, third incubator). It supports mapping files larger than 2GB, controlled unmapping, and safe multithreaded access. This will need to wait until we are building Swirlds with Java 17.
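For reference, that incubating API was later finalized as java.lang.foreign (JDK 22). A sketch using the finalized API rather than the Java 17 incubator names, showing the two properties that matter here: long offsets (no 2GB limit) and deterministic unmapping when the Arena closes.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// Sketch of mapped-file access via the finalized Foreign Function & Memory
// API (JDK 22+). The incubator version discussed above differs in names.
public class SegmentMapDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("segment", ".bin");
        Files.write(tmp, new byte[] {10, 20, 30, 40});
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ);
             Arena arena = Arena.ofConfined()) {
            // Offsets are longs, so files larger than 2GB can be addressed directly
            MemorySegment seg =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            System.out.println(seg.get(ValueLayout.JAVA_BYTE, 2L)); // prints 30
        } // closing the Arena unmaps the file deterministically, no Unsafe needed
        Files.delete(tmp);
    }
}
```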

swirlds-automation commented 1 year ago

There are two issues with FileChannel read performance:

swirlds-automation commented 1 year ago

The low-hanging fruit is to remove the call that opens the file channel in the constructor. Quick debugging and stats collection show that a non-trivial number of DataFileReader instances are constructed and never read from; only the size call is used. Shortly after, the instance is closed.

Thus, an underlying code path like the following is seen (note: these calls are split between the constructor and the close call).

  FileChannel channel = FileChannel.open(path);
  long size = channel.size();
  channel.close();

One slight optimization would be to avoid opening/closing the file channel and just directly use:

Files.size(path)

to fetch the file size.

Profiling shows the following performance characteristics for the various code paths:

  Files.size(path);
  // Profiled: 644336.744 ± 4939.246 ops/s

  FileChannel channel = FileChannel.open(path);
  long size = channel.size();
  channel.close();
  // Profiled: 28231.221 ± 557.278 ops/s

And just simply opening/closing the file:

  FileChannel channel = FileChannel.open(path);
  channel.close();
  // Profiled: 28575.806 ± 659.412 ops/s

author:rjuang, createdAt:2022-06-24T18:59:33Z, updatedAt=2022-06-24T18:59:33Z

swirlds-automation commented 1 year ago

Ray(TODO):

swirlds-automation commented 1 year ago

I created a benchmark to check how DataFileReader behaves under concurrent load (multiple files, multiple threads) using different approaches: FileChannel (current implementation), MappedByteBuffer, and MemorySegment (an incubator-level feature in Java 17).

Here is what benchmark does:

Results are:

| File Size FS | Files F | Threads T | Block Size K | FileChannel s/op | MappedByteBuffer s/op | MemorySegment s/op |
|---|---|---|---|---|---|---|
| 1Gb | 1 | 10 | 4K | 3.1 | 0.9 | 1 |
| 1Gb | 1 | 10 | 16K | 1.3 | 0.8 | 0.8 |
| 1Gb | 1 | 10 | 256K | 0.9 | 0.8 | 0.8 |
| 1Gb | 10 | 10 | 4K | 1.9 | 1.3 | 1.4 |
| 1Gb | 10 | 10 | 16K | 1.2 | 1.2 | 1.1 |
| 1Gb | 10 | 10 | 256K | 1 | 1 | 1 |
| 1Gb | 100 | 10 | 4K | 21 | 26 | 25 |
| 1Gb | 100 | 10 | 16K | 7.1 | 11 | 11 |
| 1Gb | 100 | 10 | 256K | 1.7 | 5 | 5 |
| 1Gb | 200 | 10 | 256K | 2 | 6.5 | 6.8 |

FileChannel is slow when the number of files is low. It may be because of the benchmark logic: it reads FS bytes in total, so if they are all in a single file, it may be slower than if the load is spread across multiple files. With a high number of files, FileChannel performs the best. MappedByteBuffer and MemorySegment performance is roughly equal; they work very well with a single file, but not so great with many files.

Without chunkify, I assume our typical read pattern is closer to 4K bytes than to 256K. With block sizes like this and a medium to high number of files, the performance of all 3 methods is roughly the same.

@OlegMazurov @jasperpotts @rjuang @deepak-swirlds FYI. author:artemananiev, createdAt:2022-09-26T15:02:05Z, updatedAt=2022-09-26T15:08:58Z

swirlds-automation commented 1 year ago

One more thing to consider: MappedByteBuffer and MemorySegment have a fundamental API restriction that all indices are integers, so they can't be used for 2GB+ files. A workaround is to split a large file into 2GB chunks and use them as if they were separate files. This is why I tested 100 and even 200 files in the benchmarks. FileChannel is different: it can be used with huge files. So I tested one more mode: instead of 100 x 1Gb files, I checked how FileChannel performs with 1 x 100Gb file. Surprisingly, there is no difference (at least on my Mac). author:artemananiev, createdAt:2022-09-26T15:35:53Z, updatedAt=2022-09-26T15:35:53Z

swirlds-automation commented 1 year ago

file-reads-bench.zip

author:artemananiev, createdAt:2022-10-17T16:54:53Z, updatedAt=2022-10-17T16:54:53Z

swirlds-automation commented 1 year ago

There appears to be some confusion here as to what needs to be done and where the source of slowness is. I suspect there are some improvements to how the Channel is being used that can yield benefits, however. Memory-mapped files carry quite a bit of overhead in Java, so that may not be the best option, particularly if we want to be deployable on any OS other than Linux (memory-mapped files have wildly varying performance characteristics across other OSes, in my experience).

I would like to do some research into FileChannel and see if there are any options to use channels for access (so we don't have the risks of external memory, or the complexity of segmenting files) while still avoiding synchronized blocks and thread contention. We should be able to use a single read-only channel from a couple hundred virtual threads (or equivalent) without blocking any of them. Specific items to look into (mostly summarized from above):

Doing this properly will likely take quite some time, however, so we'll need to discuss how much time is available, and whether we make some small improvements immediately and work on the larger improvements later, or try to achieve consistent high performance up front. author:jsync-swirlds, createdAt:2023-01-19T18:45:35Z, updatedAt=2023-01-20T15:11:17Z

swirlds-automation commented 1 year ago

Additional items to investigate:

Previous Thoughts:

swirlds-automation commented 1 year ago

deserialize: this parameter is questionable and probably can be removed. readDataItem() has two primary use cases. First, it reads real data, e.g. when the data source needs a leaf or internal node from disk. Second, it's used to pre-load data (into the OS file cache) just before it will be read. Please check VirtualRootNode.warm() for details. author:artemananiev, createdAt:2023-01-20T18:57:23Z, updatedAt=2023-01-20T18:57:23Z

swirlds-automation commented 1 year ago

migrated from: url=https://github.com/swirlds/swirlds-platform/issues/4291 author:jasperpotts, #:4291, createdAt:2021-11-09T01:51:40Z, updatedAt=2023-02-22T02:49:40Z labels=P2,Performance,Improvement,Platform Data Structures,Migration:Data

jasperpotts commented 1 year ago

Research on AIO Native APIs

Linux IO_URING

Linux kernel 5.1 and newer includes the io_uring API, which is a much faster way of doing async network and disk IO.

Video from the creator explaining it in great detail https://youtu.be/-5T4Cjw46ys - direct to the performance results https://youtu.be/-5T4Cjw46ys?t=1496

Slide deck on why URING https://kernel-recipes.org/en/2019/talks/faster-io-through-io_uring/

Article on why URING is so important https://dzone.com/articles/the-backend-revolution-or-why-io-uring-is-so-impor

White Paper https://kernel.dk/io_uring.pdf

Documentation https://unixism.net/loti/ref-liburing/submission.html#c.io_uring_prep_readv

Missing Manuals - io_uring worker pool https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/

liburing

This is a C convenience API over io_uring to make it easier to use, created by the io_uring author and the recommended way to use the io_uring API. https://github.com/axboe/liburing

Java IO_Uring

So far I have found one commercial use of io_uring from Java, in QuestDB. It is also open source and Apache licensed, so a good place to learn from. https://questdb.io/blog/2022/09/12/importing-300k-rows-with-io-uring/ https://github.com/questdb/questdb/blob/fa23bf9503e97090f4f6cb5122029f4e75ea4c1f/core/src/main/java/io/questdb/std/IOURingImpl.java#L32

Panama Uring Java

This looks like a complete implementation of io_uring using Project Panama in Java; the biggest issue is that the code and docs are all in Chinese. https://github-com.translate.goog/dreamlike-ocean/PanamaUring?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp

fs2-io_uring

Looks very interesting; maybe it has some tricks to avoid JNI https://github.com/armanbilge/fs2-io_uring

nio_uring

A Java lib around liburing. It is a single-maintainer code base but seems clean and a good starting point. https://github.com/bbeaupain/nio_uring

Netty Support for IOURING

Netty has an incubator implementation that is quite advanced but does not have support for files. We could propose extending it, maybe? https://netty.io/wiki/google-summer-of-code-ideas-2020.html

LibUV

LibUV is the generic async IO API that Node.js uses. It has had an io_uring backend in progress since 2018, but the patch has not been approved. If it were approved, this would be a nice way to have fast, maintained, cross-platform async IO support.

https://github.com/libuv/libuv

There is a Java wrapper for libuv

https://github.com/webfolderio/libuv-java

They are working on io_uring support in LibUV but struggling with API-style compatibility: whether a function is blocking or polling, has a timeout, etc.

https://github.com/libuv/libuv/pull/2322

LIB_AIO

This is the Linux kernel's async API for IO. It has been shown in benchmarks to be much faster in many cases, especially on cloud servers. It allows a bunch of IO requests to be queued up and answered in any order with async responses. This lets the OS, disk driver, and firmware stack optimize those requests and handle them in parallel. Benchmarks showed maybe 10x gains on Google GCP instances with local RAID SSDs.

There are a couple libaio libraries we could investigate:

https://github.com/zrlio/jaio

https://activemq.apache.org/components/artemis/documentation/1.0.0/libaio.html

https://github.com/apache/activemq-artemis-native/blob/master/src/main/c/org_apache_activemq_artemis_nativo_jlibaio_LibaioContext.c

jaio (above) is a Java AIO lib by IBM; outdated, but it might be handy as a reference or starting point. https://github.com/zrlio/jaio