ikorennoy / jasyncfio

Java asynchronous file I/O based on io_uring Linux interface
Apache License 2.0

Cannot allocate memory after a few tests are successful #73

Closed: JohannesLichtenberger closed this issue 1 year ago

JohannesLichtenberger commented 1 year ago

Hi,

after a few tests in sirix-core pass without any failures, every subsequent test fails because it somehow can't allocate enough memory:

failed to allocate memory for io_uring ring; Cannot allocate memory
java.lang.RuntimeException: failed to allocate memory for io_uring ring; Cannot allocate memory
    at one.jasyncfio.Native.setupIoUring0(Native Method)
    at one.jasyncfio.Native.setupIoUring(Native.java:32)
    at one.jasyncfio.Ring.<init>(Ring.java:43)
    at one.jasyncfio.PollRing.<init>(PollRing.java:23)
    at one.jasyncfio.EventExecutorImpl.<init>(EventExecutorImpl.java:125)
    at one.jasyncfio.EventExecutor$Builder.build(EventExecutor.java:183)
    at one.jasyncfio.EventExecutor.initDefault(EventExecutor.java:205)
    at org.sirix.io.iouring.IOUringStorage.<init>(IOUringStorage.java:62)
    at org.sirix.io.StorageType$5.getInstance(StorageType.java:87)
    at org.sirix.io.StorageType.getStorage(StorageType.java:112)

However, via the debugger I can see that the storage.close() method is executed:

  @Override
  public void close() {
    try {
      if (revisionsOffsetFile != null) {
        revisionsOffsetFile.close().join();
        revisionsOffsetFileEventExecutor.close();
      }
      dataFile.close().join();
      dataFileEventExecutor.close();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

The executors are initialized with the default factory method.
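That is, both executors are created roughly like this (just a sketch; EventExecutor.initDefault() is the factory method visible in the stack trace above, and the field names are taken from the close() method):

  // Presumed current setup: one EventExecutor per file, both created via the default factory.
  EventExecutor dataFileEventExecutor = EventExecutor.initDefault();
  EventExecutor revisionsOffsetFileEventExecutor = EventExecutor.initDefault();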

ikorennoy commented 1 year ago

Hi!

I think I've encountered a similar problem too, but it seems to have been resolved when I made sure I called the close method for the EventExecutor everywhere.

The thing is that the memory for the io_uring queues is allocated under the rlimit memlock limit (RLIMIT_MEMLOCK), which can be quite low on some setups (64 KB); you can check it with ulimit -l. So I recommend increasing its value.

Also, I looked at your implementation of the io_uring based storage. There is definitely no point in creating two EventExecutors; I recommend keeping just one. And to reduce memory consumption, I recommend creating the EventExecutor not with the default parameters but with the entries parameter set to 1024; if that is not enough and strange errors or hangs appear, raise it to 2048, which I think should be enough (the default is 4096).

And most importantly, make sure the close method of EventExecutor is invoked everywhere when SirixDB stops.
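Roughly something like this (only a sketch, assuming the EventExecutor.Builder visible in the stack trace is obtained via EventExecutor.builder() and exposes an entries(...) setter):

  // One shared EventExecutor for the whole storage instead of one per file.
  // entries(1024) shrinks the io_uring queues compared to the default (4096),
  // and with them the memory charged against RLIMIT_MEMLOCK.
  EventExecutor eventExecutor = EventExecutor.builder()
      .entries(1024)
      .build();

  // use the same eventExecutor for both the data file and the revisions-offset file

  // when SirixDB stops:
  eventExecutor.close();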

If the problem doesn't go away, please let me know.

JohannesLichtenberger commented 1 year ago

Hmm, it should definitely be closed, but maybe keeping it alive for the whole session is too long (it is shared by a couple of transactions, for instance). Maybe it should be created per Reader/Writer and closed once they are closed, instead of once the session and the underlying storage are closed?

JohannesLichtenberger commented 1 year ago

I'll make the changes based on your suggestions, too :)

JohannesLichtenberger commented 1 year ago

So, I'm not getting the error anymore. However, I've "fixed" a few issues such that reading and writing are always done in a virtual thread:

  @NotNull
  private IOUringWriter writePageReference(final PageReadOnlyTrx pageReadOnlyTrx, final PageReference pageReference,
      Bytes<ByteBuffer> bufferedBytes, long offset) {
    try {
      // Run the actual write on a virtual thread and block until it completes.
      POOL.submit(() -> writePage(pageReadOnlyTrx, pageReference, bufferedBytes, offset)).get();
      return this;
    } catch (InterruptedException | ExecutionException e) {
      throw new SirixIOException(e);
    }
  }

with POOL declared as:

  ExecutorService POOL = Executors.newVirtualThreadPerTaskExecutor();

However, performance is considerably worse, and the CPUs are now at 100%:

Screenshot from 2022-11-11 18-02-09

ikorennoy commented 1 year ago

"Maybe it should be created per Reader/Writer and closed once they are closed, instead of once the session and the underlying storage are closed?"

No, the EventExecutor must be created once and live for the entire lifetime of the DB instance.

"However, performance is considerably worse, and the CPUs are now at 100%"

I've run the unit tests using io_uring, and there is indeed a significant drop in performance. This may be because there is an overhead to creating the io_uring queue, which becomes significant due to the constant re-creation of the context in the unit tests, but I'm not sure.

If you could provide an example of using SirixDB that does a significant amount of I/O work, I'd be happy to try to work out what's wrong. I've seen that a 10 MB file is used in the benchmarks; it would be nice to have a similar example that is easy to run with a file of several gigabytes.

JohannesLichtenberger commented 1 year ago

I know. There's also the revisions cache, which should be one cache per resource, initialized statically once for the lifetime of the JVM instance (should hopefully be relatively easy to change -- maybe a concurrent map which maps the resource path to the Caffeine caches, roughly as sketched below).
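Something along these lines (only a sketch; the key/value types of the cache are placeholders, not SirixDB's real types):

  import com.github.benmanes.caffeine.cache.Cache;
  import com.github.benmanes.caffeine.cache.Caffeine;
  import java.nio.file.Path;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentMap;

  // Hypothetical holder: one static map for the lifetime of the JVM, keyed by resource path,
  // each resource getting its own Caffeine cache.
  final class RevisionsCaches {
    private static final ConcurrentMap<Path, Cache<Long, Object>> CACHES = new ConcurrentHashMap<>();

    static Cache<Long, Object> forResource(Path resourcePath) {
      // computeIfAbsent builds the per-resource cache lazily and exactly once.
      return CACHES.computeIfAbsent(resourcePath,
          path -> Caffeine.newBuilder().maximumSize(10_000).build());
    }
  }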

I'm unsure where to put the EventExecutor as it should be closed at some point.

Also, look at my current configuration of the EventExecutor in the IOUringStorage class. It might well be stupid.

I'll try to upload the 3.8 GB cityofchicago.json file somewhere. Then you can enable the test in JsonShredderTest::testChicago. You should probably also use the Shenandoah GC (but one param I mentioned in the comment no longer works in Java 19).

I guess the virtual thread stuff is not working correctly with the io_uring library. If you, for instance, comment out the lines with POOL::submit in both the reader and the writer implementation, it might be much faster (but still, I think, slightly worse than the FileChannel or memory-mapped I/O backends).
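For illustration, the writer variant without the POOL::submit indirection would look roughly like this (based on the writePageReference method posted above; just a sketch):

  @NotNull
  private IOUringWriter writePageReference(final PageReadOnlyTrx pageReadOnlyTrx, final PageReference pageReference,
      Bytes<ByteBuffer> bufferedBytes, long offset) {
    // Call writePage directly on the calling thread instead of submitting it to the
    // virtual-thread pool and blocking on get().
    writePage(pageReadOnlyTrx, pageReference, bufferedBytes, offset);
    return this;
  }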

https://www.file-upload.net/download-15038251/cityofchicago.zip.html

Thanks a lot for all your effort. I appreciate your knowledge and help :+1:

JohannesLichtenberger commented 1 year ago

Let me know if you can download the file. It seems to be a bit weird... Maybe I have to find another file upload service. Also, I wonder how we can run these tests automatically in another test project. I think even Git LFS doesn't support such very large files, or you have to pay a lot.

JohannesLichtenberger commented 1 year ago

I've made the EventExecutor static... one problem might be that it's now never closed, though. Despite the 100% CPU usage being gone, I think latency is still a bit worse than with the other backends.
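If it stays static, one option might be to at least close it in a JVM shutdown hook, roughly like this (just a sketch, not what the code currently does):

  // Hypothetical: a single JVM-wide EventExecutor plus a shutdown hook that closes it.
  static final EventExecutor EVENT_EXECUTOR = EventExecutor.initDefault();

  static {
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      try {
        EVENT_EXECUTOR.close();
      } catch (Exception e) {
        // nothing sensible left to do this late in shutdown
      }
    }));
  }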

JohannesLichtenberger commented 1 year ago

It seems to help a bit to dramatically reduce the time-to-sleep interval for the event loop. The read-write trx in the test commits periodically during an import, that is, it flushes the in-memory trx intent log to disk on every commit in a postorder traversal of the changed page fragments. I'll have to check the performance of the read-only trx again. Nonetheless, I'll have to read the Systems Performance book by Brendan Gregg to figure out what really goes on in such situations.

Apparently the event loop occupies too much CPU time, but the only things done in parallel are reading page fragments if a page has been changed in more than one revision (for instance, due to hash updates for the ancestors of inserted nodes) and serializing the UnorderedKeyValuePage -- and one of 8 cores doesn't sound like that much.

Maybe it would be great to have an initial time-to-sleep value for the event loop which is very low, but to set a much higher value during a commit/sync to disk?

But even after setting the time to sleep to a very low value (1-10 ms), it's about 25 seconds slower than the file channel based implementation (around 3:50 min vs. 4:15 min).

So maybe that's only part of the story. In general, I also want to switch to an off-heap map such as Chronicle Map for the trx intent log, which should hopefully save several seconds in this test (due to much lower garbage).

ikorennoy commented 1 year ago

@JohannesLichtenberger, I ran some tests and created a PR with a configuration that on my PC is consistently slightly faster in the test than the other I/O backends, while seemingly consuming less CPU (definitely not more) :)

JohannesLichtenberger commented 1 year ago

I just wrote a comment in the PR, and I think I somehow have a serious performance bug writing to disk, but I'm not sure why. Maybe I should try the auto-commit after more data has been gathered in memory (currently, I think, 250_000 nodes are gathered in the test case, with at most 1024 nodes per page).

JohannesLichtenberger commented 1 year ago

Disk I/O usage indeed is terrible:

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme0n1         309,50         0,00     10452,00         0,00          0      20904          0
nvme0n1p1         0,00         0,00         0,00         0,00          0          0          0
nvme0n1p2         0,00         0,00         0,00         0,00          0          0          0
nvme0n1p3       309,50         0,00     10452,00         0,00          0      20904          0

It's, on average, maybe about 1.75x better if you disable all options (storing no hashes, no path summary...), but it's still way too low, I guess.

(But this is not about your io_uring lib -- though I couldn't reproduce the roughly 20 sec advantage over the other backends that you mentioned; in my test, switching from the file channel based backend to the io_uring based approach is consistently about 20 sec slower on my machine.) That said, the main issue and insight for me currently is the horrible disk I/O usage :-/

JohannesLichtenberger commented 1 year ago

Output from vmstat 1:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0  14592 7167968 1042708 10715320    0    0     0     0 12989 59878 53  4 43  0  0
 1  0  14592 6860564 1042728 10717444    0    0     8 33008 2955 7607 20  3 77  1  0
 1  0  14592 6656196 1042736 10717436    0    0     0    52 1951 2653 14  2 84  0  0
 2  0  14592 6608824 1042736 10737152    0    0     0     0 11316 45129 49  3 48  0  0
 1  0  14592 6624540 1042752 10750372    0    0     0 33456 3326 19335 21  2 76  1  0
 1  0  14592 6650496 1042752 10750308    0    0     0    80 2329 4072 24  1 76  0  0
 4  0  14592 6613548 1042752 10761696    0    0     0   196 9845 32601 62  2 36  0  0
 1  0  14592 6669560 1042768 10783128    0    0     0 33044 4930 29539 38  3 58  0  0
 1  0  14592 6675864 1042768 10783128    0    0     0     0 3311 6018 29  1 70  0  0
 9  0  14592 6215208 1042768 10783128    0    0     0    12 4164 6709 19  2 79  0  0
 2  0  14592 6142636 1042788 10816328    0    0     0 33072 11459 60453 36  3 60  1  0
 1  0  14592 6185552 1042788 10816328    0    0     0    12 2854 4942 18  1 80  0  0
11  0  14592 6183788 1042796 10816328    0    0     0   220 4007 6843 33  1 66  0  0
 2  0  14592 6092816 1042812 10849172    0    0     0 32968 11970 60387 40  3 56  1  0

And a flame graph with wall-clock time:

Screenshot from 2022-11-16 17-38-09

ikorennoy commented 1 year ago

I'm closing this issue as the discussion is no longer relevant to jasyncfio. I suggest creating a similar issue in the SirixDB repository to investigate further. Once created, please add me as a participant; if I can help in any way, I will comment :)