JohannesLichtenberger opened 1 year ago
The main issue can be seen, for instance, in JsonShredderTest::testChicago
with the following file: https://www.file-upload.net/download-15038251/cityofchicago.zip.html
A small part of the vmstat 1
output while running the test via Gradle in IntelliJ:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 14592 7167968 1042708 10715320 0 0 0 0 12989 59878 53 4 43 0 0
1 0 14592 6860564 1042728 10717444 0 0 8 33008 2955 7607 20 3 77 1 0
1 0 14592 6656196 1042736 10717436 0 0 0 52 1951 2653 14 2 84 0 0
2 0 14592 6608824 1042736 10737152 0 0 0 0 11316 45129 49 3 48 0 0
1 0 14592 6624540 1042752 10750372 0 0 0 33456 3326 19335 21 2 76 1 0
1 0 14592 6650496 1042752 10750308 0 0 0 80 2329 4072 24 1 76 0 0
4 0 14592 6613548 1042752 10761696 0 0 0 196 9845 32601 62 2 36 0 0
1 0 14592 6669560 1042768 10783128 0 0 0 33044 4930 29539 38 3 58 0 0
1 0 14592 6675864 1042768 10783128 0 0 0 0 3311 6018 29 1 70 0 0
9 0 14592 6215208 1042768 10783128 0 0 0 12 4164 6709 19 2 79 0 0
2 0 14592 6142636 1042788 10816328 0 0 0 33072 11459 60453 36 3 60 1 0
1 0 14592 6185552 1042788 10816328 0 0 0 12 2854 4942 18 1 80 0 0
11 0 14592 6183788 1042796 10816328 0 0 0 220 4007 6843 33 1 66 0 0
2 0 14592 6092816 1042812 10849172 0 0 0 32968 11970 60387 40 3 56 1 0
@ikorennoy here's the bug regarding the bad I/O performance.
YourKit Snapshots: gradle-worker-classpath.zip
I believe Gson is already slow parsing the input file. Maybe it's the FileReader,
as it's the only instance that uses a FileInputStream!?
/**
 * Create a new {@link JsonReader} instance on a file.
 *
 * @param path the path to the file
 * @return a {@link JsonReader} instance
 */
public static JsonReader createFileReader(final Path path) {
  checkNotNull(path);
  try {
    final var fileReader = new FileReader(path.toFile());
    final var jsonReader = new JsonReader(fileReader);
    jsonReader.setLenient(true);
    return jsonReader;
  } catch (final FileNotFoundException e) {
    throw new UncheckedIOException(e);
  }
}
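Since the FileReader is under suspicion: java.io.FileReader uses the platform default charset and is conventionally wrapped in a BufferedReader for bulk reads. Below is a minimal, hypothetical sketch (class name and demo file are made up, not SirixDB code) of opening a file via Files.newBufferedReader, which adds buffering and pins UTF-8 in one call:

```java
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedReaderSketch {
    // Hypothetical helper: Files.newBufferedReader returns a BufferedReader
    // and pins the charset to UTF-8 instead of the platform default that
    // java.io.FileReader would use.
    static Reader openBuffered(final Path path) throws Exception {
        return Files.newBufferedReader(path, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        final Path tmp = Files.createTempFile("sirix-demo", ".json");
        Files.writeString(tmp, "{\"key\":\"value\"}");
        try (Reader reader = openBuffered(tmp)) {
            final char[] buf = new char[64];
            final int n = reader.read(buf);
            System.out.println(new String(buf, 0, n));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

In createFileReader above, the FileReader could be swapped for such a buffered reader without changing the JsonReader call sites, though whether this is the actual bottleneck here is unverified.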
It might only be slow because of the Sirix code inserting nodes...
@Test
public void testGson() throws IOException {
  final var jsonPath = JSON.resolve("cityofchicago.json");
  final var reader = JsonShredder.createFileReader(jsonPath);
  // Iterate over all nodes.
  while (reader.peek() != JsonToken.END_DOCUMENT) {
    final var nextToken = reader.peek();
    switch (nextToken) {
      case BEGIN_OBJECT -> reader.beginObject();
      case NAME -> reader.nextName();
      case END_OBJECT -> reader.endObject();
      case BEGIN_ARRAY -> reader.beginArray();
      case END_ARRAY -> reader.endArray();
      case STRING, NUMBER -> reader.nextString();
      case BOOLEAN -> reader.nextBoolean();
      case NULL -> reader.nextNull();
      default -> {
      }
    }
  }
}
This reads and parses the file in approximately 20 seconds on my machine, roughly 190 MB/sec, which I guess is okay (not particularly fast, but acceptable).
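As a quick sanity check of that figure (assuming the ~3.8 GB file size mentioned later in the thread; the numbers here are only that estimate):

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        final double fileSizeMb = 3800.0; // assumed ~3.8 GB Chicago test file
        final double seconds = 20.0;      // measured parse time from the test
        System.out.println((long) (fileSizeMb / seconds) + " MB/sec");
    }
}
```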
We've made some progress over here: preorder traversal of the Chicago file now takes around 1:20 min on my machine, and insertion takes approximately 2:20-2:50 min depending on whether a path summary is created. There's likely still potential for improvement, though. Before, reading in preorder took around 2:30 min and import took around 4:00-4:30 min on my machine.
We should also investigate the performance of the coroutine-based descendant axis (which prefetches right siblings during down-traversal in preorder).
A JFR recording with async-profiler (IntelliJ Ultimate) while inserting the Chicago file 5 times in a row (agent options: event=cpu,alloc,lock,cache-misses,interval=1ms,jfrsync=profile)
A sweet spot seems to be to flush a new revision to disk every 262_144 << 3 nodes. With a path summary we're currently at approximately 2:35 min on average for the import across several runs.
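Spelled out, that shift-based flush threshold works out to about two million nodes:

```java
public class FlushThreshold {
    public static void main(String[] args) {
        // 262_144 << 3 is 262,144 * 8, i.e. the revision is flushed
        // roughly every two million inserted nodes.
        final int threshold = 262_144 << 3;
        System.out.println(threshold);
    }
}
```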
With wall clock time enabled (compression seems to be a huge factor): JsonShredderTest_testChicago_2023_01_03_170949.html.tar.gz
We currently face a memory leak...
Hello, @JohannesLichtenberger I would like to help. Is the issue still as the title describes, or has the performance been fixed and the memory leak is now the issue?
I think the main issue was that the page cache max sizes were too high, so the object allocation rate was higher than the GC could clean up. Especially with ZGC, as it's not generational yet, I somehow had bad pause times... Shenandoah was better, but it's also not generational yet.
What's strange is that the io_uring implementation is slower than the FileChannel-based version. Maybe the event loop is an issue somehow...
@Kiriakos1998, do you like to work on this?
https://github.com/sirixdb/sirix/issues/536
Also, this might be interesting: https://github.com/sirixdb/sirix/issues/609
We'd also need JMH tests to better measure performance regarding JSON data... (import/traverse/query)...
Sure I can take a look.
@Kiriakos1998 have you already had some time to look into the issue(s)?
Yes. Although I managed to find the index of the iteration (3 or 4) that skyrockets execution time, I was not able to find the exact line of code, which is a little strange. I can provide more details if needed.
@JohannesLichtenberger To be more specific: as I already mentioned, one iteration skyrockets the execution time, meaning that this specific iteration is responsible for more than 95% of the execution time. But when trying to isolate the execution time of each line of code in the loop, none of them reports any significant execution time.
Maybe it's when reading from the producer / channel kicks in!?
@Kiriakos1998 would you also like to have a look at the io_uring backend?
Sure. Regarding this, I need a Linux environment, correct?
Yes, correct. Then you can try to run JsonShredder::testChicago,
for instance; you need the 3.8 GB test file and have to copy it into the right place in src/test/resources/json, I think.
You may then use a debugger. For IntelliJ there's a plugin from Johannes Bechberger. If you'd like to keep working on SirixDB, I think I have one or two licenses left for IntelliJ Ultimate and the other JetBrains tools (but they should only be used for SirixDB/Brackit, as we got the licenses for free).
I have IntelliJ Ultimate, but I think it will be hard for me to work on this issue. I can create a virtual machine for a Linux environment (my PC runs Windows), but I don't think I'll be able to install IntelliJ there, at least not the Ultimate version. Do you think it's absolutely necessary to investigate this issue?
I may take a look myself again next week :)
Write speed is only about 10 MB/sec, which is rather low and doesn't saturate modern SSDs at all (not even HDDs), and SirixDB should be designed to take advantage of modern flash drive characteristics.
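For comparison against that ~10 MB/sec figure, here's a standalone micro-benchmark sketch (all names are made up; this is unrelated to SirixDB's storage layer and only measures the raw hardware/OS path) that writes 1 MiB chunks sequentially through a FileChannel and reports throughput:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WriteThroughput {
    public static void main(String[] args) throws IOException {
        final Path tmp = Files.createTempFile("write-bench", ".bin");
        final ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // 1 MiB chunks
        final int chunks = 64; // 64 MiB total; increase for a steadier measurement
        final long start = System.nanoTime();
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            for (int i = 0; i < chunks; i++) {
                buffer.clear();
                while (buffer.hasRemaining()) {
                    channel.write(buffer);
                }
            }
            channel.force(true); // include fsync cost, as a durable store would
        }
        final double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%.0f MB/sec%n", chunks / seconds);
        Files.delete(tmp);
    }
}
```

If this harness reports hundreds of MB/sec on the same disk, the 10 MB/sec bottleneck is in the write path above the channel (serialization, compression, event loop), not in the device itself.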