Closed dpsoft closed 7 months ago
Please run mvn verify
and append the changes made by the formatter to the PR. Also please make sure that ./test.sh dpsoft
passes.
@gunnarmorling done!
There's a test failure, please see CI for details.
@gunnarmorling sorry, done!
Getting differences for the 1B rows file now:
Validating calculate_average_dpsoft.sh -- measurements_1B.txt
54c54
< Bloemfontein;-31.8;15.6;61.6
---
> Bloemfontein;-33.2;15.6;62.7
282c282
< Ouarzazate;-26.6;18.9;65.0
---
> Ouarzazate;-28.5;18.9;69.8
306c306
< Prague;-41.9;8.4;56.6
---
> Prague;-41.9;8.4;60.5
357c357
< Tamale;-19.3;27.9;74.7
---
> Tamale;-20.9;27.9;76.8
377c377
< Tromsø;-48.5;2.9;49.7
---
> Tromsø;-48.5;2.9;54.3
410c410
< Zanzibar City;-23.0;26.0;79.8
---
> Zanzibar City;-25.2;26.0;79.8
FAILURE: ./test.sh dpsoft measurements_1B.txt failed
Looking good for the regular 1B file now, but it fails with the 10K key set (see create_measurements3.sh):
java.nio.BufferUnderflowException
at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:157)
at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:137)
at java.base/java.lang.Thread.run(Thread.java:1583)
Running out of heap space now with 10K keys:
Validating calculate_average_dpsoft.sh -- measurements_10K_1B.txt
1c1,10000
< Terminating due to java.lang.OutOfMemoryError: Java heap space
@gunnarmorling It seems like the issue lies in the size of the segments, when running the code on my machine it throws an exception related to the size exceeding Integer.MAX_VALUE, while on the test machine, it fails with an "underflow" error.
I've fixed(i think so) the issue and conducted tests using the following approach:
1) ./create_measurements3.sh 1000000000
2) ./calculate_average_baseline.sh > baseline-3.out //changing the path to the 10K key set in the source code.
3) ./calculate_average_dposft.sh > dpost-3.out //changing the path to the 10K key set in the source code.
and finally:
4) diff baseline-3.out dpsoft-3.out -> nothing!
Similar error as before. Note that test runs on 32 machines, you should be able to reproduce the issue by using that many chunks rather than relying on CPU count on your machine.
Using java version 21.0.2-graal in this shell.
Validating calculate_average_dpsoft.sh -- measurements_10K_1B.txt
Exception in thread "Thread-20" Exception in thread "Thread-27" Exception in thread "Thread-30" java.nio.BufferUnderflowException
at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:164)
at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:144)
at java.base/java.lang.Thread.run(Thread.java:1583)
java.nio.BufferUnderflowException
at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:164)
at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:144)
at java.base/java.lang.Thread.run(Thread.java:1583)
java.nio.BufferUnderflowException
at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:164)
at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:144)
at java.base/java.lang.Thread.run(Thread.java:1583)
@gunnarmorling fixed!
Hum, hum. So all tests pass now, it also passes the 10K keyset, but it's way too fast, finishing in 100ms (current fastest one is 2.7 sec). I feel it may somehow skip portions if the file?
@gunnarmorling I'll take a look
@gunnarmorling Indeed was truncating the file because it was looking at the segment size instead of the file size.
Same thing as before, it seems to skip parts, 100ms for the 10K key set case:
Benchmark 1: timeout -v 300 ./calculate_average_dpsoft.sh 2>&1
Time (mean ± σ): 98.1 ms ± 3.1 ms [User: 144.0 ms, System: 31.2 ms]
Range (min … max): 95.5 ms … 103.0 ms 5 runs
Summary
dpsoft: trimmed mean 0.09729906015333334, raw times 0.09915400582,0.09552674882000001,0.09679519882,0.09594797582,0.10304930282000001
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:00.097 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_dpsoft.java)| 21.0.2-graal | [Diego Parra](https://github.com/dpsoft) | |
@gunnarmorling the file is created with ./create_measurements3.sh 1000000000? and using eight cores for the evaluation?
Looking good now: 00:06.392. Passing for 10K keyset too.
@gunnarmorling Thank you so much for the initiative. I would have liked to have had more free cpu cycles to improve my solution. Perhaps next time! :)
Hehe, yeah, understood. Unfortunately, I had to make a cut-off date to keep the effort for running it in check. Next time :)
Message ID: @.***>
Check List:
[x] Tests pass (
./test.sh <username>
shows no differences between expected and actual outputs)[x] All formatting changes by the build are committed
[x] Your launch script is named
calculate_average_<username>.sh
(make sure to match casing of your GH user name) and is executable[x] Output matches that of
calculate_average_baseline.sh
[x] For new entries, or after substantial changes: When implementing custom hash structures, please point to where you deal with hash collisions (237)
Execution time: 11s
Execution time of reference implementation: 4min 30s
System specs: 8 Core i7 @ 2.80GHz, 16GB