gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java
https://www.morling.dev/blog/one-billion-row-challenge/
Apache License 2.0
6.08k stars 1.83k forks source link

dpsoft: first submission #572

Closed dpsoft closed 7 months ago

dpsoft commented 7 months ago

Check List:

gunnarmorling commented 7 months ago

Please run mvn verify and append the changes made by the formatter to the PR. Also please make sure that ./test.sh dpsoftpasses.

dpsoft commented 7 months ago

@gunnarmorling done!

gunnarmorling commented 7 months ago

There's a test failure, please see CI for details.

dpsoft commented 7 months ago

@gunnarmorling sorry, done!

gunnarmorling commented 7 months ago

Getting differences for the 1B rows file now:

Validating calculate_average_dpsoft.sh -- measurements_1B.txt
54c54
< Bloemfontein;-31.8;15.6;61.6
---
> Bloemfontein;-33.2;15.6;62.7
282c282
< Ouarzazate;-26.6;18.9;65.0
---
> Ouarzazate;-28.5;18.9;69.8
306c306
< Prague;-41.9;8.4;56.6
---
> Prague;-41.9;8.4;60.5
357c357
< Tamale;-19.3;27.9;74.7
---
> Tamale;-20.9;27.9;76.8
377c377
< Tromsø;-48.5;2.9;49.7
---
> Tromsø;-48.5;2.9;54.3
410c410
< Zanzibar City;-23.0;26.0;79.8
---
> Zanzibar City;-25.2;26.0;79.8

FAILURE: ./test.sh dpsoft measurements_1B.txt failed
gunnarmorling commented 7 months ago

Looking good for the regular 1B file now, but it fails with the 10K key set (see create_measurements3.sh):

java.nio.BufferUnderflowException
    at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
    at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
    at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:157)
    at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:137)
    at java.base/java.lang.Thread.run(Thread.java:1583)
gunnarmorling commented 7 months ago

Running out of heap space now with 10K keys:

Validating calculate_average_dpsoft.sh -- measurements_10K_1B.txt
1c1,10000
< Terminating due to java.lang.OutOfMemoryError: Java heap space
dpsoft commented 7 months ago

@gunnarmorling It seems like the issue lies in the size of the segments, when running the code on my machine it throws an exception related to the size exceeding Integer.MAX_VALUE, while on the test machine, it fails with an "underflow" error.

I've fixed(i think so) the issue and conducted tests using the following approach:

1) ./create_measurements3.sh 1000000000
2) ./calculate_average_baseline.sh > baseline-3.out //changing the path to the 10K key set in the source code.
3) ./calculate_average_dposft.sh > dpost-3.out //changing the path to the 10K key set in the source code.

and finally:

4)  diff baseline-3.out dpsoft-3.out -> nothing!
gunnarmorling commented 7 months ago

Similar error as before. Note that test runs on 32 machines, you should be able to reproduce the issue by using that many chunks rather than relying on CPU count on your machine.

Using java version 21.0.2-graal in this shell.
Validating calculate_average_dpsoft.sh -- measurements_10K_1B.txt
Exception in thread "Thread-20" Exception in thread "Thread-27" Exception in thread "Thread-30" java.nio.BufferUnderflowException
    at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
    at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
    at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:164)
    at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:144)
    at java.base/java.lang.Thread.run(Thread.java:1583)
java.nio.BufferUnderflowException
    at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
    at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
    at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:164)
    at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:144)
    at java.base/java.lang.Thread.run(Thread.java:1583)
java.nio.BufferUnderflowException
    at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
    at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
    at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:164)
    at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:144)
    at java.base/java.lang.Thread.run(Thread.java:1583)
dpsoft commented 7 months ago

@gunnarmorling fixed!

gunnarmorling commented 7 months ago

Hum, hum. So all tests pass now, it also passes the 10K keyset, but it's way too fast, finishing in 100ms (current fastest one is 2.7 sec). I feel it may somehow skip portions if the file?

dpsoft commented 7 months ago

@gunnarmorling I'll take a look

dpsoft commented 7 months ago

@gunnarmorling Indeed was truncating the file because it was looking at the segment size instead of the file size.

gunnarmorling commented 7 months ago

Same thing as before, it seems to skip parts, 100ms for the 10K key set case:

Benchmark 1: timeout -v 300 ./calculate_average_dpsoft.sh 2>&1
  Time (mean ± σ):      98.1 ms ±   3.1 ms    [User: 144.0 ms, System: 31.2 ms]
  Range (min … max):    95.5 ms … 103.0 ms    5 runs

Summary
  dpsoft: trimmed mean 0.09729906015333334, raw times 0.09915400582,0.09552674882000001,0.09679519882,0.09594797582,0.10304930282000001

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:00.097 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_dpsoft.java)| 21.0.2-graal | [Diego Parra](https://github.com/dpsoft) |  |
dpsoft commented 7 months ago

@gunnarmorling the file is created with ./create_measurements3.sh 1000000000? and using eight cores for the evaluation?

gunnarmorling commented 7 months ago

Looking good now: 00:06.392. Passing for 10K keyset too.

dpsoft commented 7 months ago

@gunnarmorling Thank you so much for the initiative. I would have liked to have had more free cpu cycles to improve my solution. Perhaps next time! :)

gunnarmorling commented 7 months ago

Hehe, yeah, understood. Unfortunately, I had to make a cut-off date to keep the effort for running it in check. Next time :)

Message ID: @.***>