gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java
https://www.morling.dev/blog/one-billion-row-challenge/
Apache License 2.0
6.3k stars 1.88k forks source link

Incorrect results with short city names #276

Closed viliam-durina closed 10 months ago

viliam-durina commented 10 months ago

https://github.com/gunnarmorling/1brc/blob/fa1ca65bfd608fb117181de07320d8b568d30f12/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java#L145

@royvanrijn

This code in the (currently) best solution will give incorrect results if there happen to be two semicolons in a single word. This is possible since a minimal record is 6 bytes, e.g.:

a;0.0
a;0.0
...
AlexanderYastrebov commented 10 months ago

I've added short test cases #277 which royvanrijn passes. Maybe you could provide a test case to show the problem?

hundredwatt commented 10 months ago

This is also covered in an existing test case: https://github.com/gunnarmorling/1brc/blob/fa1ca65bfd608fb117181de07320d8b568d30f12/src/test/resources/samples/measurements-complex-utf8.txt#L15-L16

viliam-durina commented 10 months ago

My fault, I didn't actually run the code at all. I only assumed that the hash can be calculated differently for the same city name, and that it must lead to incorrect results.

Today I debugged and indeed two entries can be created for the same city in MeasurementRepository. But it's resolved in the final combine step when the String city field of MeasurementRepository.Entry is used as the key for the TreeMap, and this field is calculated correctly, so the end results are correct.