ifnesi / 1brc

Gunnar's 1 Billion Row Challenge (Python)
77 stars 83 forks source link

1BRC: One Billion Row Challenge in Python

Python implementation of Gunnar's 1 billion row challenge:

Creating the measurements file with 1B rows

First install the Python requirements:

python3 -m pip install -r requirements.txt

The script createMeasurements.py will create the measurement file:

usage: createMeasurements.py [-h] [-o OUTPUT] [-r RECORDS]

Create measurement file

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Measurement file name (default is "measurements.txt")
  -r RECORDS, --records RECORDS
                        Number of records to create (default is 1_000_000_000)

Example:

% python3 createMeasurements.py
Creating measurement file 'measurements.txt' with 1,000,000,000 measurements...
100%|█████████████████████████████████████████| 100/100 [01:15<00:00,  1.32it/s]
Created file 'measurements.txt' with 1,000,000,000 measurements in 75.86 seconds

Be patient as it can take more than a minute to have the file generated.

Maybe as another challenge is to speed up the generation of the measurements file :slightly_smiling_face:

Performance (on a MacBook Pro M1 32GB)

Interpreter Script user system cpu total
python3 calculateAveragePolars.py 77.84 3.64 703% 11.585
pypy3 calculateAveragePypy.py 139.15
135.25
3.02s
2.92
699%
735%
20.323
18.782
python3 calculateAverageDuckDB.py 186.78 4.21 806% 23.673
pypy3 calculateAverage.py 284.90
242.89
9.12
6.28
749%
780%
39.236
31.926
python3 calculateAverage.py 378.54
329.20
6.94
3.77
747%
793%
51.544
41.941
python3 calculateAveragePypy.py 573.77
510.93
2.70
1.88
787%
793%
73.170
64.660

The script calculateAveragePolars.py was suggested by Taufan on this post.

The script calculateAveragePypy.py was created by donalm, a +2x improved version of the initial script (calculateAverage.py) when running in pypy3, even capable of beating the implementation using DuckDB calculateAverageDuckDB.py.

Olivier Scalbert has made a simple but incredible suggestion where performance increased by an average of 15% (table above has been updated), thank you :slightly_smiling_face:

His suggestions were to change from:

if measurement < result[location][0]:
    result[location][0] = measurement
if measurement > result[location][1]:
    result[location][1] = measurement
result[location][2] += measurement
result[location][3] += 1

to:

_result = result[location]
if measurement < _result[0]:
    _result[0] = measurement
if measurement > _result[1]:
    _result[1] = measurement
_result[2] += measurement
_result[3] += 1

Python can be surprising sometimes.

Compare results

Run compare.sh if you want to check that all the scripts produce the same output.