1BRC: One Billion Row Challenge in Python

Python implementation of Gunnar's 1 billion row challenge:

Creating the measurements file with 1B rows

First install the Python requirements:

python3 -m pip install -r requirements.txt

The script createMeasurements.py will create the measurement file:

usage: createMeasurements.py [-h] [-o OUTPUT] [-r RECORDS]

Create measurement file

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Measurement file name (default is "measurements.txt")
  -r RECORDS, --records RECORDS
                        Number of records to create (default is 1_000_000_000)

Example:

% python3 createMeasurements.py
Creating measurement file 'measurements.txt' with 1,000,000,000 measurements...
100%|█████████████████████████████████████████| 100/100 [01:15<00:00,  1.32it/s]
Created file 'measurements.txt' with 1,000,000,000 measurements in 75.86 seconds

Be patient as it can take more than a minute to have the file generated.

Maybe as another challenge is to speed up the generation of the measurements file :slightly_smiling_face:

Performance (on a MacBook Pro M1 32GB)

Interpreter	Script	user	system	cpu	total
python3	calculateAveragePolars.py	77.84	3.64	703%	11.585
pypy3	calculateAveragePypy.py	~~139.15~~ 135.25	~~3.02s~~ 2.92	~~699%~~ 735%	~~20.323~~ 18.782
python3	calculateAverageDuckDB.py	186.78	4.21	806%	23.673
pypy3	calculateAverage.py	~~284.90~~ 242.89	~~9.12~~ 6.28	~~749%~~ 780%	~~39.236~~ 31.926
python3	calculateAverage.py	~~378.54~~ 329.20	~~6.94~~ 3.77	~~747%~~ 793%	~~51.544~~ 41.941
python3	calculateAveragePypy.py	~~573.77~~ 510.93	~~2.70~~ 1.88	~~787%~~ 793%	~~73.170~~ 64.660

The script calculateAveragePolars.py was suggested by Taufan on this post.

The script calculateAveragePypy.py was created by donalm, a +2x improved version of the initial script (calculateAverage.py) when running in pypy3, even capable of beating the implementation using DuckDB calculateAverageDuckDB.py.

Olivier Scalbert has made a simple but incredible suggestion where performance increased by an average of 15% (table above has been updated), thank you :slightly_smiling_face:

His suggestions were to change from:

if measurement < result[location][0]:
    result[location][0] = measurement
if measurement > result[location][1]:
    result[location][1] = measurement
result[location][2] += measurement
result[location][3] += 1

to:

_result = result[location]
if measurement < _result[0]:
    _result[0] = measurement
if measurement > _result[1]:
    _result[1] = measurement
_result[2] += measurement
_result[3] += 1

Python can be surprising sometimes.

Compare results

Run compare.sh if you want to check that all the scripts produce the same output.

ifnesi / 1brc

readme

1BRC: One Billion Row Challenge in Python

Creating the measurements file with 1B rows

Performance (on a MacBook Pro M1 32GB)

Compare results