ifnesi / 1brc

Gunnar's 1 Billion Row Challenge (Python)

Faster create #2

Closed fizmat closed 8 months ago

fizmat commented 8 months ago

Generating the data with the current script takes 1130s on my machine (Apple M1).

I hoped that generating the measurements with numpy in batches would be enough for a good speedup, but writing the records out in a loop is also very slow. Total time: ~760s.
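A minimal sketch of that intermediate step, for illustration only: the measurements are drawn in one vectorized numpy call, but each record is still formatted and written in a Python loop. The station table, the standard deviation, and the `write_batch` helper are assumptions made for this example, not code from the repository.

```python
import io

import numpy as np

# Hypothetical station table (name, mean temperature); not taken from the repository.
STATIONS = [("Hamburg", 9.7), ("Lagos", 26.8), ("Oslo", 5.7)]


def write_batch(out, batch_size, rng):
    """Generate one batch with numpy, then write it record by record."""
    names = np.array([name for name, _ in STATIONS])
    means = np.array([mean for _, mean in STATIONS])

    # Vectorized part: pick a station index and draw a temperature per row.
    idx = rng.integers(0, len(STATIONS), size=batch_size)
    temps = rng.normal(loc=means[idx], scale=10.0)  # scale is an assumed value

    # Per-record part: one Python-level format and write call for every row.
    for name, temp in zip(names[idx], temps):
        out.write(f"{name};{temp:.1f}\n")


buf = io.StringIO()
write_batch(buf, 5, np.random.default_rng(0))
```

The random draws are cheap here; the per-row loop at the end is the part that stays at Python speed no matter how large the batch is.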

I tried using pandas for the CSV output, but it was slower.

Writing CSV with polars is very fast, but converting the numpy array of randomly selected station names into a polars column is slow. Total time: ~480s.

Sampling the stations with polars instead of numpy avoids this slow conversion, so creating the 1brc dataset finally takes just 71s.

ifnesi commented 8 months ago

Hi @fizmat, that is really cool. Thank you very much for your contribution.