ifnesi / 1brc

Gunnar's 1 Billion Row Challenge (Python)

added new implementation using bytearray and memoryview #11

Closed Skazu closed 4 months ago

Skazu commented 4 months ago

Hey there, I've added a new implementation that uses a bytearray and a memoryview to work on a fixed, preallocated memory buffer.

As in the PyPy implementation, the file is distributed across all CPUs via multiprocessing. I create n chunks (where n is the number of CPUs available), adjust each chunk so that it starts and ends on a whole line, and spawn the processes.
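
For readers following along, the chunk alignment described above could look roughly like the sketch below. It is not the code from this PR; the function name `chunk_boundaries` and its parameters are made up for illustration, and it assumes lines are terminated by `\n`.

```python
import os


def chunk_boundaries(path, n_chunks):
    """Return up to n_chunks (start, end) byte ranges covering the file.

    Hypothetical helper: every range except possibly the last one ends
    right after a newline, so no line is split between two workers.
    """
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            target = i * size // n_chunks
            if target <= cuts[-1]:
                continue  # a long line already carried us past this target
            f.seek(target)
            f.readline()  # skip the rest of the line we landed in
            cut = f.tell()
            if cut >= size:
                break
            cuts.append(cut)
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))
```

Each `(start, end)` pair can then be handed to one worker process, for example with `multiprocessing.Pool(os.cpu_count()).starmap(...)`.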

From there on I've changed a lot. I allocate a buffer of configurable size (1024 * 128 bytes in this version) and read the file directly into it via the readinto1(buffer) method. After that I operate on this fixed buffer as much as possible, searching for \n and ; to split the lines. When there is no \n left in the buffer, I read the next part of the file, and repeat until I reach the end.
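
A minimal sketch of that read loop, for illustration only. It is not the code from this PR: the name `iter_records` and its parameters are assumptions, the input is assumed to be well-formed `station;value\n` lines that each fit in the buffer, and each field is copied into a `bytes` object for readability, which the actual implementation may well avoid.

```python
def iter_records(path, start, end, buf_size=128 * 1024):
    """Yield (station, value) byte pairs for the lines in [start, end).

    Hypothetical helper: one bytearray is allocated up front and reused
    for every read, so the working memory stays at buf_size bytes.
    """
    buf = bytearray(buf_size)
    view = memoryview(buf)
    filled = 0  # number of valid bytes at the front of buf
    remaining = end - start
    with open(path, "rb") as f:
        f.seek(start)
        while remaining > 0:
            free = buf_size - filled
            if free == 0:
                raise ValueError("line does not fit into the buffer")
            # Read straight into the free tail of the buffer.
            n = f.readinto1(view[filled:filled + min(free, remaining)])
            if n == 0:
                break  # end of file
            remaining -= n
            filled += n
            pos = 0
            while True:
                nl = buf.find(b"\n", pos, filled)
                if nl == -1:
                    break  # no complete line left, need more data
                sep = buf.find(b";", pos, nl)
                yield bytes(view[pos:sep]), bytes(view[sep + 1:nl])
                pos = nl + 1
            # Move the unfinished line to the front of the buffer.
            leftover = filled - pos
            if leftover:
                view[:leftover] = bytes(view[pos:filled])
            filled = leftover
    if filled:  # last line without a trailing newline
        sep = buf.find(b";", 0, filled)
        yield bytes(view[:sep]), bytes(view[sep + 1:filled])
```

A worker would call this with the byte range of its chunk and fold the yielded values into its per-station statistics.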

On my machine this is even faster than the current PyPy solution, and it has a fixed-size memory footprint. I don't need to disable garbage collection, because no garbage is created. (Even with GC disabled the memory footprint doesn't rise, whereas the PyPy version uses all available RAM until my system freezes.)

I don't know how fast my code is on your reference machine, but I'm really curious to find out. Maybe you can try it out?

ifnesi commented 4 months ago

Hi @Skazu, thank you very much. On my machine your implementation came in 3rd place:

| pypy3 | calculateAveragePypyInputBuffer.py | 145.58 | 5.08 | 670% | 22.475 |

For some reason I am unable to commit the changes to the README.md file; I am trying to understand why.