hosseinmoein / DataFrame

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
https://hosseinmoein.github.io/DataFrame/
BSD 3-Clause "New" or "Revised" License

The statement about "10b rows per column" in README.md needs clarification #333

Open · adrian17 opened this issue 2 days ago

adrian17 commented 2 days ago

The statement currently doesn't pass the smell test: three columns of 10 billion rows each, with 8-byte doubles, add up to ~240 GB. And looking at the implementation, the values are simply stored in a vector, i.e. a single contiguous buffer without any tricks. Can you please either confirm that this number is correct (and explain how this worked while other libraries failed), or fix it?
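
For reference, a minimal sketch of the arithmetic behind that ~240 GB estimate, assuming plain `std::vector<double>` columns with no compression or other tricks (the figures are assumptions taken from the discussion above, not measurements):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint64_t rows  = 10'000'000'000ULL;            // 10 billion rows per column
    const std::uint64_t cols  = 3;                            // three double columns
    const std::uint64_t bytes = rows * cols * sizeof(double); // raw payload, no overhead

    std::printf("%.1f GB\n", static_cast<double>(bytes) / 1e9);  // prints 240.0 GB
    return 0;
}
```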

For a future benchmark, how about this: once you find a data-set size that all three libraries can handle, also measure and write down the peak memory used by the process with standard OS tools? (On my Linux box I used /usr/bin/time -v, but there are probably other ways to do that.)
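
As one alternative to an external tool like /usr/bin/time -v, a small sketch of reading peak resident set size from inside the process itself via getrusage(). This assumes a POSIX system; note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS:

```cpp
#include <sys/resource.h>  // getrusage, RUSAGE_SELF
#include <cstdio>

int main() {
    struct rusage usage {};

    // Query resource usage for the calling process; ru_maxrss is the peak RSS so far.
    if (getrusage(RUSAGE_SELF, &usage) == 0)
        std::printf("peak RSS: %ld (KB on Linux, bytes on macOS)\n",
                    static_cast<long>(usage.ru_maxrss));
    return 0;
}
```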

hosseinmoein commented 1 day ago

I have a MacBook Pro with 96 GB of RAM and an Intel processor. The processor is outdated, since Apple is now on M2/M3 and so on, but otherwise it is a decent-sized notebook. It ran the 10-billion-row benchmark to completion in less than 2 hours.

Maybe the reason Polars didn't run is that I ran it through Python; maybe through Rust it would have run. I ran it through Python because that is more convenient, and I wasn't concerned about the overhead. According to the literature, the overhead is negligible as long as you don't have other Python code processing the data, for example Python loops.

adrian17 commented 1 day ago

The question stands: 96 GB of RAM doesn't sound like it would have handled 240 GB of data, unless everything spilled to swap, which also shouldn't happen, since AFAIK swap isn't unbounded on Macs.

hosseinmoein commented 1 day ago

I answered your question. There is nothing more I can say. I think there is only one way for you to convince yourself.