Also, was DataFrame compiled with -O3 flag?
Thanks
I followed the build instructions in the README, i.e. cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 32, without any other settings or modifications. Hope this answers your question.
Thanks for the PR. Can you please update points 4 and 5 according to your latest data? Point 4 is actually very important, because it measures the processing of data and how much difference the memory layout of the two libraries makes.
Best HM
Point 5 still stands (sorry for not mentioning it before). Regarding point 4, I'm not sure how you got to the 21x factor in the first place. That's why I asked if you could share some thoughts on how these numbers relate to it :).
I used to have a few more lines of code in dataframe_performance.cc/pandas_performance.py that printed the time at the beginning of the program, after loading the data (before calculating means), and at the end. That's identical to what I currently have in dataframe_performance_2.cc/pandas_performance_2.py. That way I knew exactly how long the data loading took compared to calculating the means. You could add those lines back from the xxxx_2 tests to the original tests.
I really don't want to compare the random number generation + data loading part, because both DataFrame and Pandas are using the same underlying libraries. The value added by DataFrame is how it lays out the data and how it iterates through it.
Thank you very much for that insight (didn't know you're timing stuff in those files). I've added another commit, let me know if you have any other thoughts.
Sorry, I wasn't clear in my explanation. I didn't mean for you to run and print stats for dataframe_performance_2.cc/pandas_performance_2.py (you don't need to run dataframe_performance_2.cc/pandas_performance_2.py at all). I meant for you to add the timing printouts to the dataframe_performance.cc/pandas_performance.py source code and then run those.
But I already added the timings. Please pull from master, make, and run dataframe_performance.cc/pandas_performance.py in your environment. Now you can use the additional timing printouts to adjust point 4. Point 4 compares only the time it takes to calculate the means.
Thanks, HM
You were clear. I just quickly compared dataframe_performance.cc and dataframe_performance_2.cc, it looked like you're doing a similar thing, so I just ran the _2 files.
In the new measurements the 1 second resolution is perhaps a bit problematic: the best run from DataFrame is 1s, the best run from Pandas is 11s. Do you want to pursue this further (and we work on e.g. ms timings)? I've updated the README in case you decide to be done with it.
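For reference, a minimal sketch of what ms timings could look like on the Python side, assuming only the standard library. The helper name best_time_ms and the stand-in workload are hypothetical and are not part of pandas_performance.py; the real test would time the mean calculation on the loaded data instead.

```python
import statistics
import time


def best_time_ms(fn, repeats=3):
    """Call fn() `repeats` times and return the best wall-clock time in ms."""
    best_ns = None
    for _ in range(repeats):
        start = time.perf_counter_ns()  # monotonic, nanosecond resolution
        fn()
        elapsed = time.perf_counter_ns() - start
        if best_ns is None or elapsed < best_ns:
            best_ns = elapsed
    return best_ns / 1e6


# Stand-in workload (hypothetical): the mean of a plain list of floats.
data = [float(i) for i in range(100_000)]
best = best_time_ms(lambda: statistics.fmean(data))
print(f"best of 3: {best:.3f} ms")
```

Taking the best of several runs matches how the README numbers were picked (best of 3 consecutive runs), and perf_counter_ns avoids the rounding that whole-second epoch timestamps introduce.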
Raw data:
$ python3 test/pandas_performance.py
Starting ... 1629817412
All memory allocations are done. Calculating means ... 1629817642
8.734577986843534e-06, 1.6487959965745378, 0.9999620449662165
1629817654 ... Done
$ python3 test/pandas_performance.py
Starting ... 1629817655
All memory allocations are done. Calculating means ... 1629817883
6.166675403767268e-05, 1.6487168460770107, 0.9999539627671375
1629817894 ... Done
$ python3 test/pandas_performance.py
Starting ... 1629817895
All memory allocations are done. Calculating means ... 1629818105
2.8210330040178837e-05, 1.6487521860652903, 1.0000517246270497
1629818117 ... Done
$ Release/bin/dataframe_performance
Starting ... 1629818117
All memory allocations are done. Calculating means ... 1629818323
1, 1.64876, 1
1629818331 ... Done
$ Release/bin/dataframe_performance
Starting ... 1629818332
All memory allocations are done. Calculating means ... 1629818535
1, 1.64873, 1
1629818536 ... Done
$ Release/bin/dataframe_performance
Starting ... 1629818538
All memory allocations are done. Calculating means ... 1629818731
1, 1.64882, 1.00002
1629818733 ... Done
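The per-phase durations can be read off the timestamps above. A small sketch of that arithmetic, assuming the log format shown in the runs (the phase_seconds helper is hypothetical; only the timestamp lines are copied in, the printed means are left out):

```python
import re


def phase_seconds(log_text):
    """Return a (load_seconds, means_seconds) tuple for each run in the log."""
    runs = []
    start = mid = None
    for line in log_text.splitlines():
        stamp = re.search(r"\d{10}", line)  # 10-digit epoch seconds
        if stamp is None:
            continue
        value = int(stamp.group())
        if line.startswith("Starting"):
            start = value
        elif "Calculating means" in line:
            mid = value
        elif line.rstrip().endswith("Done"):
            runs.append((mid - start, value - mid))
    return runs


pandas_log = """\
Starting ... 1629817412
All memory allocations are done. Calculating means ... 1629817642
1629817654 ... Done
Starting ... 1629817655
All memory allocations are done. Calculating means ... 1629817883
1629817894 ... Done
Starting ... 1629817895
All memory allocations are done. Calculating means ... 1629818105
1629818117 ... Done
"""

dataframe_log = """\
Starting ... 1629818117
All memory allocations are done. Calculating means ... 1629818323
1629818331 ... Done
Starting ... 1629818332
All memory allocations are done. Calculating means ... 1629818535
1629818536 ... Done
Starting ... 1629818538
All memory allocations are done. Calculating means ... 1629818731
1629818733 ... Done
"""

print(phase_seconds(pandas_log))     # [(230, 12), (228, 11), (210, 12)]
print(phase_seconds(dataframe_log))  # [(206, 8), (203, 1), (193, 2)]
```

The second number in each pair is the mean-calculation time that point 4 is about: 11-12s for Pandas and 1-8s for DataFrame, each subject to the 1 second resolution of the timestamps.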
The pandas version used in the performance section of the README is a bit old. I made a few measurements and updated the README with the best run out of 3 consecutive runs (data below). Also, I'm not sure whether point 4 is still valid. You could perhaps update the comment if applicable.
System config: OS: Ubuntu 20.04; CPU: E5-2667 v2; RAM: 128GB; GCC 10.3.0; pandas 1.3.2; numpy 1.21.2
Raw data: