hosseinmoein / DataFrame

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
https://hosseinmoein.github.io/DataFrame/
BSD 3-Clause "New" or "Revised" License

Update perf section in README.md #125

Closed jmakov closed 3 years ago

jmakov commented 3 years ago

The pandas version used in the performance section of the README is a bit old. I made a few measurements and updated the README with the best of 3 consecutive runs (data below). Also, I'm not sure whether point 4 is still valid. You could perhaps update the comment if applicable.

System config: OS: Ubuntu 20.04; CPU: E5-2667 v2; RAM: 128 GB; GCC 10.3.0; pandas 1.3.2; numpy 1.21.2

Raw data:

$time _build/bin/dataframe_performance                                                                                                                                                                                                         
All memory allocations are done. Calculating means ...                                                                                                                                                                                                                                    
1, 1.64872, 0.999962                                                                                                                                                                                                                                                                      

real    3m42.888s                                                                                                                                                                                                                                                                         
user    3m13.994s                                                                                                                                                                                                                                                                         
sys     0m36.339s       

$time _build/bin/dataframe_performance                                                                                                                                                                                                         
All memory allocations are done. Calculating means ...                                                                                                                                                                                                                                    
1, 1.64877, 0.999963                                                                                                                                                                                                                                                                      

real    3m34.241s                                                                                                                                                                                                                                                                         
user    3m14.250s                                                                                                                                                                                                                                                                         
sys     0m25.983s                    

$time _build/bin/dataframe_performance                                                                                                                                                                                                         
All memory allocations are done. Calculating means ...                                                                                                                                                                                                                                    
1, 1.64872, 1.00003                                                                                                                                                                                                                                                                       

real    3m36.994s                                                                                                                                                                                                                                                                         
user    3m16.770s                                                                                                                                                                                                                                                                         
sys     0m28.976s      

$time _build/bin/dataframe_performance_2                                                                                                                                                                                                       
Starting ... 1629741602                                                                                                                                                                                                                                                                   
All memory allocations are done. Calculating means ... 1629741736                                                                                                                                                                                                                         
-0.253701, 1.38843, 0.510716                                                                                                                                                                                                                                                              
1629741749 ... Done                                                                                                                                                                                                                                                                       

real    2m28.666s                                                                                                                                                                                                                                                                         
user    2m2.227s                                                                                                                                                                                                                                                                          
sys     0m39.015s      

$time _build/bin/dataframe_performance_2                                                                                                                                                                                                       
Starting ... 1629741751                                                                                                                                                                                                                                                                   
All memory allocations are done. Calculating means ... 1629741885                                                                                                                                                                                                                         
0.533292, 2.65255, 1.56222                                                                                                                                                                                                                                                                
1629741895 ... Done                                                                                                                                                                                                                                                                       

real    2m26.359s                                                                                                                                                                                                                                                                         
user    2m2.624s                                                                                                                                                                                                                                                                          
sys     0m36.570s  

$time _build/bin/dataframe_performance_2                                                                                                                                                                                                       
Starting ... 1629741897                                                                                                                                                                                                                                                                   
All memory allocations are done. Calculating means ... 1629742025                                                                                                                                                                                                                         
0.477141, 2.10208, 1.6053                                                                                                                                                                                                                                                                 
1629742031 ... Done                                                                                                                                                                                                                                                                       

real    2m15.882s                                                                                                                                                                                                                                                                         
user    2m0.258s                                                                                                                                                                                                                                                                          
sys     0m27.671s    

$time python3 test/pandas_performance.py 
All memory allocations are done. Calculating means ...
-9.988731426750974e-06, 1.6486985707329185, 1.000038273533297

real    5m51.598s
user    3m3.485s
sys     1m26.292s

$time python3 test/pandas_performance.py 
All memory allocations are done. Calculating means ...
1.2071609111906213e-05, 1.6486945894947962, 1.000033309466849

real    6m8.130s
user    3m7.725s
sys     1m16.464s

$time python3 test/pandas_performance.py 
All memory allocations are done. Calculating means ...
2.6262100931969532e-05, 1.6488067975370642, 0.9999602129977767

real    6m8.272s
user    3m6.618s
sys     1m15.380s

$time python3 test/pandas_performance_2.py 
Starting ... 1629745011
All memory allocations are done. Calculating means ... 1629745153
-0.6721524139164776, 0.4607973248383193, 2.261591932580144
1629745302 .. Done

real    4m51.379s
user    3m55.702s
sys     0m56.192s

$time python3 test/pandas_performance_2.py 
Starting ... 1629745302
All memory allocations are done. Calculating means ... 1629745443
0.3304210011701104, 0.6970202347018191, 1.4390334541989027
1629745591 .. Done

real    4m49.716s
user    3m53.747s
sys     0m56.071s

$time python3 test/pandas_performance_2.py 
Starting ... 1629745592
All memory allocations are done. Calculating means ... 1629745758
-0.7551205808516117, 1.2796425714614599, 0.9864457281940499
1629745911 .. Done

real    5m19.508s
user    4m3.259s
sys     1m16.855s
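
The "best of 3" figures can be recomputed from the `real` lines above. The following is a minimal sketch, not part of the repository: the helper names `to_seconds` and `best_run` are made up for illustration, and the durations are copied by hand from the output above. Note that these are whole-program wall-clock times, so they include memory allocation and data generation as well as the means calculation.

```python
import re

# "real" wall-clock times copied from the runs above.
REAL_TIMES = {
    "dataframe_performance": ["3m42.888s", "3m34.241s", "3m36.994s"],
    "pandas_performance.py": ["5m51.598s", "6m8.130s", "6m8.272s"],
    "dataframe_performance_2": ["2m28.666s", "2m26.359s", "2m15.882s"],
    "pandas_performance_2.py": ["4m51.379s", "4m49.716s", "5m19.508s"],
}


def to_seconds(stamp):
    """Convert a `time`-style duration such as '3m42.888s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def best_run(stamps):
    """Return the fastest (smallest) wall-clock time in seconds."""
    return min(to_seconds(s) for s in stamps)


# Best run per program, in seconds.
best = {name: best_run(stamps) for name, stamps in REAL_TIMES.items()}

if __name__ == "__main__":
    for name, seconds in best.items():
        print(f"{name}: best of 3 = {seconds:.3f} s")
    ratio_1 = best["pandas_performance.py"] / best["dataframe_performance"]
    ratio_2 = best["pandas_performance_2.py"] / best["dataframe_performance_2"]
    print(f"pandas / DataFrame, test 1: {ratio_1:.2f}x")
    print(f"pandas / DataFrame, test 2: {ratio_2:.2f}x")
```

With the numbers above, the whole-program ratios come out at roughly 1.64x for the first test and 2.13x for the second.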

jmakov commented 3 years ago

> Also, was DataFrame compiled with -O3 flag?
>
> Thanks

I followed the build instructions in the README, i.e. cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 32, without any other settings or modifications. Hope this answers your question.

jmakov commented 3 years ago

> Thanks for the PR. Can you please update points 4 and 5 according to your latest data? Point 4 is actually very important, because it measures the processing of data and how much difference the memory layouts of the two libraries make.
>
> Best, HM

Point 5 still stands (sorry for not mentioning it before). Regarding point 4, I'm not sure how you got to the 21x factor in the first place, that's why I asked if you could also share some thoughts on these numbers about that :).

hosseinmoein commented 3 years ago

> Point 5 still stands (sorry for not mentioning it before). Regarding point 4, I'm not sure how you got to the 21x factor in the first place, that's why I asked if you could also share some thoughts on these numbers about that :).

I used to have a few more lines of code in dataframe_performance.cc/pandas_performance.py that printed the time at the beginning of the program, after loading the data (before calculating means), and at the end. That's identical to what I currently have in dataframe_performance_2.cc/pandas_performance_2.py. That way I knew exactly how long the data loading took compared to calculating means. You could add those lines back from the xxxx_2 tests to the original tests.

I really don't want to compare the random number generation + data loading part, because both DataFrame and Pandas are using the same underlying libraries. The value added by DataFrame is how it lays out the data and how it iterates through it.
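
The three-checkpoint pattern described here can be sketched as follows. This is an illustration only and not the code in the repository's test files: the function `run_benchmark`, its parameters, and the workload (three random columns and their means, standard library only) are stand-ins chosen so the example runs on its own. The printed lines imitate the format seen in the raw output of this thread.

```python
import random
import statistics
import time


def run_benchmark(n=100_000, clock=time.time):
    """Time a load phase and a compute phase separately.

    The workload is a stand-in, not the repository's performance test.
    """
    start = clock()
    print(f"Starting ... {int(start)}")

    # Phase 1: allocate and fill the columns (excluded from the comparison).
    rng = random.Random(42)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(3)]

    loaded = clock()
    print(f"All memory allocations are done. Calculating means ... {int(loaded)}")

    # Phase 2: the part whose duration is actually being compared.
    means = [statistics.fmean(col) for col in columns]
    print(", ".join(str(m) for m in means))

    done = clock()
    print(f"{int(done)} ... Done")

    return {"load": loaded - start, "compute": done - loaded, "means": means}


if __name__ == "__main__":
    result = run_benchmark()
    print(f"load: {result['load']:.3f} s, compute: {result['compute']:.3f} s")
```

Passing the clock in as a parameter is a design choice for the sketch: it lets the phase arithmetic be checked with a fake clock, and makes it easy to swap in a higher-resolution timer.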

jmakov commented 3 years ago

Thank you very much for that insight (didn't know you're timing stuff in those files). I've added another commit, let me know if you have any other thoughts.

hosseinmoein commented 3 years ago

> Thank you very much for that insight (didn't know you're timing stuff in those files). I've added another commit, let me know if you have any other thoughts.

Sorry, I wasn't clear in my explanation. I didn't mean for you to run and print stats for dataframe_performance_2.cc/pandas_performance_2.py (you don't need to run those at all). I meant for you to add the timing printouts to the dataframe_performance.cc/pandas_performance.py source code and then run them. But I have already added the timings. Please pull from master, make, and run dataframe_performance.cc/pandas_performance.py in your environment. You can then use the additional timing printouts to adjust point 4. Point 4 compares only the time it takes to calculate means.

Thanks, HM

jmakov commented 3 years ago

You were clear. I just quickly compared dataframe_performance.cc and dataframe_performance_2.cc; it looked like you were doing a similar thing, so I just ran the _2 files. In the new measurements the 1-second resolution is perhaps a bit problematic: the best DataFrame run is 1 s and the best Pandas run is 11 s. Do you want to pursue this further (and work on e.g. ms timings)? I've updated the README in case you decide to be done with it.
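
To illustrate why the 1-second resolution matters, here is a small worked example. It assumes the printed timestamps are whole-second epoch values (truncated or rounded), in which case a reading of d seconds only pins the real duration down to somewhere between d - 1 and d + 1 seconds. The function names are hypothetical, and `time_ms` merely shows one way a millisecond-level measurement could be taken in Python with `time.perf_counter`.

```python
import time


def true_duration_bounds(reading_s):
    """Open interval (lo, hi) that contains the real duration, given the
    difference of two timestamps with whole-second resolution."""
    return (max(0, reading_s - 1), reading_s + 1)


def ratio_bounds(slow_reading_s, fast_reading_s):
    """Bounds on slow/fast when both readings have 1 s resolution."""
    slow_lo, slow_hi = true_duration_bounds(slow_reading_s)
    fast_lo, fast_hi = true_duration_bounds(fast_reading_s)
    lower = slow_lo / fast_hi
    # If the fast run could have taken almost no time, the ratio is unbounded.
    upper = slow_hi / fast_lo if fast_lo > 0 else float("inf")
    return (lower, upper)


def time_ms(fn):
    """Run fn once and return the elapsed time in milliseconds."""
    t0 = time.perf_counter()
    fn()
    return (time.perf_counter() - t0) * 1000.0


if __name__ == "__main__":
    # Readings quoted in the comment: Pandas 11 s, DataFrame 1 s.
    print(ratio_bounds(11, 1))
    print(f"{time_ms(lambda: sum(range(1_000_000))):.3f} ms")
```

Under that assumption, the 11 s and 1 s readings only establish that the ratio is somewhere above 5x. The nominal 11x is one point in a wide range, which is the argument for finer-grained timing.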

Raw data:

$ python3 test/pandas_performance.py
Starting ... 1629817412
All memory allocations are done. Calculating means ... 1629817642
8.734577986843534e-06, 1.6487959965745378, 0.9999620449662165
1629817654 ... Done

$ python3 test/pandas_performance.py
Starting ... 1629817655
All memory allocations are done. Calculating means ... 1629817883
6.166675403767268e-05, 1.6487168460770107, 0.9999539627671375
1629817894 ... Done

$ python3 test/pandas_performance.py
Starting ... 1629817895
All memory allocations are done. Calculating means ... 1629818105
2.8210330040178837e-05, 1.6487521860652903, 1.0000517246270497
1629818117 ... Done

$ Release/bin/dataframe_performance
Starting ... 1629818117
All memory allocations are done. Calculating means ... 1629818323
1, 1.64876, 1
1629818331 ... Done

$ Release/bin/dataframe_performance
Starting ... 1629818332
All memory allocations are done. Calculating means ... 1629818535
1, 1.64873, 1
1629818536 ... Done

$ Release/bin/dataframe_performance
Starting ... 1629818538
All memory allocations are done. Calculating means ... 1629818731
1, 1.64882, 1.00002
1629818733 ... Done
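
The time spent only on calculating means can be read off the logs above by subtracting the "Calculating means" timestamp from the "Done" timestamp. A minimal sketch of that arithmetic, with the relevant log lines copied from the runs above (the regular expressions and the name `means_durations` are illustrative, not part of the repository):

```python
import re

PANDAS_LOG = """\
All memory allocations are done. Calculating means ... 1629817642
1629817654 ... Done
All memory allocations are done. Calculating means ... 1629817883
1629817894 ... Done
All memory allocations are done. Calculating means ... 1629818105
1629818117 ... Done
"""

DATAFRAME_LOG = """\
All memory allocations are done. Calculating means ... 1629818323
1629818331 ... Done
All memory allocations are done. Calculating means ... 1629818535
1629818536 ... Done
All memory allocations are done. Calculating means ... 1629818731
1629818733 ... Done
"""

MEANS_START = re.compile(r"Calculating means \.\.\. (\d+)")
MEANS_END = re.compile(r"^(\d+) \.+ Done", re.MULTILINE)


def means_durations(log):
    """Return the means-phase duration in seconds for each run in a log."""
    starts = [int(ts) for ts in MEANS_START.findall(log)]
    ends = [int(ts) for ts in MEANS_END.findall(log)]
    if len(starts) != len(ends):
        raise ValueError("unbalanced start/end timestamps")
    return [end - start for start, end in zip(starts, ends)]


pandas_runs = means_durations(PANDAS_LOG)
dataframe_runs = means_durations(DATAFRAME_LOG)

if __name__ == "__main__":
    print("pandas:", pandas_runs, "best:", min(pandas_runs))
    print("DataFrame:", dataframe_runs, "best:", min(dataframe_runs))
```

This gives 12, 11, and 12 s for Pandas and 8, 1, and 2 s for DataFrame, hence the best runs of 11 s and 1 s. The spread across the three DataFrame runs is large relative to the best value.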