harsha-simhadri / big-ann-benchmarks

Framework for evaluating ANNS algorithms on billion scale datasets.
https://big-ann-benchmarks.com
MIT License

Potential performance Issue: Slow read_csv() Function with pandas 2.0.0 #281

Open TendouArisu opened 4 months ago

TendouArisu commented 4 months ago

Issue Description:

Hello. I have found a performance degradation in the read_csv function in pandas versions below 2.0.1. Some parts of this repository depend on pandas 2.0.0 in requirements_py3.10.txt, and some other dependencies require pandas below 2.0.1. I am not sure whether this pandas performance problem affects this repository. The problem is discussed on the pandas GitHub in #52546 and #52548. I also found that eval/show_operating_points.py uses the affected API, and other files may use it as well.
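To see which other files call read_csv, something like the following could be run from the repository root. This is only a rough sketch: the helper names scan_source and scan_tree are made up for illustration, and it is a plain text search rather than a real parser, so comments or strings that mention read_csv( will also be reported.

```python
import re
from pathlib import Path

# Matches text that looks like a call: pd.read_csv(...) or read_csv(...).
# This is a plain text search, so comments and strings can match too.
READ_CSV_CALL = re.compile(r"\bread_csv\s*\(")


def scan_source(text: str) -> list[int]:
    """Return the 1-based line numbers in `text` that appear to call read_csv."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if READ_CSV_CALL.search(line)
    ]


def scan_tree(root: str) -> dict[str, list[int]]:
    """Map each .py file under `root` to the line numbers that mention read_csv(."""
    hits: dict[str, list[int]] = {}
    for path in sorted(Path(root).rglob("*.py")):
        lines = scan_source(path.read_text(encoding="utf-8", errors="replace"))
        if lines:
            hits[str(path)] = lines
    return hits


# Example usage (hypothetical invocation from the repository root):
#   for filename, linenos in scan_tree(".").items():
#       print(filename, linenos)
```

Any file it reports would still need a manual look to confirm it really reads CSV data on a path where speed matters.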

Suggestion

I would recommend upgrading to pandas >= 2.0.1, or exploring other ways to improve read_csv performance. Any other workarounds or solutions would be greatly appreciated. Thank you!