Hello.
I have discovered a performance degradation in the .loc function of pandas 2.0.3 when it handles large DataFrames with non-unique indexes. When more than 4 index labels are looked up, the time taken by .loc increases drastically, by around 1000 times. I noticed that hi-ml-cpath/environment.yml depends on pandas 2.0.3, but I am not sure whether this pandas performance problem affects this repository. There are related discussions on GitHub, including #54550 and #54746.
I also found that hi-ml-cpath/other/slide_image_loading/src/Histopathology/datasets/panda_dataset.py and hi-ml-cpath/src/health_cpath/datasets/panda_tiles_dataset.py use the affected API. Other files may use it as well.
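To check whether a given environment is affected, something like the following rough timing sketch could be used. It is not taken from this repository: the helper name time_loc, the synthetic data, and the row counts are all assumptions chosen for illustration, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd


def time_loc(n_keys, n_rows=200_000, repeats=3):
    """Return the best-of-`repeats` time in seconds for df.loc[<list of n_keys labels>].

    Hypothetical helper: builds a synthetic DataFrame with a non-unique
    integer index and times a single list-based .loc lookup.
    """
    rng = np.random.default_rng(0)
    # Labels are drawn with replacement, so the index is non-unique.
    index = rng.integers(0, max(n_rows // 10, 1), size=n_rows)
    df = pd.DataFrame({"value": np.arange(n_rows)}, index=index)
    # Pick labels that are guaranteed to exist in the index.
    keys = list(pd.unique(index)[:n_keys])
    return min(timeit.repeat(lambda: df.loc[keys], number=1, repeat=repeats))


if __name__ == "__main__":
    print("pandas", pd.__version__)
    for n in (1, 4, 5, 8):
        print(f"{n} labels: {time_loc(n):.6f} s")
```

If the time jumps sharply between 4 and 5 labels, the installed pandas version is probably affected.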
Suggestion
I would recommend upgrading to pandas >= 2.1, or exploring other ways to optimize the performance of .loc.
Any other workarounds or solutions would be greatly appreciated.
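One possible workaround, if upgrading is not an option, is to select rows with a boolean mask instead of a list-based .loc lookup. The sketch below is only an assumption on my side: select_rows_by_labels is a hypothetical helper, I have not benchmarked it against the affected version, and its semantics differ from .loc (rows come back in the DataFrame's own order, and missing labels are ignored instead of raising KeyError).

```python
import pandas as pd


def select_rows_by_labels(df, labels):
    """Return every row whose index label is in `labels`, using a boolean mask.

    Hypothetical helper: unlike df.loc[labels], the result keeps the
    DataFrame's row order and silently skips labels that are not present.
    """
    return df[df.index.isin(labels)]


# Small example with a non-unique index.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "a", "c"])
subset = select_rows_by_labels(df, ["a", "c"])
print(subset)
```

Whether this is acceptable depends on whether the calling code relies on the ordering or the KeyError behaviour of .loc.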
Thank you!