owid / co2-data

Data on CO2 and greenhouse gas emissions by Our World in Data
https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions

Potential performance issue: concat is slow in pandas versions below 2.1 #43

Closed TendouArisu closed 6 months ago

TendouArisu commented 8 months ago

Issue Description:

Hello. I have found a performance degradation in the .concat function of pandas 1.5.2, and I noticed that this repository depends on pandas 1.5.2 in scripts/requirements.txt. I am not sure whether this pandas problem affects this repository. The problem is discussed on the pandas GitHub tracker, including in #50652 and #52685. scripts/make_dataset.py uses the affected API, and other files may use it as well.

Suggestion

I would recommend upgrading to pandas >= 2.1, or exploring other ways to improve the performance of .concat. Any other workarounds or solutions would be greatly appreciated. Thank you!
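To check whether the slowdown matters in practice, one option is to time a concat call under the installed pandas version. The following is only a rough sketch: the helper names (`pandas_version_tuple`, `time_concat`) and the `year`/`co2` columns are hypothetical and not taken from scripts/make_dataset.py, and whether the slowdown shows up at all will likely depend on the shape of the real data.

```python
import time

import pandas as pd


def pandas_version_tuple(version: str = pd.__version__) -> tuple:
    """Return (major, minor) parsed from a pandas version string."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def time_concat(n_frames: int = 200, n_rows: int = 50) -> float:
    """Time a single pd.concat over many small synthetic frames (seconds)."""
    # Synthetic stand-in data; the column names are illustrative only.
    frames = [
        pd.DataFrame({"year": range(n_rows), "co2": [float(i)] * n_rows})
        for i in range(n_frames)
    ]
    start = time.perf_counter()
    combined = pd.concat(frames, ignore_index=True)
    elapsed = time.perf_counter() - start
    # Sanity check: no rows were lost or duplicated.
    assert len(combined) == n_frames * n_rows
    return elapsed


if __name__ == "__main__":
    if pandas_version_tuple() < (2, 1):
        print(f"pandas {pd.__version__} is below 2.1, the range this issue is about")
    print(f"concat took {time_concat():.4f} s")
```

Running the same script in two environments, one with pandas 1.5.2 and one with pandas >= 2.1, would give a direct comparison for this workload.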

pabloarosado commented 6 months ago

Hi @TendouArisu, thank you very much for noticing that. This repository is only used to update the CO2 dataset, which happens a few times per year. Most of our processing is done in a different repository, called ETL, where performance is more important (and where we currently use pandas 2.2.1). We may upgrade pandas here too at some point.