owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

Upgrade pandas to 2.2.x #1094

Closed: Marigold closed this issue 5 months ago

Marigold commented 1 year ago

Upgrade to a newer version of pandas, which is likely to include bug fixes and further migration towards Arrow data types.

Pandas 2.2

See: release notes

Pandas 2.1

Pandas 2.0

Pandas 2.0 adds optional Apache Arrow-backed data types (NumPy remains the default backend) and promises some performance improvements, though it might be slower for some operations. We'll have to migrate eventually, but it's probably a good idea to wait until 2.0 is mature enough and has been adopted by the majority of users.

(We had a request for pandas 2.0 in owid-catalog-py)

Marigold commented 1 year ago

Tried pandas 2.0 on etl data://meadow/ihme_gbd/2019/gbd_child_mortality, and its performance is a bit disappointing: the current ETL with pandas 1.x.x takes 55s, while pandas 2.0.1 takes 68s (about 24% slower).

larsyencken commented 1 year ago

We would need Pandas 2.0.x if we wanted to address

by using Apache Arrow types in repacking

Marigold commented 5 months ago

I wanted to at least give it a try to see how much we would have to change. Work in progress is at https://github.com/owid/etl/pull/2468. So far there are no major problems, though some things are annoying (e.g. read_sql with connections).
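
The comment does not say what exactly was annoying about read_sql, so the following is only an illustrative sketch of passing an explicit connection to `pd.read_sql`. It uses the standard-library `sqlite3` driver, which pandas accepts directly; for most other raw DBAPI connections, pandas 2.x emits a warning and recommends a SQLAlchemy connectable instead. The `read_query` helper and the `country` table are hypothetical and not taken from the ETL code base.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def read_query(query: str, con: sqlite3.Connection) -> pd.DataFrame:
    """Hypothetical helper: run a query on an explicitly passed connection."""
    return pd.read_sql(query, con)


# An in-memory SQLite database stands in for the real database (assumption).
with closing(sqlite3.connect(":memory:")) as con:
    con.execute("CREATE TABLE country (name TEXT, population INTEGER)")
    con.executemany("INSERT INTO country VALUES (?, ?)", [("A", 10), ("B", 20)])
    con.commit()
    df = read_query("SELECT name, population FROM country ORDER BY name", con)

print(df)
```

Keeping the connection handling in one helper like this would limit how many call sites need to change if the required connection type differs between pandas versions.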

We'd need to run Datadiff on all datasets to verify that there are no side effects.
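
As a rough idea of what such a check involves, here is a minimal sketch that compares the output of the same step built under two pandas versions. It is not the project's Datadiff tool; `diff_frames` is a hypothetical function, and the sketch assumes both tables have the same rows in the same order. It reports dtype changes separately from value changes, since a pandas upgrade can alter dtypes without altering values.

```python
import pandas as pd


def diff_frames(old: pd.DataFrame, new: pd.DataFrame) -> list[str]:
    """Hypothetical check: list differences between two versions of a table."""
    if list(old.columns) != list(new.columns) or len(old) != len(new):
        return ["shape or column names differ"]
    problems = []
    for col in old.columns:
        a = old[col].reset_index(drop=True)
        b = new[col].reset_index(drop=True)
        if a.dtype != b.dtype:
            problems.append(f"{col}: dtype {a.dtype} -> {b.dtype}")
        # Treat a value missing on both sides as unchanged.
        both_missing = a.isna() & b.isna()
        changed = (a != b) & ~both_missing
        if changed.any():
            problems.append(f"{col}: {int(changed.sum())} value(s) differ")
    return problems


old = pd.DataFrame({"year": [2000, 2001], "deaths": [1.5, None]})
new_same = old.copy()
new_changed = pd.DataFrame({"year": [2000.0, 2001.0], "deaths": [1.5, 2.0]})

print(diff_frames(old, new_same))     # no differences expected
print(diff_frames(old, new_changed))  # one dtype change, one value change
```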