Upgrade pandas to 2.2.x

Marigold commented 1 year ago

Upgrade to a newer version of Pandas, likely to include bug-fixes and more migration towards Arrow data types.

Pandas 2.2

See: release notes

- Supports copy-on-write, which will soon be the default (enabled in advance with pd.options.mode.copy_on_write = True)
- Supports faster pyarrow strings, which will soon be the default (enabled in advance with pd.options.future.infer_string = True)
- We will start to get warnings for things that will be removed or deprecated in Pandas 3.0
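
A minimal sketch of what opting in to the two options above looks like, assuming pandas 2.1 or later. The toy DataFrame and the helper `inferred_string_dtype` are made up for illustration and are not taken from the ETL; the string part is skipped when pyarrow is not installed.

```python
import importlib.util

import pandas as pd

# Opt in to copy-on-write ahead of it becoming the default.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3]})
subset = df["a"]
subset.iloc[0] = 100  # with copy-on-write this modifies only `subset`

print(df["a"].tolist())  # [1, 2, 3]
print(subset.tolist())   # [100, 2, 3]


def inferred_string_dtype():
    """Hypothetical helper: dtype pandas infers for strings when
    future.infer_string is enabled, or None if pyarrow is missing."""
    if importlib.util.find_spec("pyarrow") is None:
        return None
    # option_context keeps the setting local to this block.
    with pd.option_context("future.infer_string", True):
        return pd.Series(["x", "y"]).dtype


# The exact dtype name depends on the pandas version, so it is only printed.
print(inferred_string_dtype())
```

Without copy-on-write the assignment to `subset` could also change `df`, which is the kind of behaviour change we would need to watch for in existing steps.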

Pandas 2.1

See release notes

Pandas 2.0

Pandas 2.0 can use Arrow as a backend format and promises some performance improvements, though it might be slower for some operations. One day we'll have to migrate anyway, but it's probably a good idea to wait until 2.0 is mature enough and has been adopted by the majority of users.

(We had a request for pandas 2.0 in owid-catalog-py)

Marigold commented 1 year ago

Tried pandas 2.0 on the ETL step data://meadow/ihme_gbd/2019/gbd_child_mortality and its performance is a bit disappointing: the current ETL with pandas 1.x.x takes 55s, while pandas 2.0.1 takes 68s (about 24% slower).

larsyencken commented 1 year ago

We would need Pandas 2.0.x if we wanted to address #1334 by using Apache Arrow types in repacking.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Marigold commented 5 months ago

I wanted to at least give it a try to see how much we would have to change. Work in progress is in https://github.com/owid/etl/pull/2468. So far no major problems, though some things are annoying (e.g. read_sql with connections).

We'd need to run Datadiff on all datasets to verify that there are no side effects.
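
As a rough illustration of the kind of check meant here (this is not the actual Datadiff tool, just a sketch using pandas' own testing helper), one could compare a table built before and after the upgrade. The function `frames_match` and the sample data are hypothetical.

```python
import pandas as pd


def frames_match(old: pd.DataFrame, new: pd.DataFrame) -> bool:
    """Return True if two versions of a table hold the same values."""
    try:
        # check_dtype=False compares values only, so a pure dtype change
        # between pandas versions is not reported as a difference.
        pd.testing.assert_frame_equal(old, new, check_dtype=False)
    except AssertionError:
        return False
    return True


# Made-up data standing in for one dataset built under two pandas versions.
before = pd.DataFrame({"country": ["France", "Kenya"], "deaths": [10.0, 20.0]})
after_same = before.copy()
after_changed = before.assign(deaths=[10.0, 21.0])

print(frames_match(before, after_same))     # True
print(frames_match(before, after_changed))  # False
```

Whether dtype changes should count as differences is a judgment call; dropping `check_dtype=False` makes the comparison strict.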
