Closed Marigold closed 5 months ago
I tried pandas 2.0 on etl data://meadow/ihme_gbd/2019/gbd_child_mortality
and its performance is a bit disappointing. The current ETL with pandas 1.x takes 55s, while pandas 2.0.1 takes 68s (about 24% slower).
We would need Pandas 2.0.x if we wanted to address
by using Apache Arrow types in repacking
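For context, here is a minimal sketch of what Arrow-backed types look like in pandas 2.x. It is not the ETL's repacking code: the sample frame and column names are made up, and it assumes pandas >= 2.0 (for the `dtype_backend` argument) with `pyarrow` optionally installed.

```python
import pandas as pd

# Made-up sample data; not taken from any real dataset.
df = pd.DataFrame(
    {
        "country": ["Kenya", "Peru", None],
        "deaths": [10, None, 7],  # stored as float64 because of the missing value
    }
)

try:
    import pyarrow  # noqa: F401  (only needed to check availability)

    # Ask pandas to pick the best nullable dtype per column,
    # backed by Arrow arrays instead of NumPy arrays.
    arrow_df = df.convert_dtypes(dtype_backend="pyarrow")
    print(arrow_df.dtypes)
except ImportError:
    # Without pyarrow the Arrow backend is unavailable.
    arrow_df = None
    print("pyarrow is not installed; skipping the Arrow conversion")
```

Whether this actually helps repacking would need to be measured on real datasets, since Arrow-backed columns can be slower for some operations.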
I wanted to at least give it a try to see how much we would have to change. Work in progress is here: https://github.com/owid/etl/pull/2468. So far no major problems, though some things are annoying (e.g. read_sql with connections).
We'd need to run Datadiff on all datasets to verify that there are not any side effects.
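A rough illustration of the kind of check this implies, comparing a table built under the old pandas with the same table built under the new one. This is not Datadiff itself; `frames_match` and the sample frames are hypothetical, and the tolerance settings are assumptions.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(old: pd.DataFrame, new: pd.DataFrame) -> bool:
    """Return True if both frames hold the same values (hypothetical helper).

    Dtypes and row/column order are ignored, because a pandas upgrade may
    change those without changing the underlying data.
    """
    try:
        assert_frame_equal(old, new, check_dtype=False, check_like=True)
        return True
    except AssertionError:
        return False


# Made-up stand-ins for "built with pandas 1.x" and "built with pandas 2.x".
before = pd.DataFrame({"year": [2018, 2019], "deaths": [1.5, 2.5]})
after_same = before.copy()
after_changed = before.assign(deaths=[1.5, 2.6])

print(frames_match(before, after_same))
print(frames_match(before, after_changed))
```

Ignoring dtypes is a deliberate choice here: a dtype-only change is expected from the upgrade, while a value change would be a real side effect.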
Upgrade to a newer version of Pandas, likely to include bug-fixes and more migration towards Arrow data types.
Pandas 2.2
See: release notes
pd.options.mode.copy_on_write = True
pd.options.future.infer_string = True
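To make the copy-on-write option concrete, a small sketch of how it changes behaviour. It assumes a pandas version where `mode.copy_on_write` exists (1.5 or later, as far as I know); newer releases may warn that the option is deprecated because the behaviour becomes the default. `infer_string` is left out because in pandas 2.x it appears to depend on `pyarrow` being installed.

```python
import pandas as pd

# Enable copy-on-write only inside this block, so the global setting is untouched.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3]})

    # Under copy-on-write, a selected column behaves like an independent copy.
    subset = df["a"]
    subset.iloc[0] = 100  # modifies `subset` only

    parent_value = int(df["a"].iloc[0])
    subset_value = int(subset.iloc[0])

print(parent_value, subset_value)
```

Code that relied on modifying a parent frame through a selected column would silently stop doing so under this option, which is one reason to diff outputs before and after enabling it.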
Pandas 2.1
Pandas 2.0
Pandas 2.0 can use Apache Arrow as a backend format (it is optional; NumPy remains the default) and promises some performance improvements, though it might be slower for some operations. One day we'll have to migrate anyway, but it's probably a good idea to wait until 2.0 becomes mature enough and is adopted by the majority of users.
(We had a request for pandas 2.0 in owid-catalog-py)