nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0
33 stars 13 forks source link

Support pandas version 2 #169

Closed victorlin closed 3 days ago

victorlin commented 5 days ago

In the process of nextstrain/augur#1473, it was noticed that pandas dependency version constraint in this repo is preventing Nextstrain runtimes from resolving to pandas version 2.

https://github.com/nextstrain/fauna/blob/d632af090c19f2601d99223a6c23047858a82a15/requirements.txt#L4

It's possible that the code is already compatible with pandas v2, in which case the fix is as simple as changing that line to pandas >=1.1.5, <3 (also mentioned in https://github.com/nextstrain/augur/issues/1473#issuecomment-2475038728).

Here are some resources for the migration:

joverlee521 commented 4 days ago

Aaand of course this is tied to https://github.com/nextstrain/fauna/issues/113...

Trial run of our flu_upload with Python 3.10 runs into error:

ImportError: Pandas requires version '2.0.1' or newer of 'xlrd' (version '1.2.0' currently installed).
joverlee521 commented 4 days ago

I was able to install packages without any issues because xlrd is an optional dependency that is only checked with pandas[excel]. Installing with pandas[excel] >=1.1.5, <3 and xlrd >=1.0.0, ==1.* keeps pandas at v1.5.3. because the extra excel was only added in v2.

Upgrading xlrd to v2 shouldn't be an issue for sequence uploads since GISAID exports .xls files. However, titer data come as .xlsx files, which does require xlrd v1 with Python 3.7 We can make this version difference explicit with environment markers:

diff --git a/requirements.txt b/requirements.txt
index 5694eed..5964ad0 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,8 +1,10 @@
 biopython >=1.73, ==1.*
 boto >=2.38, ==2.*
 certifi
-pandas >=1.1.5, ==1.*
+pandas[excel] >=2.0.0, <3; python_version>="3.8"
 rethinkdb >=2.4.8, ==2.4.*, !=2.4.10
 requests >=2.20.0, ==2.*
 unidecode >=1.0.22, ==1.*
-xlrd >=1.0.0, ==1.*
+# Use Python 3.7 for tbd uploading scripts
+pandas >=1.1.5, ==1.*; python_version<"3.8"
+xlrd >=1.0.0, ==1.*; python_version<"3.8"

Depending on how many issues we run into with replacing use of xlrd with excelrd (as suggested in https://github.com/nextstrain/fauna/issues/113#issuecomment-1040878351), we can also change the dependencies to

diff --git a/requirements.txt b/requirements.txt
index 5694eed..7cc1d7d 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,8 +1,8 @@
 biopython >=1.73, ==1.*
 boto >=2.38, ==2.*
 certifi
-pandas >=1.1.5, ==1.*
+pandas[excel] >=2.0.0, <3
 rethinkdb >=2.4.8, ==2.4.*, !=2.4.10
 requests >=2.20.0, ==2.*
 unidecode >=1.0.22, ==1.*
-xlrd >=1.0.0, ==1.*
+excelrd >=3.0.0, <4
victorlin commented 4 days ago

Thanks for digging into this @joverlee521! Stepping back a bit, a couple points:

joverlee521 commented 4 days ago

The Nextstrain runtimes don't need to have pandas v2 any time soon.

Gotcha! Didn't want fauna to be a blocker here...

Is it worth considering decoupling Fauna from the Nextstrain runtimes (e.g. using an extended version of the base image when working with Fauna)?

Yes! It might makes sense to move it out of base image since it is only used by seasonal-flu and avian-flu. It should be straightforward to point the automated GH Action workflows to use the extended image. However, it might be extra burden on users to remember to use the correct image when running things manually.

victorlin commented 4 days ago

Decoupling seems like the better option, I will write up a separate issue and link it soon.

Feel free to close this one as not planned.