joverlee521 closed 8 months ago
Tested locally with a fresh install:
$ pyenv virtualenv new-fauna
$ pyenv activate new-fauna
$ pip install -r requirements.txt
...
Successfully installed biopython-1.81 boto-2.49.0 certifi-2023.11.17 charset-normalizer-3.3.2 idna-3.4 numpy-1.21.6 pandas-1.3.5 python-dateutil-2.8.2 pytz-2023.3.post1 requests-2.31.0 rethinkdb-2.4.9 six-1.16.0 unidecode-1.3.7 urllib3-2.0.7 xlrd-1.2.0
Successfully ran the download in the new env:
$ envdir ../env.d/seasonal-flu/ python3 ../fauna/vdb/download.py --database vdb --virus flu --fasta_fields strain virus locus accession collection_date submission_date region country division location passage_category originating_lab submitting_lab age gender --resolve_method split_passage --select locus:ha lineage:seasonal_yam --path data --fstem yam_ha
Connected to the "vdb" database
Downloading documents from the sequence table "flu_sequences" (n=1501466) & virus table "flu_viruses" (n=278243)
Only downloading documents with field 'locus' equal to one of ['ha']
Only downloading documents with field 'lineage' equal to one of ['seasonal_yam']
Downloaded 25602 sequences
Resolving duplicate strains by keeping one cell/direct and one egg sequence
Appends -egg to egg-passaged sequence
Within cell/egg partitions prioritize longest sequence
Outputing 22471 documents to data/yam_ha.fasta
Wrote to data/yam_ha.fasta
--- 3.539134347438812 minutes to download ---
Running docker-base CI to install the latest fauna with its requirements so that the seasonal flu builds can run.
rethinkdb v2.4.10 is broken because the newly added looseversion package was not included in its setup.py: https://github.com/rethinkdb/rethinkdb-python/issues/309
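To check whether an existing environment is exposed to the problem described above, something like the following sketch may help. It assumes Python 3.8+ (for `importlib.metadata`), and the helper names `installed_version` and `is_affected` are hypothetical, not part of fauna or rethinkdb. It only flags the one release named in the linked issue (2.4.10) when `looseversion` is not importable.

```python
from importlib import metadata, util


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_affected(rethinkdb_version, looseversion_available):
    """Hypothetical check: True only for rethinkdb 2.4.10 without looseversion.

    Assumption: 2.4.10 is the only release treated as broken here,
    because it is the only one named in the linked issue.
    """
    return rethinkdb_version == "2.4.10" and not looseversion_available


if __name__ == "__main__":
    version = installed_version("rethinkdb")
    # find_spec returns None when the top-level module cannot be found.
    has_looseversion = util.find_spec("looseversion") is not None
    print(f"rethinkdb version: {version}")
    print(f"looseversion importable: {has_looseversion}")
    print(f"affected: {is_affected(version, has_looseversion)}")
```

In the fresh environment shown earlier, pip resolved rethinkdb-2.4.9, so this check would be expected to report the environment as unaffected.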