nextstrain / avian-flu

Nextstrain build for avian influenza viruses
http://nextstrain.org/avian-flu

ingest: Run Nextclade as part of ingest #44

Open joverlee521 opened 4 weeks ago

joverlee521 commented 4 weeks ago

Follow up to #40

With the recent addition of the community H5 Nextclade datasets in https://github.com/nextstrain/nextclade_data/pull/196, it should now be possible to run Nextclade as part of ingest to assign clades to the H5 sequences.

Maybe this can replace the current manual clade labeling process with clade-labeling scripts?

joverlee521 commented 1 week ago

I'm starting with the community/moncla-lab/iav-h5/ha/all-clades Nextclade dataset since that should work across both fauna and NCBI sequences. Tested manually on the ingest-with-nextclade branch.

NCBI

nextstrain build \
    ingest \
        joined-ncbi/results/nextclade.tsv \
        --configfile build-configs/ncbi/defaults/config.yaml

Almost everything gets assigned to the expected 2.3.4.4b clade, except for 3 sequences that were assigned to clade 0:

Fauna

nextstrain build \
    --envdir ../env.d/seasonal-flu/ \
    ingest \
        fauna/results/nextclade.tsv \
        --configfile build-configs/ncbi/defaults/config.yaml

Since the fauna data covers all avian flu and not just H5, ~30% of the sequences are not assigned to any clade.

See detailed breakdown of counts:

|clade       | count| percent|
|------------|-----:|-------:|
|            | 13371|   30.65|
|0           |   598|    1.37|
|1           |   630|    1.44|
|1.1         |   127|    0.29|
|1.1.1       |    76|    0.17|
|1.1.2       |   224|    0.51|
|2.1.1       |    80|    0.18|
|2.1.2       |    79|    0.18|
|2.1.3       |    55|    0.13|
|2.1.3.1     |    36|    0.08|
|2.1.3.2     |   429|    0.98|
|2.1.3.2a    |   129|    0.30|
|2.1.3.2b    |   133|    0.30|
|2.1.3.3     |    51|    0.12|
|2.2         |   817|    1.87|
|2.2.1       |   555|    1.27|
|2.2.1.1     |   140|    0.32|
|2.2.1.1a    |   117|    0.27|
|2.2.1.2     |   645|    1.48|
|2.2.2       |   184|    0.42|
|2.2.2.1     |    72|    0.17|
|2.3.1       |    21|    0.05|
|2.3.2       |   100|    0.23|
|2.3.2.1     |   217|    0.50|
|2.3.2.1a    |   932|    2.14|
|2.3.2.1b    |    92|    0.21|
|2.3.2.1c    |   211|    0.48|
|2.3.2.1d    |    96|    0.22|
|2.3.2.1e    |   793|    1.82|
|2.3.2.1f    |   559|    1.28|
|2.3.2.1g    |   384|    0.88|
|2.3.3       |    28|    0.06|
|2.3.4       |   574|    1.32|
|2.3.4.1     |    58|    0.13|
|2.3.4.2     |    86|    0.20|
|2.3.4.3     |   206|    0.47|
|2.3.4.4     |   307|    0.70|
|2.3.4.4a    |   246|    0.56|
|2.3.4.4b    | 14625|   33.53|
|2.3.4.4c    |  1268|    2.91|
|2.3.4.4d    |   134|    0.31|
|2.3.4.4e    |   718|    1.65|
|2.3.4.4f    |   173|    0.40|
|2.3.4.4g    |   221|    0.51|
|2.3.4.4h    |   736|    1.69|
|2.4         |     6|    0.01|
|2.5         |    18|    0.04|
|3           |     7|    0.02|
|4           |    26|    0.06|
|5           |    16|    0.04|
|6           |    11|    0.03|
|7           |    69|    0.16|
|7.1         |    20|    0.05|
|7.2         |    60|    0.14|
|8           |     4|    0.01|
|9           |    22|    0.05|
|Am-nonGsGD  |  1263|    2.90|
|EA-nonGsGD  |   769|    1.76|
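
For anyone who wants to reproduce a breakdown like this, here is a minimal Python sketch. It is not the script used for the counts above. It assumes the Nextclade TSV has a `clade` column and that unassigned sequences have an empty value there; `clade_breakdown` and the toy input are hypothetical names and data for illustration only.

```python
import csv
import io
from collections import Counter


def clade_breakdown(tsv_file, clade_column="clade"):
    """Count sequences per clade and return (clade, count, percent) rows.

    `clade_column` is an assumption about the Nextclade TSV header.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # An empty clade cell is kept as "" so unassigned sequences stay visible.
    counts = Counter((row.get(clade_column) or "").strip() for row in reader)
    total = sum(counts.values())
    return [
        (clade, count, round(100 * count / total, 2))
        for clade, count in sorted(counts.items())
    ]


# Toy input for illustration, not real ingest output.
sample_tsv = (
    "seqName\tclade\n"
    "seq1\t2.3.4.4b\n"
    "seq2\t2.3.4.4b\n"
    "seq3\t\n"
    "seq4\t0\n"
)
rows = clade_breakdown(io.StringIO(sample_tsv))
for clade, count, percent in rows:
    print(f"|{clade:<12}|{count:>6}|{percent:>8.2f}|")
```

To run it against a real file, pass an open file handle instead of the `io.StringIO` object, e.g. `open("fauna/results/nextclade.tsv", newline="")`.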

I'm going to join with metadata ~tomorrow~ Thursday and cross check the clades with the existing clades from fauna.

joverlee521 commented 1 week ago

Latest push to the ingest-with-nextclade branch now joins the metadata with the Nextclade output.
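
For illustration only, a join and cross-check along these lines could be sketched in Python as below. This is not the code on the branch. The column names (`strain` and `clade` in the metadata, `seqName` and `clade` in the Nextclade TSV) and the function name `join_with_nextclade` are assumptions, and the records are toy data.

```python
import csv
import io


def join_with_nextclade(metadata_file, nextclade_file,
                        metadata_id="strain", nextclade_id="seqName",
                        existing_clade="clade", nextclade_clade="clade"):
    """Left-join metadata with Nextclade clades and flag agreement.

    All column names are assumptions about the two TSV headers.
    """
    nextclade = {
        row[nextclade_id]: row[nextclade_clade]
        for row in csv.DictReader(nextclade_file, delimiter="\t")
    }
    joined = []
    for row in csv.DictReader(metadata_file, delimiter="\t"):
        # Records without a Nextclade result get an empty clade.
        assigned = nextclade.get(row[metadata_id], "")
        joined.append({
            **row,
            "nextclade_clade": assigned,
            # Note: two empty values also count as a match here.
            "clades_match": row[existing_clade] == assigned,
        })
    return joined


# Toy records for illustration, not real fauna data.
metadata_tsv = "strain\tclade\nA\t2.3.4.4b\nB\t2.3.2.1a\nC\t\n"
nextclade_tsv = "seqName\tclade\nA\t2.3.4.4b\nB\t2.3.2.1c\n"

joined = join_with_nextclade(io.StringIO(metadata_tsv), io.StringIO(nextclade_tsv))
mismatches = [row for row in joined if not row["clades_match"]]
for row in mismatches:
    print(row["strain"], row["clade"], row["nextclade_clade"], sep="\t")
```

Keeping every metadata record (a left join) means sequences that Nextclade could not place still show up in the comparison instead of silently dropping out.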

I took a brief look at the fauna side to compare the Nextclade clades with the existing clades.

Of the 43,642 records