nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License

H9Nx (all lineages), H9Nx (Y-lineage), H9Nx (G-lineage), H9Nx (B-lineage) #1552

Open jurresiegers opened 1 week ago

jurresiegers commented 1 week ago

Hi all,

Would it be possible to get an H9Nx Nextclade build up and running based on the recently published H9 nomenclature paper? The paper includes reference datasets (see Appendices 2 and 3) from GISAID/NCBI for all lineages and for specific sublineages.

https://wwwnc.cdc.gov/eid/article/30/8/23-1176_article

Best, Jurre Siegers

ivan-aksamentov commented 1 week ago

Hi @jurresiegers

There's been some discussion in this topic: https://github.com/nextstrain/nextclade/issues/870#issuecomment-2457224682

This could change, but currently I am not aware of any concrete plans on the Nextstrain team to prepare datasets on this particular topic.

Community contributions are very welcome! Dataset author documentation is here: https://github.com/nextstrain/nextclade_data

jurresiegers commented 1 week ago

Thanks Ivan! I will follow up on that topic :)

ivan-aksamentov commented 1 week ago

@jurresiegers I think it's better to continue here, because that issue was opened for a different reason and is also closed. I'll invite people from there over here.

AMPByrne commented 1 week ago

@ivan-aksamentov and @jurresiegers I've now got a working dataset but have only been able to test on around 300 sequences. Is there any guidance on what's considered adequate testing before submitting datasets?

ivan-aksamentov commented 1 week ago

@AMPByrne There are no particular established criteria - every virus is different.

You could submit a pull request to the data repo, and also give a link to your source repo, where you prepare the dataset, so that other people could test the dataset(s) as well. And then the community can decide if it's any good. And if not, they could suggest improvements. They could also comment in your source repo and submit proposals or fixes there.

The usual points discussed in these situations are the choice of reference sequence, the sampling of sequences for the reference tree, the QC config, how to subdivide datasets if there are multiple distant strains, dataset (path) naming, etc.

lmoncla commented 5 days ago

@AMPByrne this generally sounds great, and I am happy for you to take the lead on the H9 dataset if you are so inclined and are already doing it! We are about to put a manuscript describing our approach for the H5 datasets on bioRxiv, but would be happy to share it with you via email if you'd like to see what we did.

We found that the clade calls tend to be better with more data, so I'd suggest maximizing the number of sequences you include. We also wanted to make sure that we were assigning things according to the clades established by WHO/FAO/WOAH, so we acquired a reference set from them, identified clade-defining nodes, and then tested the performance of Nextclade calls against LABEL using all the H5 data that we maintain for Nextstrain purposes (about 20,000 sequences that were not in the reference set). That was our general approach, and we plan to maintain these H5 datasets, continually work to improve them, and keep them up to date. We are generally happy to help/collaborate with you, though our current bandwidth is a bit limited, so we may not be able to work on this directly in the next couple of months.
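For anyone wanting to run a similar check on a candidate H9 dataset, the comparison described above (Nextclade calls versus an independent set of labels on sequences outside the reference set) could be sketched roughly as follows. This is only an illustration, not the H5 team's actual pipeline. It assumes the Nextclade TSV output with its `seqName` and `clade` columns, and a second, hypothetical two-column table (`name`, `lineage`) holding the independent labels, for example from LABEL or from the nomenclature paper's appendices. The helper names and the toy data are made up.

```python
import csv
import io
from collections import Counter


def read_calls(tsv_text, id_column, clade_column):
    """Parse tab-separated text into a {sequence name: clade} mapping."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_column]: row[clade_column] for row in reader}


def compare_calls(test_calls, reference_calls):
    """Compare two {name: clade} mappings over the sequences present in both."""
    shared = sorted(set(test_calls) & set(reference_calls))
    mismatches = Counter()  # (reference clade, test clade) -> count
    n_agree = 0
    for name in shared:
        expected, observed = reference_calls[name], test_calls[name]
        if expected == observed:
            n_agree += 1
        else:
            mismatches[(expected, observed)] += 1
    concordance = n_agree / len(shared) if shared else float("nan")
    return {
        "n_shared": len(shared),
        "n_agree": n_agree,
        "concordance": concordance,
        "mismatches": mismatches,
    }


# Toy data, invented for illustration only (not real results).
# Column names of the second table are hypothetical.
nextclade_tsv = "seqName\tclade\nseqA\tY\nseqB\tG\nseqC\tB\nseqD\tY\n"
reference_tsv = "name\tlineage\nseqA\tY\nseqB\tG\nseqC\tY\nseqE\tB\n"

result = compare_calls(
    read_calls(nextclade_tsv, "seqName", "clade"),
    read_calls(reference_tsv, "name", "lineage"),
)
print(f"{result['n_agree']}/{result['n_shared']} shared sequences agree")
for (expected, observed), count in result["mismatches"].most_common():
    print(f"expected {expected}, got {observed}: {count}")
```

Listing the mismatches by (expected, observed) pair, rather than reporting only an overall percentage, makes it easier to see whether disagreements cluster around one lineage boundary, which might point at the reference tree sampling or the clade-defining nodes rather than at the sequences themselves.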