jurresiegers opened this issue 1 week ago (status: Open)
Hi @jurresiegers
There's been some discussion in this topic: https://github.com/nextstrain/nextclade/issues/870#issuecomment-2457224682
This could change, but currently I am not aware of any concrete plans on the Nextstrain team to prepare datasets for this particular virus.
Community contributions are very welcome! Dataset author documentation is here: https://github.com/nextstrain/nextclade_data
Thanks Ivan! I will follow up on that topic :)
@jurresiegers I think it's better to continue here, because that issue was about a different topic and is also closed. I'll invite people from there to here.
@ivan-aksamentov and @jurresiegers I've now got a working dataset but have only been able to test on around 300 sequences. Is there any guidance on what's considered adequate testing before submitting datasets?
@AMPByrne There are no particular established criteria - every virus is different.
You could submit a pull request to the data repo, and also give a link to your source repo, where you prepare the dataset, so that other people could test the dataset(s) as well. And then the community can decide if it's any good. And if not, they could suggest improvements. They could also comment in your source repo and submit proposals or fixes there.
The usual points discussed in these situations are the choice of reference sequence, the sampling of sequences for the reference tree, the QC config, how to subdivide datasets if there are multiple distant strains, dataset (path) naming, etc.
@AMPByrne this generally sounds great, and I am happy for you to take the lead on the H9 dataset if you are so inclined, since you are already doing it! We are about to put a manuscript describing our approach for the H5 datasets on bioRxiv, but would be happy to share it with you via email if you'd like to see what we did. We found that the clade calls tend to be better with more data, so I'd suggest maximizing the number of sequences you include. We also wanted to make sure that we were assigning things according to the established clades by WHO/FAO/WOAH, so we acquired a reference set from them, identified clade-defining nodes, and then tested the performance of Nextclade calls against LABEL using all the H5 data that we maintain for Nextstrain purposes (about 20,000 sequences that were not in the reference set). That was our general approach, and we plan to maintain these H5 datasets, continually improve them, and keep them up to date. We are generally happy to help/collaborate with you, though our current bandwidth is a bit limited, so we may not be able to work on this directly in the next couple of months.
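The comparison described above (checking Nextclade calls against a second tool's assignments on sequences outside the reference set) could be sketched roughly as below. This is only an illustrative sketch, not the workflow used for the H5 datasets. It assumes both tools' results are available as tab-separated tables; the `seqName` and `clade` column names follow Nextclade's TSV output, while the second table's column names, the helper functions `read_calls` and `concordance`, and all sequence names and clade labels are made up for the example.

```python
import csv
import io
from collections import Counter


def read_calls(handle, name_col, clade_col, delimiter="\t"):
    """Read a delimited table into a {sequence name: clade} mapping."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row[name_col]: row[clade_col] for row in reader}


def concordance(calls_a, calls_b):
    """Compare two {name: clade} mappings on the sequences they share."""
    shared = sorted(set(calls_a) & set(calls_b))
    # Count each (call_a, call_b) pair where the two tools disagree.
    mismatches = Counter(
        (calls_a[n], calls_b[n]) for n in shared if calls_a[n] != calls_b[n]
    )
    agree = len(shared) - sum(mismatches.values())
    rate = agree / len(shared) if shared else float("nan")
    return {"shared": len(shared), "agree": agree, "rate": rate,
            "mismatches": mismatches}


# Toy data (hypothetical names and clades), standing in for real output files.
nextclade_tsv = io.StringIO(
    "seqName\tclade\n"
    "seq1\tcladeA\n"
    "seq2\tcladeB\n"
    "seq3\tcladeB\n"
    "seq4\tcladeC\n"
)
other_tsv = io.StringIO(
    "name\tclade\n"
    "seq1\tcladeA\n"
    "seq2\tcladeB\n"
    "seq3\tcladeC\n"
    "seq5\tcladeA\n"
)

a = read_calls(nextclade_tsv, "seqName", "clade")
b = read_calls(other_tsv, "name", "clade")
result = concordance(a, b)

print(f"shared={result['shared']} agree={result['agree']} rate={result['rate']:.3f}")
for (call_a, call_b), n in result["mismatches"].most_common():
    print(f"{call_a} vs {call_b}: {n}")
```

Listing the most frequent disagreeing clade pairs, rather than only an overall agreement rate, tends to point at specific clade-defining nodes or reference tree regions that may need attention.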
Hi all,
Would it be possible to get an H9Nx Nextclade build up and running based on the recently published H9 nomenclature paper? This paper included reference datasets (see Appendix 2 and 3) from GISAID/NCBI for all lineages and specific sublineages.
https://wwwnc.cdc.gov/eid/article/30/8/23-1176_article
Best, Jurre Siegers