nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License

Add a BA.1 reference for the web nextclade version #1426

Open murallcl opened 9 months ago

murallcl commented 9 months ago

Hello!
We would like to run some datasets against a BA.1 reference using the web version of Nextclade.
Currently the available references are Wuhan, BA.2, and some of BA.2's descendants. Could a BA.1 reference dataset be added? Thank you!

ivan-aksamentov commented 9 months ago

Hi @murallcl,

This might be possible if our scientists still have the time and energy!

In the meantime, did you know that we now have a guide for creating, using and sharing your own datasets: https://github.com/nextstrain/nextclade_data/blob/master/docs/README.md ?

And you can find the machinery used for preparing our SC2 datasets here: https://github.com/neherlab/nextclade_data_workflows

It would be great if you could help!

corneliusroemer commented 9 months ago

Hi @murallcl - would you be able to explain your use case for the BA.1 reference dataset? That would help us understand whether there might be another way to achieve your goal without requiring a new dataset.

I'm not sure we will add new datasets for clades that have died out. However, you could patch the existing Wuhan-Hu-1 dataset to make your own BA.1 reference dataset.

In addition, we've been thinking about allowing mutation reporting relative to arbitrary references in the future - but this feature is still some time out.
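
As a toy illustration of what reporting mutations relative to a different reference would mean, here is a minimal Python sketch. It is not Nextclade's implementation: `call_substitutions` and the short sequences are made up for illustration, and the inputs are assumed to be already aligned to the same coordinates.

```python
def call_substitutions(reference: str, query: str) -> list[str]:
    """Return substitutions as '<ref base><1-based position><query base>'.

    Assumes both sequences are already aligned and of equal length.
    Positions where either sequence has a gap ('-') or 'N' are skipped.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N"}
    return [
        f"{r}{i}{q}"
        for i, (r, q) in enumerate(zip(reference, query), start=1)
        if r != q and r not in skip and q not in skip
    ]


# Toy sequences, not real genome data.
wuhan_like = "ACGTACGTAC"
ba1_like = "ACGAACGTTC"  # differs from wuhan_like at positions 4 and 9
sample = "ACGAACCTTC"    # same two changes, plus one more at position 7

print(call_substitutions(wuhan_like, sample))  # ['T4A', 'G7C', 'A9T']
print(call_substitutions(ba1_like, sample))    # ['G7C']
```

Against the BA.1-like reference, only the change that is new relative to BA.1 is reported, which is the kind of output a retrospective BA.1-era comparison would want.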

murallcl commented 9 months ago

Hello Ivan and Cornelius,
Thanks for your prompt responses. We're using it to run some retrospective analyses (i.e. we're comparing sets of sequences from the period when BA.1 was circulating), and this is also doubling as genomic epi training for non-bioinformaticians (so, not command-line users of Nextclade). More generally though, I think a BA.1 reference (or an original Omicron B.1.1.529 reference) may still be relevant for those investigating long-lasting chronic infections or zoonosis.

I tried adding the BA.1 FASTA consensus sequence from your lineage library (https://github.com/corneliusroemer/pango-sequences) by dropping it in to 'customize' the Wuhan dataset, but it gave me an error when it came time to compare it to the sequences. I assumed I was missing input data for the customization, but I did not investigate further. I can appreciate this isn't high priority, but if there are any further suggestions on how to customize the Wuhan dataset with the BA.1 SNPs, I'd be happy to try them. Thanks! Carmen Lia

rneher commented 6 months ago

Just replacing the reference without replacing the tree won't work, since mutations in the tree are coded relative to the reference. But you could use a dataset without a tree, which won't give you any clades. You could use this minimal dataset

https://github.com/nextstrain/nextclade_data/tree/master/docs/minimal-dataset

and replace the reference with BA.1 and the annotation with the one from the SARS-CoV-2 dataset. This will then call mutations relative to BA.1, but won't assign any lineages (it needs a tree for that).
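
The file shuffling described above could be scripted roughly as follows. This is only a sketch under assumptions: the function name `build_reference_only_dataset` is hypothetical, and the file names `reference.fasta` and `genome_annotation.gff3` are guesses that may differ between Nextclade versions, so check them against the minimal dataset and the dataset guide. The dataset's configuration file may also need to list the annotation file; that step is not handled here.

```python
import shutil
from pathlib import Path


def build_reference_only_dataset(
    minimal_dir: Path,
    new_reference: Path,
    annotation: Path,
    out_dir: Path,
    reference_name: str = "reference.fasta",          # assumed file name
    annotation_name: str = "genome_annotation.gff3",  # assumed file name
) -> Path:
    """Copy a minimal (tree-less) dataset, then swap in a new reference
    sequence and a genome annotation.

    No tree is added, so a dataset built this way can only be used for
    calling mutations relative to the new reference, not for assigning
    clades or lineages.
    """
    if out_dir.exists():
        raise FileExistsError(f"{out_dir} already exists")
    # Start from an unmodified copy of the minimal dataset.
    shutil.copytree(minimal_dir, out_dir)
    # Overwrite the reference with the new one (e.g. a BA.1 consensus).
    shutil.copyfile(new_reference, out_dir / reference_name)
    # Add the annotation taken from the SARS-CoV-2 dataset.
    shutil.copyfile(annotation, out_dir / annotation_name)
    return out_dir


# Example call with hypothetical local paths:
# build_reference_only_dataset(
#     Path("minimal-dataset"),
#     Path("BA.1.fasta"),
#     Path("sars-cov-2/genome_annotation.gff3"),
#     Path("ba1-dataset"),
# )
```

The annotation coordinates come from the Wuhan-based dataset, so this sketch assumes the BA.1 sequence is close enough in coordinates for that annotation to remain usable.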