Add influenza A H5N1 dataset

anna-parker commented 2 months ago

Copy of the nextclade dataset created by @chaoran-chen in https://github.com/GenSpectrum/nextclade-datasets/tree/main/data/flu/h5n1.

ivan-aksamentov commented 2 months ago

Hi! Oh cool, new datasets! :)

Sorry, I am not in the loop on these new developments. I'll let Richard and Cornelius review.

But I just want to bring up a few technical/bureaucratic issues:

Is it different from https://github.com/nextstrain/nextclade_data/tree/master/data/community/moncla-lab/iav-h5/ha ?

Is anyone from Nextstrain involved in the development, so that it belongs in the "nextstrain" collection rather than in "community"?

anna-parker commented 2 months ago

Hi Ivan! The main difference is that this contains references for all 8 segments.

About people from nextstrain being involved... I guess not really - should I move this into community under genspectrum?

anna-parker commented 2 months ago

Thanks for the quick review - I will update with the requested changes tomorrow!

chaoran-chen commented 2 months ago

As far as I understood, PB2, PB1, etc. are the names of genes but not really the names of the segments, and some segments have multiple genes. Also, I chose them because, for GenSpectrum/LAPIS, it is better to avoid using the same names for nucleotide and amino acid sequences. Having the same names would make filtering for mutations more difficult because it would be unclear whether HA:123G refers to a nucleotide or amino acid mutation.

This being said, this is independent of the Nextclade datasets, so we can, of course, rename them here (and just name them differently when importing into LAPIS)
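To illustrate the ambiguity described above, here is a minimal sketch of a mutation-query parser. It is not LAPIS code: the `classify_mutation` function, the `seg1` to `seg8` names, and the gene list are assumptions chosen for illustration only. It shows that when nucleotide sequences and genes never share a name, the name alone decides how a query such as HA:123G is read.

```python
import re

# Hypothetical naming scheme: nucleotide segments are seg1..seg8,
# genes keep their usual names (illustrative subset, not exhaustive).
SEGMENTS = {f"seg{i}" for i in range(1, 9)}
GENES = {"PB2", "PB1", "PA", "HA", "NP", "NA", "M1", "M2", "NS1", "NS2"}

# Matches queries of the form NAME:POSITION followed by one symbol, e.g. HA:123G.
MUTATION = re.compile(r"^(?P<name>[^:]+):(?P<pos>\d+)(?P<symbol>[A-Za-z*-])$")


def classify_mutation(query):
    """Return (kind, name, position, symbol) for a mutation query."""
    match = MUTATION.match(query)
    if match is None:
        raise ValueError(f"not a mutation query: {query!r}")
    name = match["name"]
    if name in SEGMENTS:
        kind = "nucleotide"
    elif name in GENES:
        kind = "amino acid"
    else:
        raise ValueError(f"unknown sequence name: {name!r}")
    return kind, name, int(match["pos"]), match["symbol"].upper()


print(classify_mutation("HA:123G"))
print(classify_mutation("seg4:123G"))
```

If HA were also the name of the nucleotide segment, the lookup in `classify_mutation` could no longer tell the two cases apart without extra syntax.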

anna-parker commented 2 months ago

@corneliusroemer would it be ok to keep the segments as seg1, etc., now that we have moved to a community folder?

corneliusroemer commented 2 months ago

Community still shows in Nextclade by default so we should make sure Readme etc are meaningful. I'll review properly.

You can use the dataset with Nextclade even without it being merged - just need to point it at the right repo/branch/path.

I still think paths should be as obvious as possible, and using segment numbers is inconsistent with usage in the flu community and also with other Nextclade datasets for flu. Why does the path matter so much? For GenSpectrum, if you want to avoid a clash of CDS names with segment names, you could just prefix segment names in queries with seg or nuc. You already do that, just with the numbers 1-8 rather than the more commonly used gene-based names.

Also, Nextclade paths are just paths: you could call the segments whatever you want and just map from the path to the segment name, if you want those to be different.
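A minimal sketch of such a mapping, assuming the dataset path ends in a gene-based segment name. The `segment_for_dataset_path` helper, the `seg1` to `seg8` labels, and the example path are hypothetical and not part of Nextclade or LAPIS; only the segment numbering of influenza A (1 = PB2 through 8 = NS) is standard.

```python
from pathlib import PurePosixPath

# Standard influenza A segment order, keyed by gene-based segment name.
# Segment 7 is listed under both "m" and "mp" because both spellings are in use.
GENE_TO_SEGMENT = {
    "pb2": "seg1",
    "pb1": "seg2",
    "pa": "seg3",
    "ha": "seg4",
    "np": "seg5",
    "na": "seg6",
    "m": "seg7",
    "mp": "seg7",
    "ns": "seg8",
}


def segment_for_dataset_path(path):
    """Map the last component of a dataset path to a numbered segment label."""
    gene = PurePosixPath(path).name.lower()
    try:
        return GENE_TO_SEGMENT[gene]
    except KeyError:
        raise ValueError(f"no segment known for path component {gene!r}") from None


# Hypothetical dataset path, for illustration only.
print(segment_for_dataset_path("community/example/h5n1/pb2"))
```

With a table like this, the dataset paths can follow the gene-based convention while an importer keeps using numbered labels internally.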

chaoran-chen commented 2 months ago

But are HA, NA, etc. really the correct and firmly established names in the community for the segments? NCBI Virus shows the numbers in the segment column:

If we look at the sequence names, it's often a mix. For the NCBI RefSeq of H5N1, this sequence only contains "HA", this sequence only contains "segment 7" (not "M"), and this sequence contains both "segment 1" and "PB 2".

As said, for Nextclade, I am happy (and agree that it makes sense) to follow the Nextstrain conventions. For GenSpectrum, the evidence that I found so far indicates that segments 1-8 are accepted (and actually correct) names for the nucleotide sequences and that we should use them. (But I can be convinced otherwise if an influenza expert (e.g. @rneher) believes that this doesn't make sense.)

rneher commented 1 month ago

I do think that PB2, PB1, PA etc are more common segment names than the numbers and we currently use the names rather than numbers across nextclade and nextstrain. So clearly both nomenclatures exist.

The Moncla lab datasets for all of H5Nx use the same reference for HA (Goose/Guangdong). For others they use more recent sequences.

One thing that might make things a little more difficult for you is that there are also quite a few strains that have the H5 HA but have reassorted and use sequences very dissimilar from the Goose/Guangdong sequence in other segments.

rneher commented 1 month ago

That said, using the GG/1996 sequence is probably still useful. But thinking about the name space might be important.

iav for Influenza A virus could be useful (sort of different for humans, since we have a few defined lineages of A and B circulating, but in animals it is a diverse mix of A). If you want to restrict yourself to viruses from the GG/1996 lineages, then maybe a name iav/h5n1/GG1996/pb1 etc could work.

anna-parker commented 1 month ago

Thanks so much for the comments! I will try to add them in later today!

It would be great, though, if we could not touch this branch for the next 5 hours, as we are using it in a demo. Thanks again for all the help!

nextstrain / nextclade_data

Add influenza A H5N1 dataset #217