nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue

Rename "subtype" to "genotype" #41

Closed: j23414 closed this issue 4 months ago

j23414 commented 5 months ago

Context

In response to comment:

I think we should generally be consistent with the nomenclature. I see that for metadata you have nextclade_subtype with entries like DENV1/II. This is canonically "DENV genotype". I suggest aiming for two columns in the metadata: one for serotype with DENV1, DENV2, etc., and one for denv_genotype with DENV1/II, etc. This is similar to how things work for SARS-CoV-2, with a clade column as well as a lineage column. mpox also uses clade and lineage as separate columns.

Description

Currently we have

Of course feel free to comment on this GitHub Issue with other suggestions. Optionally, we could reorder the metadata columns such that ncbi_serotype and nextclade_genotype are next to each other to make this distinction more obvious to people manually looking at the metadata file.
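The optional column reordering mentioned above could be sketched roughly as follows. This is only an illustration using Python's standard csv module, not the repository's actual ingest code; the helper names and the sample `strain` and `date` columns are assumptions made for the example.

```python
import csv
import io


def move_column_next_to(header, column, anchor):
    """Return a new header list with `column` placed right after `anchor`.

    If either name is missing, the header is returned unchanged.
    """
    if column not in header or anchor not in header:
        return list(header)
    reordered = [name for name in header if name != column]
    reordered.insert(reordered.index(anchor) + 1, column)
    return reordered


def reorder_tsv(text, column, anchor):
    """Rewrite TSV text so that `column` immediately follows `anchor`."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = move_column_next_to(reader.fieldnames or [], column, anchor)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=header, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(reader)
    return out.getvalue()


# Hypothetical two-line metadata file, for illustration only.
sample = (
    "strain\tncbi_serotype\tdate\tnextclade_genotype\n"
    "A\tDENV1\t2020\tDENV1/II\n"
)
reordered = reorder_tsv(sample, "nextclade_genotype", "ncbi_serotype")
```

With the sample above, `nextclade_genotype` ends up directly after `ncbi_serotype`, which is the side-by-side layout suggested for people reading the metadata file by hand.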

j23414 commented 4 months ago

Following from https://github.com/nextstrain/dengue/pull/48#discussion_r1598828618, I am expanding the scope of this issue to establish standard metadata column names for dengue serotype and genotype, and for the various methods we use to derive them for a specific strain (NCBI, Nextclade, and augur clades (Nextstrain)).

Below is a proposed standardization along with suggested modifications:

| Metadata Column | Auspice Display Title | Description of data |
| --- | --- | --- |
| ncbi_serotype -> ~serotype_ncbi~ -> serotype_genbank | NCBI serotype -> ~Serotype (NCBI)~ -> Serotype (Genbank metadata) | Indicates that the assignment of denv1-4 is based on NCBI GenBank record annotation. |
| clade_membership during the "all_genome build" | Serotype -> Serotype (Nextstrain) | Indicates that the assignment of denv1-4 is based on augur clades call using full-genome-level-serotype-defining amino acid mutations. |
| nextclade_subtype -> genotype_nextclade | Nextclade genotype -> ~Genotype (Nextclade)~ -> Dengue Genotype (Nextclade) | Denotes genotype level assignment (e.g., DENV1/S) within serotype, based on Nextclade call. |
| clade_membership during the "denvX_genome builds" | DENV genotype -> ~Genotype (Nextstrain)~ -> Dengue Genotype (Nextstrain) | Denotes genotype level assignment (e.g., DENV1/S) within serotype, based on augur clades call using full-genome-level-genotype-defining amino acid mutations. |

Feel free to suggest other naming conventions along with written justification. This also leaves room for the potential inclusion of ~Genotype (NCBI)~ -> Dengue Genotype (Genbank metadata) if a script is developed to parse genotypes from GenBank data.
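As a rough illustration of the proposed standardization, the two explicit column renames and their display titles from the table above could be captured as plain mappings and applied to a metadata header. This is a sketch in plain Python, not how the dengue workflow actually performs the rename (the thread does not show that); the function names and the sample `strain` column are assumptions.

```python
# Column renames proposed in the table above (old name -> new name).
COLUMN_RENAMES = {
    "ncbi_serotype": "serotype_genbank",
    "nextclade_subtype": "genotype_nextclade",
}

# Proposed Auspice display titles for the renamed columns.
DISPLAY_TITLES = {
    "serotype_genbank": "Serotype (Genbank metadata)",
    "genotype_nextclade": "Dengue Genotype (Nextclade)",
}


def rename_header(header, renames=COLUMN_RENAMES):
    """Return the header with any known old column names replaced."""
    return [renames.get(name, name) for name in header]


def rename_tsv_header(text, renames=COLUMN_RENAMES):
    """Rename columns in the first line of TSV text; data rows are untouched."""
    first_line, newline, rest = text.partition("\n")
    new_header = "\t".join(rename_header(first_line.split("\t"), renames))
    return new_header + newline + rest


# Hypothetical metadata snippet, for illustration only.
sample = "strain\tncbi_serotype\tnextclade_subtype\nA\tDENV1\tDENV1/II\n"
renamed = rename_tsv_header(sample)
```

Keeping the renames in one mapping makes it easy to revisit the naming if the discussion settles on different column names.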

kimandrews commented 4 months ago

Thanks for this very organized summary! Based on comments here, we may want to use Serotype (GenBank metadata) rather than Serotype (NCBI) and change Genotype to Dengue Genotype (I'm planning to make similar changes for the live measles tree eventually). I think distinguishing between Genotype (Nextclade) and Genotype (Nextstrain) could be confusing, but I'm not sure if there is a better solution if we need to include output from both of these analyses on the same tree. Also I prefer Clade over Genotype, but it's probably best to use Genotype if that is what is used in the Dengue literature.

j23414 commented 4 months ago

Thanks for linking some more recent discussion on naming! I'm open to changing NCBI to GenBank metadata (e.g. Serotype (NCBI) -> Serotype (GenBank metadata)). Clarification question: are you also planning to update the metadata column names? For example, changing genotype_ncbi to genotype_genbank during ingest here?

I ran a quick PubMed search and it looks like the dengue literature uses Genotype. I've linked the results below but good to check!

kimandrews commented 4 months ago

It's probably a good idea to change from genotype_ncbi to genotype_genbank in the measles repo as you suggested. I don't have strong opinions about these changes.

j23414 commented 4 months ago

Thanks for clarifying that it's an optional path. I went ahead and updated the table above accordingly.