Closed j23414 closed 4 months ago
Following from https://github.com/nextstrain/dengue/pull/48#discussion_r1598828618, expanding the scope of this issue to establish standard metadata column names pertaining to Dengue serotype, genotype, and the various methods we employ to derive them for a specific strain (NCBI, Nextclade, augur clades (Nextstrain)).
Below is a proposed standardization along with suggested modifications:
Metadata Column | Auspice Display Title | Description of data |
---|---|---|
ncbi_serotype -> ~~serotype_ncbi~~ -> serotype_genbank | NCBI serotype -> ~~Serotype (NCBI)~~ -> Serotype (Genbank metadata) | Indicates that the assignment of denv1-4 is based on NCBI GenBank record annotation. |
clade_membership during the "all_genome build" | Serotype -> Serotype (Nextstrain) | Indicates that the assignment of denv1-4 is based on augur clades call using full-genome-level-serotype-defining amino acid mutations. |
nextclade_subtype -> genotype_nextclade | Nextclade genotype -> ~~Genotype (Nextclade)~~ -> Dengue Genotype (Nextclade) | Denotes genotype level assignment (e.g., DENV1/S) within serotype, based on Nextclade call. |
clade_membership during the "denvX_genome builds" | DENV genotype -> ~~Genotype (Nextstrain)~~ -> Dengue Genotype (Nextstrain) | Denotes genotype level assignment (e.g., DENV1/S) within serotype, based on augur clades call using full-genome-level-genotype-defining amino acid mutations. |
Feel free to suggest other naming conventions along with written justification. This also leaves room for the potential inclusion of ~~Genotype (NCBI)~~ Dengue Genotype (Genbank metadata) if a script is developed to parse genotypes from GenBank data.
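The column renames proposed in the table could be applied to a metadata TSV roughly as follows. This is a minimal sketch using only the Python standard library, not the repository's actual ingest code; the function name `rename_metadata_columns` and the example rows are hypothetical, and the mapping assumes the names in the table are the final ones.

```python
import csv
import io

# Renames taken from the proposed table; adjust if the final names differ.
COLUMN_RENAMES = {
    "ncbi_serotype": "serotype_genbank",
    "nextclade_subtype": "genotype_nextclade",
}


def rename_metadata_columns(tsv_text, renames=COLUMN_RENAMES):
    """Return the TSV text with header names replaced according to `renames`."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    # Only the header row changes; data rows are written back untouched.
    rows[0] = [renames.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical two-line metadata file for illustration.
example = "strain\tncbi_serotype\tnextclade_subtype\nA1\tdenv1\tDENV1/S\n"
print(rename_metadata_columns(example))
```

Columns that are not in the mapping pass through unchanged, so the same helper would keep working if further renames are added later.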
Thanks for this very organized summary! Based on comments here, we may want to use `Serotype (GenBank metadata)` rather than `Serotype (NCBI)` and change `Genotype` to `Dengue Genotype` (I'm planning to make similar changes for the live measles tree eventually). I think distinguishing between `Genotype (Nextclade)` and `Genotype (Nextstrain)` could be confusing, but I'm not sure if there is a better solution if we need to include output from both of these analyses on the same tree. Also, I prefer `Clade` over `Genotype`, but it's probably best to use `Genotype` if that is what is used in the Dengue literature.
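To make the display-title question concrete, here is a sketch of how the three colorings might sit side by side in an Auspice config. It assumes the config uses a `colorings` list of entries with `key`, `title`, and `type` fields, and that build names look like `all_genome` or `denv1_genome`; the helper `colorings_for_build` is hypothetical and the repository's real config may be organized differently.

```python
import json


def colorings_for_build(build):
    """Build Auspice coloring entries for one build (hypothetical helper)."""
    # clade_membership holds serotypes in the all-genome build and
    # genotypes in the per-serotype builds, so its title depends on the build.
    if build == "all_genome":
        clade_title = "Serotype (Nextstrain)"
    else:
        clade_title = "Dengue Genotype (Nextstrain)"
    return [
        {"key": "serotype_genbank", "title": "Serotype (Genbank metadata)", "type": "categorical"},
        {"key": "genotype_nextclade", "title": "Dengue Genotype (Nextclade)", "type": "categorical"},
        {"key": "clade_membership", "title": clade_title, "type": "categorical"},
    ]


print(json.dumps({"colorings": colorings_for_build("denv1_genome")}, indent=2))
```

Keeping the method in parentheses after the shared term is what lets the Nextclade and Nextstrain calls appear in the same color-by dropdown without colliding.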
Thanks for linking some more recent discussion on naming! I'm open to changing `NCBI` to `GenBank metadata` (e.g. `Serotype (NCBI)` -> `Serotype (GenBank metadata)`). Clarification question: are you also planning to update the metadata column names? For example, changing `genotype_ncbi` to `genotype_genbank` during ingest here?
I ran a quick PubMed search and it looks like the dengue literature uses `Genotype`. I've linked the results below but good to check!
It's probably a good idea to change from `genotype_ncbi` to `genotype_genbank` in the measles repo as you suggested. I don't have strong opinions about these changes.
Thanks for clarifying that this is an optional path. I went ahead and updated the table above accordingly.
Context
In response to comment:
Description
Currently we have:

- `ncbi_serotype` because we are relying on "NCBI" annotation as the source of serotype assignment. No change to the column name here.
- `nextclade_subtype` because we are using "nextclade" for genotype assignment. Rename this to "nextclade_genotype".

Of course, feel free to comment on this GitHub Issue with other suggestions. Optionally, we could reorder the metadata columns such that `ncbi_serotype` and `nextclade_genotype` are next to each other to make this distinction more obvious to people manually looking at the metadata file.
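The optional reordering could look something like the sketch below, which moves one column so it sits directly after another. The helper name `place_adjacent` and the surrounding columns (`strain`, `date`, `region`) are hypothetical, and the sketch only computes the new header order; data rows would need to be rearranged to match.

```python
def place_adjacent(columns, first, second):
    """Return a new column order with `second` immediately after `first`."""
    if first not in columns or second not in columns:
        # Leave the order alone when either column is missing.
        return list(columns)
    reordered = [c for c in columns if c != second]
    reordered.insert(reordered.index(first) + 1, second)
    return reordered


# Hypothetical header for illustration.
header = ["strain", "ncbi_serotype", "date", "region", "nextclade_genotype"]
print(place_adjacent(header, "ncbi_serotype", "nextclade_genotype"))
```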