nextstrain / nextclade_data

Datasets for https://github.com/nextstrain/nextclade
https://clades.nextstrain.org
32 stars 27 forks source link

Add Dengue virus DENVx genotypes dataset #203

Open j23414 opened 5 months ago

j23414 commented 5 months ago

Add a dengue dataset to Nextclade.

ivan-aksamentov commented 5 months ago

Cool!

I'll just drop short links for testing:

DENV1: https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv1

DENV2:https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv2

DENV3:https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv3

DENV4:https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv4

Sadly I don't have any example sequences to run :(

Are there any sequences with permissive licenses available to add them as example sequences into datasets?

Would be nice to fill-in some info to the readme if you have a second. Readme is an optional file though.

j23414 commented 5 months ago

I can add some example sequences, it may take me a moment (aka. not in the next hour).

ivan-aksamentov commented 5 months ago

@j23414 No worries at all. I will not be able to asses the coolness of it anyways, because I lack required science knowledge. But I'll happily test how it runs and whether any bugs manifest themselves sometimes :)

AngieHinrichs commented 5 months ago

FWIW here are some arbitrarily chosen dengue sequences that I use as examples on dev.usher.bio:

https://www.ncbi.nlm.nih.gov/nuccore/OQ605998.1 https://www.ncbi.nlm.nih.gov/nuccore/OQ445967.1 https://www.ncbi.nlm.nih.gov/nuccore/OQ821618.1| https://www.ncbi.nlm.nih.gov/nuccore/OQ622206.1

And NCBI Virus can provide a bunch: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Dengue%20virus,%20taxid:12637

j23414 commented 5 months ago

I lack required science knowledge.

Haha, I also lack the required science knowledge. I'm mostly wandering in the dark.

alex-ranieri commented 5 months ago

We need that! @jamessiqueirap and I proposed a lineage system for dengue. Our work utilizes the genotype mutations table from Nextstrain Dengue.

ivan-aksamentov commented 5 months ago

A small technical suggestion. These sequences seem to contain many mutations - too many for browser SVG engine to render efficiently in Nextclade's sequence views. If there's a clear "main" gene/CDS of interest for this virus, then one workaround would be to set the default CDS in pathogen.json, so that sequence view automatically switches to it when first rendering:

"defaultCds": "S",

Nuc sequence will of course still be available in the dropdown. But users will pay the associated performance price only if they switch to it.

And, on related note, if you need to customize the order of genes in the dropdown, then you could also add

"cdsOrderPreference": [
  "S",
  "N",
  "M",
  "E"
],

Both are just eye-candy features, so no rush.

The ultimate solution will be to implement a more performant sequence viewer in Nextclade. But this is quite far away.

j23414 commented 5 months ago

Thanks @ivan-aksamentov! I think the "main" gene/cds-of-interest for dengue is the E gene, and sometimes it's the only portion of the genome sequenced based on this user comment. I can add to the pathogen.json files following this pattern.

"defaultCds": "E",

Good to know about customizing the order of genes in drop down!

Right now, the dropdown menu matches the gene/cds order in the genome, which feels logical and straightforward to me. However, I welcome differing perspectives on this matter from others in the field. Open to alternatives or potential improvements.

Screenshot 2024-05-29 at 9 27 21 AM
rneher commented 5 months ago

A few additional remarks:

j23414 commented 5 months ago

Thanks @rneher! I tried to incorporate your suggested changes in https://github.com/nextstrain/nextclade_data/pull/203/commits/610e3f5c7a52b61675226282bdf492d899e3bfec

[example sequences] file is missing from the files struct in the pathogen.json.

An oversight on my part, fixed.

you didn't enable any QC. is this on purpose?

I had turned off several QC during development since dengue sequences seemed very divergent. I agree with adding stop and frameshift QC back in, done.

I would include one outgroup sequence (for example your reconstructed ancestor) to pick up non serotype sequences. The branch to that outgroup could be artificially shortened is necessary.

For genotype-level datasets (denv1-4), I swapped in the inferred ancestral root in for the reference and root of the tree. Done, although I could use help in evaluating the genotype-level datasets or any suggested next steps.

I thought about blasting a serotype's sequences against the other 3 serotypes to find the nearest cross-serotype outgroup, but wasn't sure if that would be more or less effective then the inferred ancestral root. Or using the other 3 serotype's inferred-ancestral roots as outgroups. I wasn't sure, but suggestions welcome.

rneher commented 5 months ago

I am not sure, but this is now the root of the DENV1 tree: image

and an entire clade in that tree is DENV2/S

image

are you sure this is correct?

rneher commented 5 months ago

I think the root should be given a clade unassigned or outgroup or not DENV1 and you could root the tree mid-way to the outgroup. I would also color it grey.

j23414 commented 5 months ago

are you sure this is correct?

Gah, I must have copied in the wrong dataset files. I was experimenting with using the "dengue/all" reconstructed root for all 4 serotypes. However, as you observed, it was giving me weird genotype calls (e.g. DENV2 genotypes in the DENV1 tree).

I'll copy the correct ones (and double check this time) in a moment.

ivan-aksamentov commented 5 months ago

I allowed myself to resolve merge conflict which appeared after merging measles #202

rneher commented 4 months ago

thanks, Jennifer. The dataset also contains the genotype annotations. If these are good, you could enable them by adding to them to the meta.extensions as clade-like attributes.

Also, the example data contain two sequences that don't align. That is not a problem per se if these sequences are very weird (and having examples of bad sequences is fine), but if this is unexpected that one could maybe tune parameters.

j23414 commented 4 months ago

The dataset also contains the genotype annotations.

Thanks for the question @rneher! Some clarification that the genotype annotations (named genotype_nextclade) are from some placeholder DENV1-4 Nextclade datasets that were created so long ago I'm unsure of dataset provenance. (ergo, I wouldn't trust it yet). It requires predetermining the serotype (from NCBI annotations) before querying it against the DENV1-4 datasets individually and being appended to the metadata.

The more concerning problem occurs when we zoom into individual serotype trees (e.g. DENV2) where the genotype_nextclade does not quite match the augur clades's clade_membership annotations.

I believe @trvrb was going to explore modifying aa-mut defining mutations in clades_genotypes.tsv to apply to the "all" tree. Currently the aa-coords are by serotype reference (e.g. against the DENV1 reference, against the DENV3 reference which has a two amino acid deletion in E gene, etc).

j23414 commented 4 months ago

the example data contain two sequences that don't align.

Thanks for flagging! I assume it's an example sequence with >GenBankID_? headers and not something with a serotype annotation (e.g. >GenBankID_DENV1). I'm expecting them not to align and are representing some sylvatic (in the wild and lowly monitored) samples, but yes we can explore "rescuing" them (or assigning them a new serotype) if that makes biological sense.

ivan-aksamentov commented 4 months ago

@j23414 I think why Richard is asking about the unalignable or otherwise "broken" (from the point of view of Nextclade results) example sequences is that we had a situation with SC2 dataset, when users come confused after trying Nextclade with example sequences and receiving error or warning messages. They thought they did something wrong or that there is a bug. So we try to keep examples nice and high quality since then.

From one side, "broken" sequence might tell a story about some interesting science fact or just showcase how Nextclade software handles that particular situation technically - which is interesting. On the other hand, without context it might be unclear for the target audience. If you plan on keeping these samples, then perhaps you could explain the details in the readme. Alternatively, there might be a sciency solution, as you mentioned, to make them "good". Otherwise you could just delete the bad examples to avoid the troubles.

j23414 commented 4 months ago

I see, thank you for the context! I agree that dropping them (unalignable example sequences) is the smoother path forward.

Dropped from the other PR https://github.com/nextstrain/nextclade_data/pull/208 in commit https://github.com/nextstrain/nextclade_data/pull/208/commits/932affc2260bdb1e5f4ef43c0c839877ecdcdafc

Since the "all" directory is being added in the other PR, I've ommited the "all" directory here as commit https://github.com/nextstrain/nextclade_data/pull/203/commits/88171f43aa3b92ac7d26de69d4a17934f3041d4d