Open j23414 opened 5 months ago
Cool!
I'll just drop short links for testing:
Sadly I don't have any example sequences to run :(
Are there any sequences with permissive licenses available to add them as example sequences into datasets?
Would be nice to fill-in some info to the readme if you have a second. Readme is an optional file though.
I can add some example sequences, it may take me a moment (aka. not in the next hour).
@j23414 No worries at all. I will not be able to assess the coolness of it anyways, because I lack required science knowledge. But I'll happily test how it runs and whether any bugs manifest themselves :)
FWIW here are some arbitrarily chosen dengue sequences that I use as examples on dev.usher.bio:
https://www.ncbi.nlm.nih.gov/nuccore/OQ605998.1 https://www.ncbi.nlm.nih.gov/nuccore/OQ445967.1 https://www.ncbi.nlm.nih.gov/nuccore/OQ821618.1 https://www.ncbi.nlm.nih.gov/nuccore/OQ622206.1
And NCBI Virus can provide a bunch: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Dengue%20virus,%20taxid:12637
I lack required science knowledge.
Haha, I also lack the required science knowledge. I'm mostly wandering in the dark.
We need that! @jamessiqueirap and I proposed a lineage system for dengue. Our work utilizes the genotype mutations table from Nextstrain Dengue.
A small technical suggestion. These sequences seem to contain many mutations - too many for browser SVG engine to render efficiently in Nextclade's sequence views. If there's a clear "main" gene/CDS of interest for this virus, then one workaround would be to set the default CDS in pathogen.json, so that sequence view automatically switches to it when first rendering:
"defaultCds": "S",
Nuc sequence will of course still be available in the dropdown. But users will pay the associated performance price only if they switch to it.
And, on related note, if you need to customize the order of genes in the dropdown, then you could also add
"cdsOrderPreference": [
"S",
"N",
"M",
"E"
],
Both are just eye-candy features, so no rush.
The ultimate solution will be to implement a more performant sequence viewer in Nextclade. But this is quite far away.
Thanks @ivan-aksamentov! I think the "main" gene/cds-of-interest for dengue is the E gene, and sometimes it's the only portion of the genome sequenced based on this user comment. I can add to the pathogen.json files following this pattern.
"defaultCds": "E",
Good to know about customizing the order of genes in drop down!
Right now, the dropdown menu matches the gene/cds order in the genome, which feels logical and straightforward to me. However, I welcome differing perspectives on this matter from others in the field. Open to alternatives or potential improvements.
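For anyone following along, the two settings discussed above could be applied to a pathogen.json with a small script. This is only a minimal sketch: it assumes pathogen.json is plain JSON with defaultCds and cdsOrderPreference as top-level keys (as in the snippets quoted in this thread), and the helper name set_cds_display and the placeholder input are made up for illustration.

```python
import json
from typing import List, Optional


def set_cds_display(pathogen: dict, default_cds: str,
                    order: Optional[List[str]] = None) -> dict:
    """Return a copy of a pathogen.json dict with CDS display settings applied.

    Hypothetical helper, not part of Nextclade. The key names come from the
    snippets quoted in this thread.
    """
    updated = dict(pathogen)  # shallow copy so the input dict is not modified
    updated["defaultCds"] = default_cds
    if order is not None:
        # Only set an explicit order when asked; otherwise leave the key absent
        # so the dropdown keeps following the genome order.
        updated["cdsOrderPreference"] = list(order)
    return updated


# Placeholder content standing in for a real pathogen.json.
example = {"someExistingKey": "kept as-is"}
result = set_cds_display(example, "E")
print(json.dumps(result, indent=2))
```

Leaving cdsOrderPreference unset keeps the genome-order behaviour described above, so only the default view changes.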
A few additional remarks:
The example sequences file is missing from the files struct in the pathogen.json.
Thanks @rneher! I tried to incorporate your suggested changes in https://github.com/nextstrain/nextclade_data/pull/203/commits/610e3f5c7a52b61675226282bdf492d899e3bfec
[example sequences] file is missing from the files struct in the pathogen.json.
An oversight on my part, fixed.
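An oversight like this can be caught with a quick consistency check before pushing. The sketch below is an illustration only: it assumes the files struct maps logical names to file names relative to the dataset directory, and that the example sequences are registered under a key called examples. Both are assumptions to verify against the Nextclade dataset documentation, and missing_dataset_files is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def missing_dataset_files(pathogen, dataset_dir, expected_keys=("examples",)):
    """Report expected keys absent from `files` and listed files absent on disk.

    Hypothetical helper; the "examples" key name is an assumption.
    """
    files = pathogen.get("files", {})
    absent_keys = [key for key in expected_keys if key not in files]
    absent_paths = [
        name for name in files.values()
        if not (Path(dataset_dir) / name).is_file()
    ]
    return absent_keys, absent_paths


# Demo with a throwaway directory that holds only a reference file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "reference.fasta").write_text(">ref\nACGT\n")
    pathogen = {"files": {"reference": "reference.fasta"}}
    absent_keys, absent_paths = missing_dataset_files(pathogen, tmp)
    print("keys not listed:", absent_keys)
    print("files not found:", absent_paths)
```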
you didn't enable any QC. is this on purpose?
I had turned off several QC rules during development since dengue sequences seemed very divergent. I agree with adding stop and frameshift QC back in, done.
I would include one outgroup sequence (for example your reconstructed ancestor) to pick up non-serotype sequences. The branch to that outgroup could be artificially shortened if necessary.
For genotype-level datasets (denv1-4), I swapped the inferred ancestral root in for the reference and root of the tree. Done, although I could use help in evaluating the genotype-level datasets or any suggested next steps.
I thought about blasting a serotype's sequences against the other 3 serotypes to find the nearest cross-serotype outgroup, but wasn't sure if that would be more or less effective than the inferred ancestral root. Another option would be using the other 3 serotypes' inferred ancestral roots as outgroups. I wasn't sure, but suggestions welcome.
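As a rough illustration of switching those two checks back on, the sketch below toggles an enabled flag per rule. The layout it assumes (a top-level qc object with frameShifts and stopCodons entries, each carrying an enabled boolean) is an assumption and should be checked against the Nextclade pathogen.json documentation. enable_qc_rules is a hypothetical helper.

```python
import copy


def enable_qc_rules(pathogen, rules=("frameShifts", "stopCodons")):
    """Return a deep copy of a pathogen.json dict with the given QC rules enabled.

    Hypothetical helper; the "qc", rule, and "enabled" key names are assumptions.
    Existing per-rule settings are preserved, only "enabled" is set to True.
    """
    updated = copy.deepcopy(pathogen)
    qc = updated.setdefault("qc", {})
    for rule in rules:
        qc.setdefault(rule, {})["enabled"] = True
    return updated


# Placeholder config in which one rule was disabled during development.
before = {"qc": {"stopCodons": {"enabled": False}}}
after = enable_qc_rules(before)
print(after["qc"])
```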
I am not sure, but this is now the root of the DENV1 tree:
and an entire clade in that tree is DENV2/S
are you sure this is correct?
I think the root should be given a clade "unassigned" or "outgroup" or "not DENV1", and you could root the tree mid-way to the outgroup. I would also color it grey.
are you sure this is correct?
Gah, I must have copied in the wrong dataset files. I was experimenting with using the "dengue/all" reconstructed root for all 4 serotypes. However, as you observed, it was giving me weird genotype calls (e.g. DENV2 genotypes in the DENV1 tree).
I'll copy the correct ones (and double check this time) in a moment.
I allowed myself to resolve merge conflict which appeared after merging measles #202
Thanks, Jennifer. The dataset also contains the genotype annotations. If these are good, you could enable them by adding them to the meta.extensions as clade-like attributes.
Also, the example data contain two sequences that don't align. That is not a problem per se if these sequences are very weird (and having examples of bad sequences is fine), but if this is unexpected then one could maybe tune parameters.
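A sketch of what registering such an attribute in the tree JSON might look like is below. The thread only names meta.extensions. The nested nextclade.clade_node_attrs path and the name/displayName/description fields are assumptions to verify against the Nextclade documentation, and add_clade_like_attr is a hypothetical helper. The attribute name genotype_nextclade is the one mentioned in this discussion.

```python
import copy


def add_clade_like_attr(tree, name, display_name, description=""):
    """Return a deep copy of a tree JSON dict with a clade-like attribute registered.

    Hypothetical helper; the "nextclade" / "clade_node_attrs" path and the
    field names are assumptions, not confirmed by this thread.
    """
    updated = copy.deepcopy(tree)
    attrs = (
        updated.setdefault("meta", {})
        .setdefault("extensions", {})
        .setdefault("nextclade", {})
        .setdefault("clade_node_attrs", [])
    )
    # Avoid registering the same attribute twice.
    if not any(attr.get("name") == name for attr in attrs):
        attrs.append(
            {"name": name, "displayName": display_name, "description": description}
        )
    return updated


tree = {"meta": {}, "tree": {}}  # placeholder standing in for a real tree JSON
with_attr = add_clade_like_attr(tree, "genotype_nextclade", "Genotype")
print(with_attr["meta"]["extensions"])
```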
The dataset also contains the genotype annotations.
Thanks for the question @rneher! To clarify, the genotype annotations (named genotype_nextclade) are from some placeholder DENV1-4 Nextclade datasets that were created so long ago that I'm unsure of their provenance (ergo, I wouldn't trust them yet). Producing them requires predetermining the serotype (from NCBI annotations) before querying the sequences against the DENV1-4 datasets individually and appending the results to the metadata.
The more concerning problem occurs when we zoom into individual serotype trees (e.g. DENV2), where the genotype_nextclade does not quite match the augur clades's clade_membership annotations.
I believe @trvrb was going to explore modifying aa-mut defining mutations in clades_genotypes.tsv to apply to the "all" tree. Currently the aa-coords are by serotype reference (e.g. against the DENV1 reference, against the DENV3 reference which has a two amino acid deletion in E gene, etc).
the example data contain two sequences that don't align.
Thanks for flagging! I assume it's an example sequence with >GenBankID_? headers and not something with a serotype annotation (e.g. >GenBankID_DENV1). I expect them not to align, since they represent some sylvatic (wild and sparsely monitored) samples, but yes, we can explore "rescuing" them (or assigning them a new serotype) if that makes biological sense.
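To see quickly which example sequences carry a serotype annotation, the headers can be grouped by their suffix. A minimal sketch, assuming headers follow the >GenBankID_SUFFIX pattern described above, where an annotated suffix starts with DENV and anything else (such as ?) counts as unannotated. The function name and the sample records are made up for illustration.

```python
from collections import defaultdict


def group_headers_by_serotype(fasta_text):
    """Group FASTA headers by the serotype suffix after the last underscore.

    Hypothetical helper; assumes headers look like ">GenBankID_DENV1" or
    ">GenBankID_?". Headers without a DENV suffix are grouped as "unannotated".
    """
    groups = defaultdict(list)
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # skip sequence lines
        header = line[1:].strip()
        _, _, suffix = header.rpartition("_")
        key = suffix if suffix.startswith("DENV") else "unannotated"
        groups[key].append(header)
    return dict(groups)


# Made-up records; the IDs are placeholders, not real accessions.
sample = ">ID1_DENV1\nACGT\n>ID2_?\nACGT\n>ID3_DENV1\nACGT\n"
grouped = group_headers_by_serotype(sample)
print(grouped)
```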
@j23414 I think the reason Richard is asking about the unalignable or otherwise "broken" (from the point of view of Nextclade results) example sequences is that we had a situation with the SC2 dataset, where users came to us confused after trying Nextclade with example sequences and receiving error or warning messages. They thought they did something wrong or that there was a bug. So we have tried to keep examples nice and high quality since then.
On the one hand, a "broken" sequence might tell a story about some interesting science fact or just showcase how Nextclade software handles that particular situation technically, which is interesting. On the other hand, without context it might be unclear for the target audience. If you plan on keeping these samples, then perhaps you could explain the details in the readme. Alternatively, there might be a sciency solution, as you mentioned, to make them "good". Otherwise you could just delete the bad examples to avoid the trouble.
I see, thank you for the context! I agree that dropping them (unalignable example sequences) is the smoother path forward.
Dropped from the other PR https://github.com/nextstrain/nextclade_data/pull/208 in commit https://github.com/nextstrain/nextclade_data/pull/208/commits/932affc2260bdb1e5f4ef43c0c839877ecdcdafc
Since the "all" directory is being added in the other PR, I've omitted the "all" directory here in commit https://github.com/nextstrain/nextclade_data/pull/203/commits/88171f43aa3b92ac7d26de69d4a17934f3041d4d
Add a dengue dataset to Nextclade.