Closed jamessiqueirap closed 2 months ago
Hi @jamessiqueirap! Thanks!
I will let our science team review. It has been challenging for them to produce serotype datasets so far. Let's see what they say.
In the meantime a couple of technical nuances:
- Could you please create an additional level of directories so that the datasets for any pathogen are not directly in the community/ directory, as described here: https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-curation-guide.md#dataset-paths
> [...] We only ask to not submit datasets directly into the community/, to avoid clashes between datasets from different authors and organizations. [...]
- This is not mandatory, but it would be nice to have some example sequences for each dataset as sequences.fasta (and to declare them in the pathogen.json field "files" as "examples": "sequences.fasta"). This allows Nextclade users to quickly try a dataset and decide whether they like it. It also helps the Nextstrain team review the datasets and debug our software. So if you have some sequences with permissive licenses, please add them. Somewhere between 10 and 100 sequences should be perfect.
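As a rough illustration of the two requests above, here is a minimal Python sketch that checks whether a dataset path has at least one directory level between community/ and the dataset, and whether pathogen.json declares example sequences. The function names and the exact depth rule are assumptions made for illustration; they are not part of the Nextclade tooling.

```python
import json
from pathlib import PurePosixPath


def is_nested_community_path(dataset_path: str) -> bool:
    """Return True if a community dataset sits at least one level below community/.

    Assumption: 'community/<author-or-org>/<dataset...>' is acceptable,
    while 'community/<dataset>' is not.
    """
    parts = PurePosixPath(dataset_path).parts
    if not parts or parts[0] != "community":
        return True  # the rule only concerns community datasets
    return len(parts) >= 3


def declares_examples(pathogen_json_text: str) -> bool:
    """Return True if pathogen.json has a non-empty "examples" entry under "files"."""
    config = json.loads(pathogen_json_text)
    return bool(config.get("files", {}).get("examples"))
```

For example, `is_nested_community_path("community/v-gen-lab/dengue/denv1")` passes, whereas a dataset placed at `community/denv1` would be flagged.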
Hi @ivan-aksamentov, thank you very much for your valuable feedback and corrections!
I have already implemented all the suggested changes, including creating the additional level of directories and I have already included the example sequences, following your advice. I appreciate your guidance on this — it is really helpful.
Thanks again
I pushed the results of the rebuild script:
./scripts/rebuild --input-dir 'data/' --output-dir 'data_output/'
(This is normally done automatically, but we haven't figured out how to do this securely for third-party contributions yet. It needs to be rerun whenever the data/ directory changes.)
This allows the data_output/ directory to be used as a dataset server:
Here are the links to Nextclade with datasets preselected, for easier testing:
Hi @jamessiqueirap, thanks for contributing these. Very exciting!
The trees look good to me.
Two things I noticed:
- The private mutation QC parameters are too stringent. My usual rule of thumb is that the typical value should be similar to the average number of mutations on terminal branches of the reference tree. I normally set the cut-off value to 3 times the typical value.
- You have one dataset for each serotype, so you probably don't need the all-serotypes in the path. I'd say community/v-gen-lab/dengue-lineages/denv4 is preferable to community/v-gen-lab/dengue-lineages/all-serotypes/denv4. If you want, you can later add an all-serotypes tree alongside denv1, denv2, etc.
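The rule of thumb for the private mutation QC parameters can be sketched in Python as follows. This is only an illustration: it assumes the reference tree is an Auspice v2 JSON in which each node stores its nucleotide mutations under branch_attrs.mutations.nuc, and the function names are made up for this example. The keys typical and cutoff follow the wording used above.

```python
from statistics import mean


def terminal_branch_mutation_counts(root: dict) -> list[int]:
    """Count nucleotide mutations on every terminal branch of the tree.

    Assumes Auspice v2 nodes: an optional "children" list and
    "branch_attrs" -> "mutations" -> "nuc" holding mutation strings.
    """
    counts = []
    stack = [root]  # iterative traversal avoids recursion limits on deep trees
    while stack:
        node = stack.pop()
        children = node.get("children") or []
        if children:
            stack.extend(children)
        else:
            nuc = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
            counts.append(len(nuc))
    return counts


def suggest_private_mutation_qc(root: dict, cutoff_factor: float = 3.0) -> dict:
    """Apply the rule of thumb: typical = mean over tips, cutoff = 3 x typical."""
    typical = mean(terminal_branch_mutation_counts(root))
    return {"typical": typical, "cutoff": cutoff_factor * typical}
```

The root would typically come from `json.load(...)["tree"]` on a tree.json file; the resulting numbers are a starting point to be rounded and adjusted by hand.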
@rneher Thank you so much! I think this is a great approach, and I've already implemented it.
You can use these URLs to test directly, without having to wait for Ivan to rebuild:
Impressive work, great job making these datasets! A few thoughts and comments (not necessarily blocking release)
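General points
- You could shorten the path to dengue (i.e. removing the -lineages from v-gen-lab/dengue-lineages/denv1 to make it v-gen-lab/dengue/denv1) to keep in line with the pattern that we just mention the virus name and nothing else. Almost all datasets have lineages so this is redundant.
- One could remove pr from the genome annotation as it is a true subset of M, being in frame (i.e. it's just duplication of information)
- It might be nice to have the lineage colors be topologically ordered (using color ordering). They currently mostly are, but there are a few outliers, e.g. 1I is followed by 1II rather than by 1I_A etc.
- It might be nice to include collection country names in the strain names. If you download them from NCBI Virus, you can customize the strain name to include the collection country.
- It might be good to list the strain name of the reference under which the sequence is usually known. In this case, it seems to be 45AZ5 for denv1 and Thailand/16681/84 for denv2 - not sure if this rings a bell for anyone but it might
- It might be good to include the ref seq of each dataset in the example sequences so one can see where it falls on the tree.
- Some branches are really long, one might potentially want to exclude sequences on long branches as these are either sequencing/assembly errors or due to recombination.
DENV1:
- Lineage 1VI seems to be entirely missing, is this on purpose? Maybe it's gone out of business? Or is this a typo? In the preprint it mentions VI but not VII
DENV2:
- In DENV2, there's a very overdiverged sequence, probably best to exclude (likely artefact): OM744110.1|2021-11-16
- I can't find lineage/serotype 2I, is this on purpose?
Regarding recombination - I was wondering how the lineage system plans to deal with it. It's reportedly common in Dengue (e.g. https://www.pnas.org/doi/full/10.1073/pnas.96.13.7352, https://pubmed.ncbi.nlm.nih.gov/10331266/). I had a look at the lineage preprint but couldn't find mention of how to deal with recombination (I did a string search for recomb and only found 3 hits, all in reference to SARS-CoV-2). SARS-CoV-2 pango gives recombinants special names (e.g. XBB) - has it been discussed how recombination will be treated in the Dengue nomenclature? This might be something worth discussing in the paper, potentially.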
@corneliusroemer Thank you very much, I'm delighted to receive suggestions from you! The implementation process for dengue lineage nomenclature is a collaborative effort involving various research groups from different countries.
Regarding your suggestions:
> General points
> - You could shorten the path to dengue (i.e., removing the -lineages from v-gen-lab/dengue-lineages/denv1 to make it v-gen-lab/dengue/denv1) to keep in line with the pattern that we just mention the virus name and nothing else. Almost all datasets have lineages, so this is redundant.
> - One could remove pr from the genome annotation as it is a true subset of M, being in frame (i.e., it's just duplication of information).
> - It might be nice to have the lineage colors be topologically ordered (using color ordering). They currently mostly are, but there are a few outliers, e.g., 1I is followed by 1II rather than by 1I_A, etc.
> - It might be nice to include collection country names in the strain names. If you download them from NCBI Virus, you can customize the strain name to include the collection country.
> - It might be good to list the strain name of the reference under which the sequence is usually known. In this case, it seems to be 45AZ5 for DENV1 and Thailand/16681/84 for DENV2 - not sure if this rings a bell for anyone, but it might.
> - It might be good to include the ref seq of each dataset in the example sequences so one can see where it falls on the tree.
You are absolutely correct, and I will implement these changes in the dataset right away.
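One possible way to get topologically ordered lineage colors, sketched in Python under stated assumptions: the tree is an Auspice v2 JSON whose nodes carry the lineage under node_attrs.clade_membership.value, lineages are ordered by first appearance in a depth-first traversal, and colors are then taken at evenly spaced positions along an ordered palette. The function names and the palette are hypothetical; this is not the script referred to in this thread.

```python
def lineages_in_tree_order(root: dict, attr: str = "clade_membership") -> list[str]:
    """Return lineage labels ordered by first appearance in a pre-order traversal."""
    ordered = []
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        label = node.get("node_attrs", {}).get(attr, {}).get("value")
        if label is not None and label not in seen:
            seen.add(label)
            ordered.append(label)
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(node.get("children") or []))
    return ordered


def assign_colors(lineages: list[str], palette: list[str]) -> list[list[str]]:
    """Pair each lineage with a color taken at evenly spaced palette positions."""
    if not palette:
        raise ValueError("palette must contain at least one color")
    n, m = len(lineages), len(palette)
    pairs = []
    for i, lineage in enumerate(lineages):
        index = 0 if n == 1 else round(i * (m - 1) / (n - 1))
        pairs.append([lineage, palette[index]])
    return pairs
```

With this ordering, 1I_A (a descendant of 1I) is placed directly after 1I and before 1II, so neighbouring lineages in the tree receive neighbouring colors.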
> - Some branches are really long, one might potentially want to exclude sequences on long branches as these are either sequencing/assembly errors or due to recombination.
>
> DENV2:
> - In DENV2, there's a very overdiverged sequence, probably best to exclude (likely artefact): OM744110.1|2021-11-16
These trees only contain the representative topology of each designated lineage, and in some cases, certain branches may indeed appear very long. However, we will review these points with the other members of the project.
> - I can't find lineage/serotype 2I, is this on purpose?
Given the granularity of lineages, this label only appears when you enable the visualization of all lineages.
> DENV1:
> - Lineage 1VI seems to be entirely missing, is this on purpose? Maybe it's gone out of business? Or is this a typo? In the preprint it mentions VI but not VII
Regarding this, in the preprint, we used "1VI" to designate what we will now call "1VII" based on the suggestion of the scientific committee. Since there is already another genotype in the literature designated as "VI," we made this change to avoid any mismatch with the literature. We didn't include VI in the dataset as it's no longer in circulation.
> Regarding recombination - I was wondering how the lineage system plans to deal with it. It's reportedly common in Dengue (e.g. https://www.pnas.org/doi/full/10.1073/pnas.96.13.7352, https://pubmed.ncbi.nlm.nih.gov/10331266/). I had a look at the lineage preprint but couldn't find mention of how to deal with recombination (I did a string search for recomb and only found 3 hits, all in reference to SARS-CoV-2). SARS-CoV-2 pango gives recombinants special names (e.g. XBB) - has it been discussed how recombination will be treated in the Dengue nomenclature? This might be something worth discussing in the paper, potentially.
Regarding the nomenclature of recombinants, how they will be addressed is not yet fully defined. Nonetheless, I appreciate your concern, and I will bring this up for discussion with the project members.
If you have any questions on how to implement certain things do let me know! It's a pleasure to see others make datasets and I want to help as much as I can!
Regarding long branches, we usually exclude them as they are most likely sequencing errors or potentially recombination, both of which can mess up the tree. Removing them doesn't cause problems in lineage assignments.
I'd only include long branches if there's clear evidence they are real (most likely only recombinants).
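To find candidates for exclusion, one could list the tips whose terminal branch carries unusually many mutations. The Python sketch below assumes an Auspice v2 tree with mutations under branch_attrs.mutations.nuc; the function name and the threshold are hypothetical and would need to be chosen per dataset.

```python
def tips_on_long_branches(root: dict, max_mutations: int) -> list[str]:
    """Return names of tips whose terminal branch has more than max_mutations
    nucleotide mutations (assumed Auspice v2 layout)."""
    flagged = []
    stack = [root]
    while stack:
        node = stack.pop()
        children = node.get("children") or []
        if children:
            stack.extend(children)
            continue
        nuc = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
        if len(nuc) > max_mutations:
            flagged.append(node.get("name", ""))
    return sorted(flagged)
```

The flagged names are only candidates: each one still needs a manual look before being dropped, since a long branch can also be a genuine recombinant.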
@corneliusroemer I have implemented the changes you suggested. However, I couldn't find an efficient (and aesthetically pleasing) way to add country information directly to the sequence names. To address this, I opted to export the country information for each sample. Now, it's possible to both check the country of origin via shift + click and color the branches according to country information. Thank you so much once again!
I pushed a rebuild to assess how it works in Nextclade Web as a whole (index, examples, columns, tree, exports, autosuggestions etc.)
@jamessiqueirap There is a small defect in the tree.json files:
{
  "meta": {
    "extensions": {
      "nextclade": {
        "clade_node_attrs": [
          {
            "name": "clade_membership",
            "displayName": "Dengue Lineages (Nextclade)",
            "description": ""
          }
        ]
      }
    }
  }
}
The clade_membership attribute ("built-in" or "default" clades) is treated specially and does not need to be declared in clade_node_attrs, so this entire extensions object can be safely removed. Only additional clade-like node attributes (e.g. a competing second nomenclature) need to be declared there. With this object in place, Nextclade currently gets confused and creates an additional empty column Dengue Lineages (Nextclade) in the web app and an empty clade_membership column in output TSV files. Not critical, but it would be nice to remove.
I can easily remove it from the tree.json files here, but you probably want to remove it from your workflow repo as well. Let me know.
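For the workflow side, the cleanup could look roughly like the following Python sketch. It removes only the clade_membership entry from meta.extensions.nextclade.clade_node_attrs and drops the containers that become empty, leaving any other declared attributes in place. The function name is made up for this example, and the structure is assumed to match the snippet shown above.

```python
def strip_builtin_clade_attr(tree: dict) -> dict:
    """Remove a redundant clade_membership entry from
    meta.extensions.nextclade.clade_node_attrs, in place.

    Containers left empty by the removal are deleted too.
    """
    meta = tree.get("meta", {})
    extensions = meta.get("extensions", {})
    nextclade = extensions.get("nextclade", {})
    attrs = nextclade.get("clade_node_attrs")
    if attrs is None:
        return tree  # nothing declared, nothing to do

    kept = [attr for attr in attrs if attr.get("name") != "clade_membership"]
    if kept:
        nextclade["clade_node_attrs"] = kept
        return tree

    del nextclade["clade_node_attrs"]
    if not nextclade:
        del extensions["nextclade"]
    if not extensions:
        del meta["extensions"]
    return tree
```

A typical use would be to `json.load` each tree.json, pass it through this function, and write it back with `json.dump`.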
@ivan-aksamentov Thank you very much for the observation; this was part of some old tests we were doing, but I've already corrected all the files! Let's try again haha
@jamessiqueirap Done!
This looks good to me. I don't have any more technical recommendations. If the science team has no other comments, then this is ready to be merged and released. And we can of course release follow-up updates and fixes any time if needed.
Excellent stuff @jamessiqueirap, really great job!
Here's a second round of comments, please don't see them as criticism :) They don't need to be implemented, they are suggestions, you can also work on them later if you like after release:
[...] pr gene in the genome annotation which is just a subset of prM - I'd probably remove it, you can call the result prM
I think that's it for now - let me know if you have any questions.
@corneliusroemer thank you so much for your suggestions—I truly appreciate it! I went ahead and implemented all the changes right away. I'm really excited about this project and can hardly contain my enthusiasm! 😄
I did run into one issue with the coloration of certain branches. Even though I followed the script's color assignment flow that you recommended, it seems like Nextclade is mixing up some of the tones when exporting the colors.
For example, branches 1I_K and 1I_K.1 were assigned the colors #5098B9 and #539CB3, which should be in the blue palette, but they’re showing up as shades of red in the exported tree. I'm a bit puzzled by this...
> I'm really excited about this project and can hardly contain my enthusiasm! 😄

That's great to hear :) If you enjoy making Nextclade datasets, there are many viruses left that people would love having datasets for! For example Chikungunya, I've heard!

> I did run into one issue with the coloration of certain branches. Even though I followed the script's color assignment flow that you recommended, it seems like Nextclade is mixing up some of the tones when exporting the colors.
> For example, branches 1I_K and 1I_K.1 were assigned the colors #5098B9 and #539CB3, which should be in the blue palette, but they’re showing up as shades of red in the exported tree. I'm a bit puzzled by this...

I'd be more than happy to have a look at your workflow. Where does it live? I tried the repo mentioned in the README but that seems to be just a folder with the workflow outputs, not the source code of the workflow that makes the datasets.
> That's great to hear :) If you enjoy making Nextclade datasets, there are many viruses left that people would love having datasets for! For example Chikungunya, I've heard!

@corneliusroemer Thank you! My group is actually very interested in developing something for Chikungunya, especially since the center where I'm working on my doctorate may soon start monitoring this pathogen. We're excited about the possibilities!

> I'd be more than happy to have a look at your workflow. Where does it live? I tried the repo mentioned in the README but that seems to be just a folder with the workflow outputs, not the source code of the workflow that makes the datasets.

You're absolutely right: the links currently lead to the outputs because, to be honest, I'm still learning a lot about the tool. Everything I've done so far has been a bit of a manual, brute-force effort, hahaha! But I'm working on getting more organized!
Great, let me know if you want to make CHIKV or similar, I'm happy to help!
Regarding workflow organization, I recommend you look into snakemake. It's how I make all my workflows. See e.g. the mpox one: https://github.com/nextstrain/mpox/tree/master/nextclade
> Great, let me know if you want to make CHIKV or similar, I'm happy to help!

@corneliusroemer Thanks a lot! We’ll definitely get in touch when we start on CHIKV or something similar.

> Regarding workflow organization, I recommend you look into snakemake. It's how I make all my workflows. See e.g. the mpox one: https://github.com/nextstrain/mpox/tree/master/nextclade

I wanted to let you know that your m-pox folder was incredibly helpful in showing me how to set up my workflow. Take a look!!
The process kicks off after the trees are created because, as I mentioned before, it was built using representative trees from a lot more samples, generated by another team member.
> I wanted to let you know that your m-pox folder was incredibly helpful in showing me how to set up my workflow. Take a look!!

It's amazing! I'll have a thorough look once I have a little more free time, I'm super busy with https://pathoplexus.org/ at the moment (that contributed to the delay, sorry for that!)

> The process kicks off after the trees are created because, as I mentioned before, it was built using representative trees from a lot more samples, generated by another team member.

Great! That's good to know.
I will merge your PR and release - congratulations on this fantastic work! Please open a PR/issue or write an email if you'd like to contribute datasets for other pathogens.
@corneliusroemer Thank you so much! Do you have any idea when the dataset might be released on the Nextclade site? We’re wrapping up the manuscript submission, and it would be fantastic if we could include the announcement in the final version. 😊
Additionally, I'm excited to collaborate on creating new datasets for other pathogens!
I will release now, it's on master already: master.clades.nextstrain.org - but release will follow in <1hr - thanks for the reminder I got interrupted by some urgent pathoplexus.org thing again.
@jamessiqueirap here we go - it's released on release branch - will take another 5min until you can see it on clades.nextstrain.org - but it will be there soon. Ping me if not!
https://github.com/nextstrain/nextclade_data/releases/tag/2024-08-31--20-44-06Z
@jamessiqueirap it's live on clades.nextstrain.org
For the manuscript, you can provide a link that automatically selects the right dataset like this:
https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv1
https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv2
https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv3
https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv4
Try it out: https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv1
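If the links need to be generated for several datasets (e.g. for a table in the manuscript), a small Python helper along these lines would do. The dataset-name query parameter is taken from the links above; the function name is made up for this example.

```python
from urllib.parse import urlencode


def nextclade_dataset_url(dataset_name: str, base_url: str = "https://clades.nextstrain.org/") -> str:
    """Build a Nextclade Web link that preselects the given dataset."""
    # Keep "/" unescaped so the link matches the readable form used above.
    query = urlencode({"dataset-name": dataset_name}, safe="/")
    return f"{base_url}?{query}"


for serotype in ("denv1", "denv2", "denv3", "denv4"):
    print(nextclade_dataset_url(f"community/v-gen-lab/dengue/{serotype}"))
```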
@corneliusroemer Awesome! thank you!
No, thank you 😄
These datasets are based on the dengue virus lineage systems described by Verity et al., 2024, and are suitable for the analysis of viral sequences from the four dengue virus serotypes.