Add dengue lineages dataset

jamessiqueirap commented 2 months ago

These datasets are based on the dengue virus lineage systems described by Verity et al., 2024, and are suitable for the analysis of viral sequences from the four dengue virus serotypes.

ivan-aksamentov commented 2 months ago

Hi @jamessiqueirap! Thanks!

I will let our science team review. It has been challenging for them to produce serotype datasets so far. Let's see what they say.

In the meantime a couple of technical nuances:

Could you please create an additional level of directories to make sure the datasets for any pathogens are not directly in the community/ directory, as described here: https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-curation-guide.md#dataset-paths

[...] We only ask to not submit datasets directly into the community/, to avoid clashes between datasets from different authors and organizations. [...]
This is not mandatory, but it would be nice to have some example sequences for each dataset as sequences.fasta (and to declare them in the pathogen.json field "files" as "examples": "sequences.fasta"). This allows Nextclade users to quickly try a dataset and decide whether they like it. It is also helpful for the Nextstrain team to review the datasets and to debug our software. So if you have some sequences with permissive licenses, please add them. Somewhere between 10 and 100 sequences should be perfect.
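A minimal sketch of declaring the example sequences, assuming pathogen.json is a plain JSON object with a top-level "files" object as described above. The helper name `declare_examples` is hypothetical and not part of any Nextclade tooling.

```python
import json
from pathlib import Path


def declare_examples(pathogen_json: Path, examples_name: str = "sequences.fasta") -> dict:
    """Add an "examples" entry to the "files" object of a pathogen.json file."""
    config = json.loads(pathogen_json.read_text())
    # Create "files" if it is missing, then point "examples" at the FASTA file.
    config.setdefault("files", {})["examples"] = examples_name
    pathogen_json.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

It would be called once per dataset directory, for example `declare_examples(Path("denv1/pathogen.json"))`, where the path is only illustrative.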

jamessiqueirap commented 2 months ago

Hi @ivan-aksamentov, thank you very much for your valuable feedback and corrections!

I have implemented all the suggested changes, including creating the additional level of directories and adding the example sequences, following your advice. I appreciate your guidance on this; it is really helpful.

Thanks again

ivan-aksamentov commented 2 months ago

I pushed results of the rebuild script:

./scripts/rebuild --input-dir 'data/' --output-dir 'data_output/'

(This is normally done automatically, but we haven't figured out how to do this securely for third-party contributions yet. This needs to be rerun if there are changes to the data/ directory.)

This allows using data_output/ as a dataset server:

https://clades.nextstrain.org/?dataset-server=gh:jamessiqueirap/dengue-lineages-dataset@master@/data_output

Here are the links to Nextclade with datasets preselected, for easier testing:

rneher commented 2 months ago

Hi @jamessiqueirap , thanks for contributing these. Very exciting!

The trees look good to me.

Two things I noticed:

the private mutation QC parameters are too stringent. My usual rule of thumb is that the typical value should be similar to the average number of mutations on terminal branches of the reference tree. I normally set the cut-off value to 3 times the typical value.
you have one dataset for each serotype, so you probably don't need the all-serotypes in the path. I'd say community/v-gen-lab/dengue-lineages/denv4 is preferable to community/v-gen-lab/dengue-lineages/all-serotypes/denv4. If you want, you can later add an all-serotypes tree alongside denv1, denv2, etc.
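The rule of thumb above can be sketched as follows. This assumes the reference tree is an Auspice v2 JSON whose nodes carry `children` and `branch_attrs.mutations.nuc`. The function names are hypothetical, and the returned keys `typical` and `cutoff` simply mirror the wording of the comment; how they map onto the QC section of pathogen.json is an assumption to check against the dataset curation guide.

```python
from statistics import mean


def terminal_mutation_counts(node: dict) -> list[int]:
    """Collect the number of nucleotide mutations on each terminal branch."""
    children = node.get("children", [])
    if not children:
        # A node without children is a leaf, so its branch is terminal.
        muts = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
        return [len(muts)]
    counts: list[int] = []
    for child in children:
        counts.extend(terminal_mutation_counts(child))
    return counts


def suggest_private_mutation_qc(tree_root: dict) -> dict:
    """Typical = mean terminal-branch mutation count; cutoff = 3 x typical."""
    typical = max(1, round(mean(terminal_mutation_counts(tree_root))))
    return {"typical": typical, "cutoff": 3 * typical}
```

Here `tree_root` would be the `"tree"` object loaded from tree.json. The recursion is fine for trees of modest depth; a very deep tree may need an iterative traversal instead.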

jamessiqueirap commented 2 months ago

Hi @jamessiqueirap , thanks for contributing these. Very exciting!

The trees look good to me.

Two things I noticed:

the private mutation QC parameters are too stringent. My usual rule of thumb is that the typical value should be similar to the average number of mutations on terminal branches of the reference tree. I normally set the cut-off value to 3 times the typical value.

you have one dataset for each serotype, so you probably don't need the all-serotypes in the path. I'd say community/v-gen-lab/dengue-lineages/denv4 is preferable to community/v-gen-lab/dengue-lineages/all-serotypes/denv4. If you want, you can later add an all-serotypes tree alongside denv1, denv2, etc.

@rneher Thank you so much! I think this is a great approach, and I've already implemented it.

corneliusroemer commented 2 months ago

You can use these URLs to test directly, without having to wait for Ivan to rebuild:

corneliusroemer commented 2 months ago

Impressive work, great job making these datasets! A few thoughts and comments (not necessarily blocking release)

General points

You could shorten the path to dengue (i.e. removing the -lineages from v-gen-lab/dengue-lineages/denv1 to make it v-gen-lab/dengue/denv1) to keep in line with the pattern that we just mention the virus name and nothing else. Almost all datasets have lineages so this is redundant.
One could remove pr from the genome annotation as it is a true subset of M, being in frame (i.e. it's just duplication of information)
It might be nice to have the lineage colors be topologically ordered (using color ordering). They currently mostly are, but there are a few outliers, e.g. 1I is followed by 1II rather than by 1I_A etc.
It might be nice to include collection country names in the strain names. If you download them from NCBI Virus, you can customize the strain name to include the collection country.
Some branches are really long, one might potentially want to exclude sequences on long branches as these are either sequencing/assembly errors or due to recombination.
It might be good to list the strain name of the reference under which the sequence is usually known. In this case it seems to be 45AZ5 for denv1 and Thailand/16681/84 for denv2 - not sure if this rings a bell for anyone but it might
It might be good to include the ref seq of each dataset in the example sequences, so one can see where it falls on the tree.
DENV1:
- Lineage 1VI seems to be entirely missing, is this on purpose? Maybe it's gone out of business? Or is this a typo? In the preprint it mentions VI but not VII
DENV2:
- In DENV2, there's a very overdiverged sequence, probably best to exclude (likely artefact): OM744110.1|2021-11-16
- I can't find lineage/serotype 2I, is this on purpose?

Regarding recombination - I was wondering how the lineage system plans to deal with it. It's reportedly common in Dengue (e.g. https://www.pnas.org/doi/full/10.1073/pnas.96.13.7352, https://pubmed.ncbi.nlm.nih.gov/10331266/). I had a look at the lineage preprint but couldn't find mention of how to deal with recombination (I did a string search for recomb and only found 3 hits, all in reference to SARS-CoV-2). SARS-CoV-2 pango gives recombinants special names (e.g. XBB) - has it been discussed how recombination will be treated in the Dengue nomenclature? This might be something worth discussing in the paper, potentially.

jamessiqueirap commented 2 months ago

@corneliusroemer Thank you very much, I'm delighted to receive suggestions from you! The implementation process for dengue lineage nomenclature is a collaborative effort involving various research groups from different countries.

Regarding your suggestions:

General points

You could shorten the path to dengue (i.e., removing the -lineages from v-gen-lab/dengue-lineages/denv1 to make it v-gen-lab/dengue/denv1) to keep in line with the pattern that we just mention the virus name and nothing else. Almost all datasets have lineages, so this is redundant.

One could remove pr from the genome annotation as it is a true subset of M, being in frame (i.e., it's just duplication of information).

It might be nice to have the lineage colors be topologically ordered (using color ordering). They currently mostly are, but there are a few outliers, e.g., 1I is followed by 1II rather than by 1I_A, etc.

It might be nice to include collection country names in the strain names. If you download them from NCBI Virus, you can customize the strain name to include the collection country.

It might be good to list the strain name of the reference under which the sequence is usually known. In this case, it seems to be 45AZ5 for DENV1 and Thailand/16681/84 for DENV2 - not sure if this rings a bell for anyone, but it might.

It might be good to include the ref seq of each dataset in the example sequences so one can see where it falls on the tree.

You are absolutely correct, and I will implement these changes in the dataset right away.

Some branches are really long, one might potentially want to exclude sequences on long branches as these are either sequencing/assembly errors or due to recombination.

DENV2:

In DENV2, there's a very overdiverged sequence, probably best to exclude (likely artefact): OM744110.1|2021-11-16

These trees only contain the representative topology of each designated lineage, and in some cases, certain branches may indeed appear very long. However, we will review these points with the other members of the project.

I can't find lineage/serotype 2I, is this on purpose?

Given the granularity of lineages, this label only appears when you enable the visualization of all lineages. [Screenshot: Screenshot_20240814_133627]

DENV1:

Lineage 1VI seems to be entirely missing, is this on purpose? Maybe it's gone out of business? Or is this a typo? In the preprint it mentions VI but not VII

Regarding this, in the preprint, we used "1VI" to designate what we will now call "1VII" based on the suggestion of the scientific committee. Since there is already another genotype in the literature designated as "VI," we made this change to avoid any mismatch with the literature. We didn't include VI in the dataset as it's no longer in circulation.

Regarding recombination - I was wondering how the lineage system plans to deal with it. It's reportedly common in Dengue (e.g. https://www.pnas.org/doi/full/10.1073/pnas.96.13.7352, https://pubmed.ncbi.nlm.nih.gov/10331266/). I had a look at the lineage preprint but couldn't find mention of how to deal with recombination (I did a string search for recomb and only found 3 hits, all in reference to SARS-CoV-2). SARS-CoV-2 pango gives recombinants special names (e.g. XBB) - has it been discussed how recombination will be treated in the Dengue nomenclature? This might be something worth discussing in the paper, potentially.

Regarding the nomenclature of recombinants, how they will be addressed is not yet fully defined. Nonetheless, I appreciate your concern, and I will bring this up for discussion with the project members.

corneliusroemer commented 2 months ago

If you have any questions on how to implement certain things do let me know! It's a pleasure to see others make datasets and I want to help as much as I can!

Regarding long branches, we usually exclude them as they are most likely sequencing errors or potentially recombination, both of which can mess up the tree. Removing them doesn't cause problems in lineage assignments.

I'd only include long branches if there's clear evidence they are real (most likely only recombinants).
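One possible way to find candidates for exclusion, sketched under the assumption that the tree is an Auspice v2 JSON with `name`, `children` and `branch_attrs.mutations.nuc` on each node. The function name and the `max_mutations` threshold are hypothetical; the thread does not prescribe a specific cut-off, and flagged sequences would still need manual review.

```python
def long_terminal_branches(tree_root: dict, max_mutations: int) -> list[tuple[str, int]]:
    """Return (leaf name, mutation count) for terminal branches above a threshold."""
    flagged: list[tuple[str, int]] = []
    stack = [tree_root]
    while stack:
        node = stack.pop()
        children = node.get("children", [])
        if children:
            # Internal node: keep descending, only leaves are inspected.
            stack.extend(children)
            continue
        n_muts = len(node.get("branch_attrs", {}).get("mutations", {}).get("nuc", []))
        if n_muts > max_mutations:
            flagged.append((node.get("name", ""), n_muts))
    # Longest branches first, so the most suspicious sequences are reviewed first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```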

jamessiqueirap commented 2 months ago

@corneliusroemer I have implemented the changes you suggested. However, I couldn't find an efficient (and aesthetically pleasing) way to add country information directly to the sequence names. To address this, I opted to export the country information for each sample. Now, it's possible to both check the country of origin via shift + click and color the branches according to country information. Thank you so much once again!

ivan-aksamentov commented 2 months ago

I pushed a rebuild to assess how it works in Nextclade Web as a whole (index, examples, columns, tree, exports, autosuggestions etc.)

@jamessiqueirap There is a small defect in the tree.json files:

{
  "meta": {
    "extensions": {
      "nextclade": {
        "clade_node_attrs": [
          {
            "name": "clade_membership",
            "displayName": "Dengue Lineages (Nextclade)",
            "description": ""
          }
        ]
      }
    }
  }
}

The clade_membership ("built-in" or "default" clades) attribute is treated specially and does not need to be declared in the clade_node_attrs. So this entire extensions object can be safely removed. Only additional clade-like node attributes (e.g. a competing second nomenclature) need to be declared there. With this object in place, Nextclade is currently confused and creates an additional empty column Dengue Lineages (Nextclade) in web and an empty clade_membership in output TSV files. Not critical, but it would be nice to remove.

I can easily remove it from the tree.json files here, but you probably want to remove it from your workflow repo as well. Let me know.
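A minimal sketch of that cleanup for a workflow, assuming the tree.json has already been loaded into a Python dict with the layout shown above. The function name is hypothetical. It drops only the `clade_membership` declaration and keeps any other clade-like attributes that are declared alongside it.

```python
def strip_builtin_clade_attr(tree: dict) -> dict:
    """Remove the redundant clade_membership entry from clade_node_attrs (in place)."""
    meta = tree.get("meta", {})
    extensions = meta.get("extensions", {})
    nextclade = extensions.get("nextclade", {})
    attrs = nextclade.get("clade_node_attrs")
    if attrs is None:
        return tree  # Nothing declared, nothing to clean up.

    kept = [attr for attr in attrs if attr.get("name") != "clade_membership"]
    if kept:
        nextclade["clade_node_attrs"] = kept
    else:
        # Remove objects that became empty so no stray "extensions" stub remains.
        del nextclade["clade_node_attrs"]
        if not nextclade:
            del extensions["nextclade"]
        if not extensions:
            del meta["extensions"]
    return tree
```

The result can then be written back with `json.dump`. Pruning the emptied parent objects is a design choice here, made so the output matches a tree that never had the declaration.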

jamessiqueirap commented 2 months ago

@ivan-aksamentov Thank you very much for the observation; this was part of some old tests we were doing, but I've already corrected all the files! Let's try again haha

ivan-aksamentov commented 2 months ago

@jamessiqueirap Done!

This looks good to me. I don't have any more technical recommendations. If the science team has no other comments, then this is ready to be merged and released. And we can of course release follow-up updates and fixes at any time if needed.

corneliusroemer commented 2 months ago

Excellent stuff @jamessiqueirap, really great job!

Here's a second round of comments, please don't see them as criticism :) They don't need to be implemented, they are suggestions, you can also work on them later if you like after release:

Strain name of ref 4 is "rDEN4" I think - it might not be such a great reference if it's a recombinant clone that was a vaccine candidate. But that's maybe for Eneida Hatcher (don't know her Github account name), as you're just using the official refseq - so that's fine, maybe they could add another one that's more typical.
You could add your affiliation to the README, i.e. your lab/uni
Typo in readme: , also you can add a line break there before the "For bugs" by adding an extra blank line in the markdown, or by ending the previous line with two spaces
You could enable the cluster QC metric - but not necessary
The color scale is a bit random - you could use color ordering the way we do in most nextstrain workflows, see:
- https://github.com/nextstrain/mpox/blob/2ce0d9284ccc8cf9b06e8094c7fa28c8f9d85771/nextclade/Snakefile#L423-L437
- https://github.com/nextstrain/mpox/blob/master/nextclade/scripts/assign-colors.py
You can remove the "lineages" from the dataset name, currently it's "DENV-2 lineages", we don't need to say that there are lineages in there as there's no other dataset without lineages
There's still the kind of unnecessary pr gene in the genome annotation which is just a subset of prM - I'd probably remove it, you can call the result prM

I think that's it for now - let me know if you have any questions.

jamessiqueirap commented 2 months ago

@corneliusroemer thank you so much for your suggestions—I truly appreciate it! I went ahead and implemented all the changes right away. I'm really excited about this project and can hardly contain my enthusiasm! 😄

I did run into one issue with the coloration of certain branches. Even though I followed the script's color assignment flow that you recommended, it seems like Nextclade is mixing up some of the tones when exporting the colors.

For example, branches 1I_K and 1I_K.1 were assigned the colors #5098B9 and #539CB3, which should be in the blue palette, but they’re showing up as shades of red in the exported tree. I'm a bit puzzled by this...

corneliusroemer commented 2 months ago

I'm really excited about this project and can hardly contain my enthusiasm! 😄

That's great to hear :) If you enjoy making Nextclade datasets, there are many viruses left that people would love having datasets for! For example Chikungunya, I've heard!

I did run into one issue with the coloration of certain branches. Even though I followed the script's color assignment flow that you recommended, it seems like Nextclade is mixing up some of the tones when exporting the colors.

For example, branches 1I_K and 1I_K.1 were assigned the colors #5098B9 and #539CB3, which should be in the blue palette, but they’re showing up as shades of red in the exported tree. I'm a bit puzzled by this...

I'd be more than happy to have a look at your workflow. Where does it live? I tried the repo mentioned in the README but that seems to be just a folder with the workflow outputs, not the source code of the workflow that makes the datasets.

jamessiqueirap commented 2 months ago

That's great to hear :) If you enjoy making Nextclade datasets, there are many viruses left that people would love having datasets for! For example Chikungunya, I've heard!

@corneliusroemer Thank you! My group is actually very interested in developing something for Chikungunya, especially since the center where I'm working on my doctorate may soon start monitoring this pathogen. We're excited about the possibilities!

I'd be more than happy to have a look at your workflow. Where does it live? I tried the repo mentioned in the README but that seems to be just a folder with the workflow outputs, not the source code of the workflow that makes the datasets.

You're absolutely right—the links currently lead to the outputs because, to be honest, I'm still learning a lot about the tool. Everything I've done so far has been a bit of a manual, brute-force effort, hahaha! But I'm working on getting more organized!

corneliusroemer commented 2 months ago

Great, let me know if you want to make CHIKV or similar, I'm happy to help!

Regarding workflow organization, I recommend you look into snakemake. It's how I make all my workflows. See e.g. the mpox one: https://github.com/nextstrain/mpox/tree/master/nextclade

jamessiqueirap commented 2 months ago

Great, let me know if you want to make CHIKV or similar, I'm happy to help!

@corneliusroemer Thanks a lot! We’ll definitely get in touch when we start on CHIKV or something similar.

Regarding workflow organization, I recommend you look into snakemake. It's how I make all my workflows. See e.g. the mpox one: https://github.com/nextstrain/mpox/tree/master/nextclade

I wanted to let you know that your mpox folder was incredibly helpful in showing me how to set up my workflow. Take a look!!

The process kicks off after the trees are created because, as I mentioned before, it was built using representative trees from a lot more samples, generated by another team member.

corneliusroemer commented 2 months ago

I wanted to let you know that your mpox folder was incredibly helpful in showing me how to set up my workflow. Take a look!!

It's amazing! I'll have a thorough look once I have a little more free time, I'm super busy with https://pathoplexus.org/ at the moment (that contributed to the delay, sorry for that!)

The process kicks off after the trees are created because, as I mentioned before, it was built using representative trees from a lot more samples, generated by another team member.

Great! That's good to know.

I will merge your PR and release - congratulations on this fantastic work! Please open a PR/issue or write an email if you'd like to contribute datasets for other pathogens.

jamessiqueirap commented 2 months ago

@corneliusroemer Thank you so much! Do you have any idea when the dataset might be released on the Nextclade site? We’re wrapping up the manuscript submission, and it would be fantastic if we could include the announcement in the final version. 😊

Additionally, I'm excited to collaborate on creating new datasets for other pathogens!

corneliusroemer commented 2 months ago

I will release now. It's on master already (master.clades.nextstrain.org), and the release will follow in <1hr. Thanks for the reminder; I got interrupted by some urgent pathoplexus.org thing again.

corneliusroemer commented 2 months ago

@jamessiqueirap here we go - it's released on release branch - will take another 5min until you can see it on clades.nextstrain.org - but it will be there soon. Ping me if not!

https://github.com/nextstrain/nextclade_data/releases/tag/2024-08-31--20-44-06Z

corneliusroemer commented 2 months ago

@jamessiqueirap it's live on clades.nextstrain.org

For the manuscript, you can provide a link that automatically selects the right dataset like this:

https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv1
https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv2
https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv3
https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv4

Try it out: https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv1
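If the links need to be generated rather than typed, a small sketch along these lines would do. It relies only on the `dataset-name` query parameter and the dataset paths shown above; the helper name `dataset_link` is hypothetical.

```python
from urllib.parse import quote

NEXTCLADE_WEB = "https://clades.nextstrain.org/"


def dataset_link(dataset_name: str) -> str:
    """Build a Nextclade Web URL that preselects the given dataset."""
    # Keep "/" unescaped so the link matches the form used in the thread.
    return f"{NEXTCLADE_WEB}?dataset-name={quote(dataset_name, safe='/')}"


# One link per serotype dataset, denv1 through denv4.
links = [dataset_link(f"community/v-gen-lab/dengue/denv{n}") for n in range(1, 5)]
```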

jamessiqueirap commented 2 months ago

@corneliusroemer Awesome! thank you!

corneliusroemer commented 2 months ago

No, thank you 😄

nextstrain / nextclade_data

Add dengue lineages dataset #223
