nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue

Add workflow for producing the Nextclade dengue dataset #25

Closed j23414 closed 3 months ago

j23414 commented 7 months ago

Description of proposed changes

Introduce a workflow dedicated to generating the Nextclade dataset for dengue serotypes and genotypes. The workflow lives in a dedicated nextclade folder, following the layout of pathogen-repo-guide/nextclade. Its purpose is to streamline dataset creation, testing, and debugging.

The changes can be summarized as follows:

  1. Establish a nextclade directory that follows pathogen-repo-guide/nextclade, starting from a copy of the Nextclade README in pathogen-repo-guide/nextclade.
  2. Copy the rules from the phylogenetic workflow, since most of the rules for generating the tree.json files should be the same.
  3. Modify the rules to deal with reference-root incongruence.
  4. Add files and rules to assemble the Nextclade dengue dataset (e.g. pathogen.json). These rules are copied from mpox.
  5. Connect rules that test the Nextclade dataset (currently failing), based on the mpox Nextclade test rule.
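As a rough illustration of step 5, the test command run for each dataset can be assembled as below. This is a minimal Python sketch, not the workflow's actual Snakemake rule. The `nextclade run` flags mirror the invocation recorded in the test logs later in this thread, while the helper name `nextclade_test_command` and the `SEROTYPES` list are assumptions made for the example.

```python
import shlex

# Dataset names as they appear in the dataset paths in this thread.
SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]


def nextclade_test_command(serotype, sequences="resources/all/sequences.fasta"):
    """Build the argument list for testing one assembled dataset.

    Hypothetical helper: the flags are copied from the logged invocation,
    but this function is not part of the workflow itself.
    """
    return [
        "nextclade", "run",
        "--input-dataset", f"datasets/{serotype}",
        "--output-all", f"test_output/{serotype}",
        "--silent",
        sequences,
    ]


if __name__ == "__main__":
    for serotype in SEROTYPES:
        # Print instead of executing, so the sketch runs without Nextclade installed.
        print(shlex.join(nextclade_test_command(serotype)))
```

Returning an argument list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting issues.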

Related issue(s)

Checklist

j23414 commented 3 months ago

This PR so far creates a Nextclade dataset, but testing the assembled dataset fails with the following errors:

nextstrain build nextclade test_output/all
view error for dengue/all -> fixed by C or M coordinates

```bash
[Sun May 26 05:59:38 2024]
rule test_dataset:
    input: datasets/all/tree.json, datasets/all/pathogen.json, resources/all/sequences.fasta, datasets/all/genome_annotation.gff3, datasets/all/README.md, datasets/all/CHANGELOG.md
    output: test_output/all
    jobid: 0
    reason: Missing output files: test_output/all
    wildcards: serotype=all
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T

nextclade run --input-dataset datasets/all --output-all test_output/all --silent resources/all/sequences.fasta
Error:
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000001
   2: Encountered a mutation (S2H) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'S', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'V'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:226

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
```
nextstrain build nextclade test_output/denv1
view error for dengue/denv1 -> Fixed by Jover 🥳

```bash
[Sun May 26 06:02:12 2024]
rule test_dataset:
    input: datasets/denv1/tree.json, datasets/denv1/pathogen.json, resources/all/sequences.fasta, datasets/denv1/genome_annotation.gff3, datasets/denv1/README.md, datasets/denv1/CHANGELOG.md
    output: test_output/denv1
    jobid: 0
    reason: Missing output files: test_output/denv1
    wildcards: serotype=denv1
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T

nextclade run --input-dataset datasets/denv1 --output-all test_output/denv1 --silent resources/all/sequences.fasta
Error:
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000043
   2: Encountered a mutation (T59A) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'T', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'Q'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:226

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
```
nextstrain build nextclade test_output/denv2
view error for dengue/denv2 - different error -> fixed by C or M coords

```bash
[Sun May 26 06:03:17 2024]
rule test_dataset:
    input: datasets/denv2/tree.json, datasets/denv2/pathogen.json, resources/all/sequences.fasta, datasets/denv2/genome_annotation.gff3, datasets/denv2/README.md, datasets/denv2/CHANGELOG.md
    output: test_output/denv2
    jobid: 0
    reason: Missing output files: test_output/denv2
    wildcards: serotype=denv2
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T

nextclade run --input-dataset datasets/denv2 --output-all test_output/denv2 --silent resources/all/sequences.fasta
The application panicked (crashed).
Message:  index out of bounds: the len is 100 but the index is 100
Location: packages/nextclade/src/tree/tree_preprocess.rs:213

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
```
nextstrain build nextclade test_output/denv3
view error for dengue/denv3 -> fixed by C or M coords

```bash
[Sun May 26 06:06:18 2024]
rule test_dataset:
    input: datasets/denv3/tree.json, datasets/denv3/pathogen.json, resources/all/sequences.fasta, datasets/denv3/genome_annotation.gff3, datasets/denv3/README.md, datasets/denv3/CHANGELOG.md
    output: test_output/denv3
    jobid: 0
    reason: Missing output files: test_output/denv3
    wildcards: serotype=denv3
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T

nextclade run --input-dataset datasets/denv3 --output-all test_output/denv3 --silent resources/all/sequences.fasta
Error:
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000005
   2: When preprocessing reference tree node NODE_0000005: amino acid mutation C:I108M is outside of the peptide C (length 100). This is likely an inconsistency between reference tree, reference sequence, and genome annotation in the Nextclade dataset

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:203

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
```
nextstrain build nextclade test_output/denv4
view error for dengue/denv4 -> fixed by C or M coords

```bash
[Sun May 26 06:10:37 2024]
rule test_dataset:
    input: datasets/denv4/tree.json, datasets/denv4/pathogen.json, resources/all/sequences.fasta, datasets/denv4/genome_annotation.gff3, datasets/denv4/README.md, datasets/denv4/CHANGELOG.md
    output: test_output/denv4
    jobid: 0
    reason: Missing output files: test_output/denv4
    wildcards: serotype=denv4
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T

nextclade run --input-dataset datasets/denv4 --output-all test_output/denv4 --silent resources/all/sequences.fasta
Error:
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000001
   2: Encountered a mutation (S2H) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'S', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'V'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:226

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
```
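At least two kinds of inconsistency appear in the errors above: amino acid mutations whose origin state disagrees with the state inherited from the reference sequence, and mutations positioned beyond the end of a peptide. Below is a minimal Python sketch of a pre-flight check for both. It assumes the tree follows the Auspice v2 JSON layout (`branch_attrs.mutations` keyed by gene name) and that the reference peptides are supplied as a plain dict. The function name `check_aa_mutations` and the toy data are hypothetical, and this is not Nextclade's implementation, only an approximation of the rule its error messages describe.

```python
import re

# Amino acid mutation such as "S2H": origin state, 1-based position, new state.
MUTATION_RE = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def check_aa_mutations(node, ref_peptides, inherited=None, problems=None):
    """Walk a tree depth-first and collect inconsistent aa mutations.

    node: dict in the (assumed) Auspice v2 layout, with optional
          "branch_attrs" -> "mutations" -> {gene: ["S2H", ...]} and "children".
    ref_peptides: {gene: amino acid string translated from the reference}.
    inherited: {(gene, position): state} for changes made on ancestor branches.
    Returns a list of (node name, gene, mutation, reason) tuples.
    """
    inherited = dict(inherited or {})  # copy, so sibling branches stay independent
    problems = [] if problems is None else problems
    name = node.get("name", "?")
    mutations = node.get("branch_attrs", {}).get("mutations", {})
    for gene, muts in mutations.items():
        if gene == "nuc":
            continue  # nucleotide mutations are out of scope for this sketch
        peptide = ref_peptides.get(gene)
        if peptide is None:
            problems.append((name, gene, None, "gene missing from reference"))
            continue
        for mut in muts:
            match = MUTATION_RE.match(mut)
            if match is None:
                problems.append((name, gene, mut, "unparseable"))
                continue
            origin, pos, new = match.group(1), int(match.group(2)), match.group(3)
            if not 1 <= pos <= len(peptide):
                problems.append((name, gene, mut, "outside peptide"))
                continue
            # With no earlier mutation at this position, the parental state
            # is taken from the reference peptide.
            expected = inherited.get((gene, pos), peptide[pos - 1])
            if origin != expected:
                problems.append((name, gene, mut, f"origin {origin}, parent {expected}"))
            inherited[(gene, pos)] = new
    for child in node.get("children", []):
        check_aa_mutations(child, ref_peptides, inherited, problems)
    return problems


if __name__ == "__main__":
    # Toy data only: a 100-residue stand-in for peptide C, not real dengue sequence.
    toy_reference = {"C": "MV" + "A" * 98}
    toy_tree = {
        "name": "root",
        "children": [
            {"name": "NODE_A", "branch_attrs": {"mutations": {"C": ["S2H"]}}},
            {"name": "NODE_B", "branch_attrs": {"mutations": {"C": ["I108M"]}}},
        ],
    }
    for problem in check_aa_mutations(toy_tree, toy_reference):
        print(problem)
```

Running a check of this kind on tree.json before assembling the dataset would surface every offending node at once, whereas the Nextclade run shown in the logs stops at the first one it meets.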

Because I am wary of inadvertently wandering off the path of acceptable solutions, it might be most helpful and efficient for someone with more experience to submit commits to this branch. We can then have a productive discussion based on those changes. Please feel free to message me if anyone wants a zipped folder (results.zip) of the intermediate files.

j23414 commented 3 months ago

After some discussion with a few people, I may move the commits that fine-tune the "dengue/all" dataset to a new draft PR, since we are still testing solutions.

This approach allows us to merge a functional workflow for assembling a Nextclade dataset, providing a base from which we can test different solutions. @joverlee521, this scoped PR is ready for review.

j23414 commented 3 months ago

> if we can just drop nextclade/datasets/

Yes, I wondered that as well, but I decided to keep it as a foundation for a "fine-tuning" PR, or for others who might want to create separate branches to explore different solutions from the existing dataset.

My plan is to delete this when https://github.com/nextstrain/nextclade_data/pull/203 is finalized and merged.