nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue

Copy ingest #13

Closed j23414 closed 9 months ago

j23414 commented 11 months ago

Description of proposed changes

This PR splits the dengue ingest changes from PR #6 into smaller, more focused pull requests. This is especially worthwhile now that linking reusable pathogen ingest scripts has improved significantly.

The primary scope of this PR includes:

  1. Copying the Monkeypox ingest directory.
  2. Using git subtree to copy and link the reusable scripts in an ingest/vendored subdirectory.
  3. Temporarily removing Nextclade-related rules, pending the compilation of a Nextclade dengue dataset and potential v3 changes.
  4. Pulling and processing one "sequences.fasta" and "metadata.tsv" file pair.
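The git subtree step in item 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not necessarily the exact commands used in this PR: the upstream URL `https://github.com/nextstrain/ingest` is assumed, and `run`, `vendor_add`, `vendor_update`, and the `DRY_RUN` switch are hypothetical helpers added so the commands can be previewed without touching the repository.

```shell
#!/bin/sh
# Assumption: the shared ingest scripts live at this URL; override if not.
INGEST_REPO="${INGEST_REPO:-https://github.com/nextstrain/ingest}"
# Subdirectory named in the PR description.
VENDOR_PREFIX="${VENDOR_PREFIX:-ingest/vendored}"

# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# First-time copy: squash the upstream history into one commit under the prefix.
# git subtree must be invoked from the top level of the working tree.
vendor_add() {
    run git subtree add --prefix "$VENDOR_PREFIX" "$INGEST_REPO" HEAD --squash
}

# Later updates: pull newer upstream commits into the same prefix.
vendor_update() {
    run git subtree pull --prefix "$VENDOR_PREFIX" "$INGEST_REPO" HEAD --squash
}
```

Running `DRY_RUN=1; vendor_add` from the repository root prints the command without executing it. Squashing keeps the vendored history to a single commit per update, which tends to make such changes easier to review.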

To ease development and review, tasks that are not part of this PR but will be submitted as future PRs include:

  1. Splitting the fetch process into dengue serotypes (denv1-4).
  2. Adding dengue-specific annotations and data fixes.
  3. Integrating Nextclade-related rules and datasets.

Related issue(s)

Checklist

git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout cp_ingest
cd ingest
nextstrain build . data/metadata.tsv data/sequences.fasta
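
After the build finishes, a quick sanity check on the output pair can help a reviewer confirm the run. The sketch below is not part of the PR; the function names are hypothetical, and it assumes one metadata row per sequence with a single header line in the TSV, which may not hold for every ingest.

```shell
#!/bin/sh
# Count FASTA records: every record starts with a '>' header line.
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c '^>' "$1" || true
}

# Count non-empty metadata rows, skipping the single header line.
count_metadata_rows() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Report both counts; succeed only when they match.
# Assumption: the ingest emits exactly one metadata row per sequence.
check_pair() {
    seqs=$(count_fasta_records "$1")
    rows=$(count_metadata_rows "$2")
    echo "sequences=$seqs metadata_rows=$rows"
    [ "$seqs" -eq "$rows" ]
}
```

From the ingest directory this could be called as `check_pair data/sequences.fasta data/metadata.tsv`.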

j23414 commented 10 months ago

> I had to install ncbi-datasets-cli

Ah, I see what happened. It appears we switched from NCBI Virus to NCBI Datasets on Sept 11, 2023, so the ingest READMEs here and in Monkeypox need to be updated. I'll update the Monkeypox ingest README first, then rebase this PR. Additionally, I'll submit a PR to update the ambient documentation.
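
Until the READMEs are updated, a small pre-flight check can catch a missing tool before the build starts. This is only a sketch: `missing_tools` is a hypothetical helper, and it assumes the ncbi-datasets-cli package puts an executable named `datasets` on the PATH.

```shell
#!/bin/sh
# Print each named tool that is not on PATH; return non-zero if any are missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example (assumed binary name): missing_tools datasets || echo "install ncbi-datasets-cli first"
```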

> Error: Internal error (invalid zip archive)

I've encountered this error during testing in the Monkeypox repository as well. It seems to be a known intermittent issue with ncbi datasets. To mitigate it, the Snakemake rule has a retries: 5 directive, so a failed download is re-attempted automatically.
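
When running the download by hand outside Snakemake, the same idea can be approximated with a small retry loop. The sketch below is not the workflow's implementation: `retry` and `RETRY_DELAY` are hypothetical names, and it mirrors the retries semantics only loosely (one initial attempt plus up to N retries).

```shell
#!/bin/sh
# Hypothetical helper: run a command, retrying up to $1 more times on failure.
# Usage: retry MAX_RETRIES command [args...]
retry() {
    max_retries=$1
    shift
    attempt=0
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_retries" ]; then
            echo "giving up after $((attempt + 1)) attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        echo "attempt $attempt failed, retrying: $*" >&2
        # Pause between attempts; RETRY_DELAY (seconds) is an assumed knob.
        sleep "${RETRY_DELAY:-1}"
    done
}
```

For example, `retry 5 <download command>` would make at most six attempts in total before giving up.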

(For documentation: Copying and pasting the mpx ingest has several benefits. It maintains consistency across pathogen repositories, lets reviewers focus on what is new, and highlights changes that need to be propagated to other repositories.)