nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue
8 stars 10 forks source link

Generalize Ingest #6

Closed j23414 closed 3 months ago

j23414 commented 1 year ago

Description of proposed changes

Ingest data from genbank to generate:

for a dengue build.

Instead of separately pulling denv1 to denv4, all types are combined in one file with an annotated column: (2023-03-25, to avoid confusion keep serotypes separate)

Screen Shot 2022-11-17 at 3 50 30 PM

Unordered list of remaining tasks that may change:

Related issue(s)

Testing

Local Test

Can test this locally by running

git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout new_ingest
cd ingest
nextstrain build .

ls -ltr data
wc -l data/metadata_all.tsv

May need to install tidyverse (No longer need R since the script was refactored into python)

R
install.packages(tidyverse)
tsibley commented 1 year ago

I haven't done a detailed review, but a couple high-level comments to start off:

  1. I don't think just-in-time downloading of programs from monkeypox is the way we want to reuse stuff. It seems to me that it will be fragile and hard to maintain over time.

  2. While I personally enjoy multi-lingual projects and am sympathetic to the best tools being the ones you know (and do like tidyverse), I think we should replace the lone R program here with something in Python for consistency with the rest of the repo/ecosystem.

    The R program also won't work in our Docker runtime (e.g. try it with nextstrain build --docker ingest/) and if it works in our Conda and ambient runtimes for someone, it's only by chance. We could add R support to our runtimes, but I think that's a bigger scope of work. I also noted that the R program loads the whole metadata file into memory when it could instead do a streaming transform (e.g. as csv-to-ndjson does).

j23414 commented 1 year ago

I haven't done a detailed review, but a couple high-level comments to start off:

  1. I don't think just-in-time downloading of programs from monkeypox is the way we want to reuse stuff. It seems to me that it will be fragile and hard to maintain over time.
  2. While I personally enjoy multi-lingual projects and am sympathetic to the best tools being the ones you know (and do like tidyverse), I think we should replace the lone R program here with something in Python for consistency with the rest of the repo/ecosystem. The R program also won't work in our Docker runtime (e.g. try it with nextstrain build --docker ingest/) and if it works in our Conda and ambient runtimes for someone, it's only by chance. We could add R support to our runtimes, but I think that's a bigger scope of work. I also noted that the R program loads the whole metadata file into memory when it could instead do a streaming transform (e.g. as csv-to-ndjson does).

Point 2 is addressed with https://github.com/nextstrain/dengue/pull/6/commits/2c8553ae4618c42635afefbf96b50aa924a5964a

Re: Point 1, I still disagree...mostly because I'm currently trying to propagate changes across zika and ebola ingests which is already tedious when copying changes across their snakefiles. One might argue to only polish ingest scripts on one repo (dengue), however in my experience that results in over-specialized scripts that become really difficult to generalize later.

To meet the final end-goal of Point 1 (if not the immediate developmental path of Point 1), I'm happy to make final copies in a future PR after I'm sure the pipeline works for:

Otherwise my dev path has already branched into:

tsibley commented 1 year ago

@j23414 Nod. I don't want to get in the way of your active development.

As a final state, though, I still think we should not rely on this sort of dynamic downloading. It's got all the problems of a package management system (dependencies, versioning, updates, etc), without any of the solutions a mature package management system provides.

There's also issues of brittle close coupling it introduces. Consider that someone changing files in the monkeypox workflow would (rightfully so, I'd argue) not think that modifying one of the programs in scripts/ could break the unrelated dengue workflow.

j23414 commented 1 year ago

Working on rebasing this! Do not review yet! Thanks @victorlin!!!

j23414 commented 3 months ago

This PR superseded by merged PRs: