nextstrain / cli

The Nextstrain command-line interface (CLI), a program called nextstrain, aims to provide a consistent way to run and visualize pathogen builds and access Nextstrain components like Augur and Auspice across computing environments such as Docker, Conda, and AWS Batch.
https://docs.nextstrain.org/projects/cli/
MIT License

aws-batch: support Snakemake `--report` #373

Closed: joverlee521 closed this issue 3 months ago

joverlee521 commented 3 months ago

Context

Snakemake removed the `--stats` option in v8, so I'm looking into the `--report` option for long-term workflow stats.

The Snakemake report must be generated after the workflow has finished. I thought this would be as simple as attaching to an old AWS Batch job, downloading its results, and then running `nextstrain build . --report`.

When I did this for ncov-ingest, I saw a bunch of warnings along the lines of:

Missing metadata for file data/gisaid/metadata.tsv. Maybe metadata was deleted or it was created using an older version of Snakemake. This is a non critical warning.

I then realized we are explicitly excluding Snakemake state in the downloads from AWS Batch:

https://github.com/nextstrain/cli/blob/8ed779c9741da868341ca4518e8eff83ffba8e60/nextstrain/cli/runner/aws_batch/s3.py#L113-L124

Possible solutions

  1. Include .snakemake/metadata in the downloads from AWS Batch so that users can generate the Snakemake report locally.
  2. Automatically generate the Snakemake report within the AWS Batch job so that users can download the rendered report.

[2] definitely seems like the nicer option and maybe should be applied across all runtimes for nextstrain build?

tsibley commented 3 months ago

Hmm. Downloading the Snakemake state locally may fix this problem, but it can/will cause other problems. I don't know whether it'd be ok to scope the download down from all of .snakemake/ to just .snakemake/metadata/, as you suggest in [1].
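To make the scoping question concrete, here is a minimal sketch of a download filter that skips Snakemake state but can optionally let `.snakemake/metadata/` through. It is not the code in the linked `s3.py`; the function name `should_download` and the `keep_metadata` flag are hypothetical names made up for this illustration.

```python
from pathlib import PurePosixPath


def should_download(path: str, keep_metadata: bool = False) -> bool:
    """Decide whether a file from the remote build directory should be downloaded.

    Hypothetical helper for illustration only; not the Nextstrain CLI's implementation.
    """
    parts = PurePosixPath(path).parts

    # Anything outside .snakemake/ is always downloaded.
    if ".snakemake" not in parts:
        return True

    # Inside .snakemake/, optionally allow only the metadata/ subdirectory
    # and keep skipping the rest of the state directory.
    i = parts.index(".snakemake")
    return keep_metadata and parts[i + 1 : i + 2] == ("metadata",)
```

With `keep_metadata=False` the whole state directory is excluded, matching the current behavior described above. With `keep_metadata=True` only `.snakemake/metadata/` is added back, so logs, locks, and other state stay remote.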

What's the effect of the warnings? Is there useful information missing from the report? Or is it just a matter of wanting to suppress the noise from Snakemake?

joverlee521 commented 3 months ago

> What's the effect of the warnings? Is there useful information missing from the report? Or is it just a matter of wanting to suppress the noise from Snakemake?

The generated report does not include any runtime info:

[Image: Screenshot 2024-06-17 at 2 21 33 PM]

tsibley commented 3 months ago

Ah, looking more closely at the contents of .snakemake/metadata/, I do think we want to start downloading it by default. It's mostly info used to determine if Snakemake needs to re-run rules based on inputs/outputs, and thus is akin to the file mtimes which we already preserve on download.
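For a sense of why the report needs these files, here is a rough sketch that sums per-rule runtimes from the metadata directory. It assumes each file under `.snakemake/metadata/` is a JSON record with `rule`, `starttime`, and `endtime` keys. That is an assumption about Snakemake's on-disk format, which is not a documented interface and may differ between versions. The function name `rule_runtimes` is hypothetical.

```python
import json
from pathlib import Path


def rule_runtimes(metadata_dir: str) -> dict[str, float]:
    """Sum per-rule runtime in seconds from JSON records in metadata_dir.

    Assumes each record has "rule", "starttime", and "endtime" keys (an
    assumption about Snakemake's format); unreadable, non-JSON, or
    incomplete records are skipped.
    """
    totals: dict[str, float] = {}
    for path in sorted(Path(metadata_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            record = json.loads(path.read_text())
        except (OSError, ValueError):
            # Unreadable file or invalid JSON: ignore it.
            continue
        try:
            rule = record["rule"]
            elapsed = float(record["endtime"]) - float(record["starttime"])
        except (KeyError, TypeError, ValueError):
            # Record lacks the assumed fields or has unexpected types.
            continue
        totals[rule] = totals.get(rule, 0.0) + elapsed
    return totals
```

If the directory is never downloaded, a function like this has nothing to read, which is consistent with the report lacking runtime info.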

tsibley commented 3 months ago

> 1. Include .snakemake/metadata in the downloads from AWS Batch so that users can generate the Snakemake report locally.

We should do this, per above. I'll open a PR.

> 2. Automatically generate the Snakemake report within the AWS Batch job so that users can download the rendered report
>
> [2] definitely seems like the nicer option and maybe should be applied across all runtimes for nextstrain build?

We could do this as well, but it requires a little more consideration about how/where/when. Would you open it as a separate issue if you'd like to see it?

tsibley commented 3 months ago

This will also need a new docker-base image, as the same exclusions of .snakemake/ are recapitulated there:

https://github.com/nextstrain/docker-base/blob/ccac0787cbc6118d0e39518e2220f650e121b8bd/entrypoint-aws-batch#L43-L44

joverlee521 commented 3 months ago

> 2. Automatically generate the Snakemake report within the AWS Batch job so that users can download the rendered report
>
> [2] definitely seems like the nicer option and maybe should be applied across all runtimes for nextstrain build?

> We could do this as well, but it requires a little more consideration about how/where/when. Would you open it as a separate issue if you'd like to see it?

Hmm, maybe this doesn't need to be built into the Nextstrain CLI. It could just be a separate step in the pathogen-repo-build workflow so we have reports for our automated pathogens.
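As a sketch of what such a separate step might run, the snippet below builds and executes a report command after the main build has finished. It assumes that arguments after the build directory are passed through to Snakemake, as in the `nextstrain build . --report` invocation mentioned earlier, and that `--report` accepts an output file path. The function names and the `report.html` default are hypothetical, and the real workflow step would live in the workflow's own configuration rather than in Python.

```python
import subprocess


def report_command(build_dir: str = ".", report_file: str = "report.html") -> list[str]:
    """Build the argv for generating a Snakemake report via `nextstrain build`.

    Hypothetical helper; assumes extra arguments are passed through to Snakemake.
    """
    return ["nextstrain", "build", build_dir, "--report", report_file]


def generate_report(build_dir: str = ".", report_file: str = "report.html") -> None:
    """Run the report command after the workflow itself has completed."""
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(report_command(build_dir, report_file), check=True)
```

Keeping the argv construction separate from the `subprocess.run` call makes the command easy to inspect or log before anything is executed.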

tsibley commented 3 months ago

> It could just be a separate step in the pathogen-repo-build workflow so we have reports for our automated pathogens.

Totally.