nextflow-io / nf-prov

Infer source URLs for remote paths #5

Closed bentsherman closed 10 months ago

bentsherman commented 11 months ago

Workflow inputs derived from the pipeline repository are provided to nf-prov as local paths, e.g. ~/.nextflow/assets/nextflow-io/rnaseq-nf/multiqc. nf-prov has code to convert these paths into the source URL (e.g. https://github.com/nextflow-io/rnaseq-nf/tree/d910312506c6539365ed70aacda5068dea9152dd/multiqc), which is much more user-friendly.
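As a rough illustration of that conversion (not the nf-prov implementation, which is Groovy), here is a minimal Python sketch. It assumes the assets directory is laid out as `<assets_root>/<owner>/<repo>/<path>` and that the commit ID of the checked-out revision is already known. The function name and its parameters are hypothetical.

```python
from pathlib import PurePosixPath


def asset_path_to_github_url(local_path, assets_root, commit_id):
    """Map a path under the Nextflow assets directory to a GitHub tree URL.

    Assumes the layout <assets_root>/<owner>/<repo>/<path inside repo>.
    """
    # relative_to() raises ValueError if local_path is not under assets_root
    rel = PurePosixPath(local_path).relative_to(PurePosixPath(assets_root))
    parts = rel.parts
    if len(parts) < 2:
        raise ValueError(f"expected <owner>/<repo>/... under assets root, got: {rel}")
    owner, repo, *inner = parts
    url = f"https://github.com/{owner}/{repo}/tree/{commit_id}"
    if inner:
        url += "/" + "/".join(inner)
    return url
```

Supporting another Git provider would mostly mean swapping the URL template, since the owner, repository, and inner path are recovered the same way.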

Currently only GitHub URLs are supported, so support should be extended to the other major Git providers.

bentsherman commented 11 months ago

Remote input files outside of the pipeline repo are staged into a special "stage" directory in the work directory.

For example, consider nf-core/rnaseq -profile test, which downloads some files from GitHub over HTTP:

https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/reference/genome.fasta

This URL is correctly shown in the params (parametric_domain) but not in the workflow inputs (io_domain) or task inputs:

work/stage-0e2cf8f5-c488-4846-8ee5-e47381d6ab73/35/b49c5eb3bfee405eb379e12c4580a3/genome.fasta

The stage path has the form work/stage-${sessionId}/${unique}/${hash}/${sourceName}, where the unique hash is derived from the source path, the stage directory (work/stage-${sessionId}), and an incrementing index that avoids hash collisions (e.g. when multiple tasks download the same file at the same time).
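To make the layout concrete, here is a small Python sketch that recognizes a stage path and splits it into the four components named above. The regular expression is an assumption based only on the example path in this thread (a 36-character session ID followed by two directory levels), not on the Nextflow source, and `parse_stage_path` is a hypothetical helper.

```python
import re

# Assumed shape: .../stage-<sessionId>/<unique>/<hash>/<sourceName>
STAGE_PATH = re.compile(
    r"(?:^|/)stage-(?P<session>[0-9a-f-]{36})"
    r"/(?P<unique>[^/]+)/(?P<hash>[^/]+)/(?P<name>[^/]+)$"
)


def parse_stage_path(path):
    """Return the stage path components as a dict, or None if path is not a stage path."""
    match = STAGE_PATH.search(path)
    return match.groupdict() if match else None
```

A check like this could be used to decide which workflow or task inputs need their source URL recovered at all.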

So I don't think we can recover the source URL from the stage path alone, but the source URL will be somewhere in the params. We could try to match the file by base name, e.g. genome.fasta in the above example, which will work unless several workflow inputs share the same base name under different directories.
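The base-name idea could look roughly like the Python sketch below. It assumes params is a flat dict of values and treats any string with one of a few URL schemes as a remote file. The scheme list and both function names are assumptions made for illustration. When a base name is ambiguous, the lookup returns None instead of guessing.

```python
from collections import defaultdict
from posixpath import basename
from urllib.parse import urlparse

# Assumed set of schemes that mark a param value as a remote file
REMOTE_SCHEMES = {"http", "https", "ftp", "s3"}


def index_remote_params(params):
    """Group remote-looking param values by the base name of their URL path."""
    by_name = defaultdict(set)
    for value in params.values():
        if isinstance(value, str) and urlparse(value).scheme in REMOTE_SCHEMES:
            by_name[basename(urlparse(value).path)].add(value)
    return by_name


def resolve_by_basename(stage_path, by_name):
    """Return the single source URL whose base name matches, else None."""
    candidates = by_name.get(basename(stage_path), set())
    # Zero matches or several distinct URLs with the same base name: give up
    return next(iter(candidates)) if len(candidates) == 1 else None
```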

Another approach would be to try to reproduce the stage directories from params:

  1. take each param that looks like a remote file
  2. generate the stage path for the file (see FilePorter.Batch::getCachePathFor())
  3. replace any occurrences of the stage path with the source URL
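The three steps above can be sketched in Python as follows. The real stage path computation lives in Nextflow (FilePorter.Batch::getCachePathFor()) and is not reproduced here. Instead, `stage_path_for` is a hypothetical callable standing in for it, and the scheme list used to decide what "looks like a remote file" is also an assumption.

```python
from urllib.parse import urlparse

# Assumed set of schemes that mark a param value as a remote file
REMOTE_SCHEMES = {"http", "https", "ftp", "s3"}


def build_stage_map(params, stage_path_for):
    """Steps 1 and 2: map the stage path of each remote-looking param to its URL.

    stage_path_for is a placeholder for the real stage path computation.
    """
    stage_map = {}
    for value in params.values():
        if isinstance(value, str) and urlparse(value).scheme in REMOTE_SCHEMES:
            stage_map[stage_path_for(value)] = value
    return stage_map


def restore_source_urls(text, stage_map):
    """Step 3: replace every occurrence of a known stage path with its source URL."""
    # Longest paths first, so a path that is a prefix of another is handled last
    for stage_path in sorted(stage_map, key=len, reverse=True):
        text = text.replace(stage_path, stage_map[stage_path])
    return text
```

Unlike base-name matching, this approach would not be confused by inputs that share a file name, but it depends on reproducing the stage path exactly as Nextflow does.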