cademirch closed this 2 years ago
One solution here may be a helper script to organize everything properly. This is what we ended up doing with the old Python-based pipeline; it is a little clunky but not unmanageable. Ideally the helper script would also create the proper sample sheets for the user, perhaps taking fixed values as command line parameters, or reading from a very simple config file.
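A minimal sketch of what that helper script could look like: it symlinks local fastq files into a directory layout the workflow can consume and writes the sample sheet. The layout (`results/<sample>/fastq/`) and the sheet columns (`BioSample`, `refGenome`, `fq1`, `fq2`) are assumptions for illustration, not the workflow's actual conventions.

```python
# Hypothetical helper: place local reads where the workflow expects them
# and emit a sample sheet. Directory layout and column names are assumed.
import csv
from pathlib import Path

def organize(samples, ref_genome, outdir="results"):
    """samples: dict mapping sample name -> (fq1_path, fq2_path).
    Symlinks reads into place; returns the path of the written sheet."""
    out = Path(outdir)
    rows = []
    for name, (fq1, fq2) in samples.items():
        dest = out / name / "fastq"
        dest.mkdir(parents=True, exist_ok=True)
        links = []
        for fq in (fq1, fq2):
            link = dest / Path(fq).name
            if not link.exists():
                link.symlink_to(Path(fq).resolve())
            links.append(str(link))
        rows.append({"BioSample": name, "refGenome": ref_genome,
                     "fq1": links[0], "fq2": links[1]})
    sheet = out / "samples.csv"
    with sheet.open("w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["BioSample", "refGenome", "fq1", "fq2"])
        writer.writeheader()
        writer.writerows(rows)
    return sheet
```

The fixed values (reference accession, output directory) could come from command line flags or a small config file, as suggested above.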
An alternate solution would be to add some if-then logic to the Snakemake pipeline: if the fastq files are specified as a full path instead of an accession, use the files at that path; likewise, if the genome is specified as a full path rather than an accession, use it directly as the reference.
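The if-then approach could be expressed as a Snakemake input function. This is only a sketch: it assumes the sheet has been loaded into a dict keyed by sample name, that the columns are called `fq1`/`fq2`, and that the download rule writes to `results/fastq/` — all hypothetical names.

```python
# Sketch of the if-then logic as a Snakemake-style input function:
# if the sheet entry is an existing file on disk, use it directly;
# otherwise treat it as an SRA accession and point at the paths the
# download rule would produce. Column names and layout are assumptions.
from pathlib import Path

def get_fastqs(wildcards, samples):
    """samples: dict mapping sample name -> row dict from the sheet."""
    row = samples[wildcards.sample]
    if Path(row["fq1"]).exists():
        # User supplied local reads: use them as-is.
        return {"r1": row["fq1"], "r2": row["fq2"]}
    # Otherwise assume the entry is an accession to be downloaded.
    acc = row["fq1"]
    return {"r1": f"results/fastq/{acc}_1.fastq.gz",
            "r2": f"results/fastq/{acc}_2.fastq.gz"}
```

In the Snakefile this could be wired up as `input: unpack(lambda wc: get_fastqs(wc, samples))`, so downstream rules never need to know which branch was taken.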
Not sure which would be easier for the user and/or easier to code and maintain.
Currently the workflow is set up such that users supply a CSV sheet specifying the samples and their associated SRRs and reference. This makes running the workflow on public datasets very easy. However, running it on your own data is not straightforward. I've done it, but I had to organize the reads into the directory structure that Snakemake would have created had it downloaded the reads itself, so that the workflow would run everything after the downloading step.
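For concreteness, a sheet in the current accession-based mode might look something like this (the column names and values here are illustrative, not the workflow's actual schema):

```csv
BioSample,refGenome,Run
sample_1,GCA_000001405.15,SRR1234567
sample_2,GCA_000001405.15,SRR1234568
```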
I still haven't come up with a solution I like to this. Would appreciate ideas/thoughts.