harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Make it easier for users to supply own reads and reference #21

Closed: cademirch closed this issue 2 years ago

cademirch commented 2 years ago

Currently the workflow is set up such that users supply a CSV sample sheet specifying the samples, their associated SRA run accessions (SRRs), and the reference. This makes running the workflow on public datasets very easy. However, running the workflow on your own data is not straightforward. I've gotten it to run on my own data, but only by organizing the reads into the directory structure Snakemake would have created had it downloaded the reads for me, so that the workflow would run everything after the download step.
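
For context, the sample sheet currently looks roughly like this (column names here are just for illustration, not the exact schema):

```
sample,run_accession,ref_accession
sample_A,SRR1234567,GCA_000000000.1
sample_B,SRR7654321,GCA_000000000.1
```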

I still haven't come up with a solution to this that I like. Would appreciate ideas/thoughts.

tsackton commented 2 years ago

One solution here may be a helper script to organize everything properly. This is what we ended up doing with the old Python-based pipeline; it is a little clunky but not unmanageable. Ideally the helper script would also create the proper sample sheets for the user, perhaps taking fixed values as command line parameters, or reading from a very simple config file.
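Something along these lines is what I have in mind. Everything here is a placeholder sketch: the expected directory layout, column names, and flags are made up to illustrate the idea, not an existing snpArcher tool.

```python
#!/usr/bin/env python3
"""Hypothetical helper: symlink local fastqs into the layout the workflow expects
and write a sample sheet. Paths and column names are illustrative assumptions."""
import argparse
import csv
from pathlib import Path

def main():
    p = argparse.ArgumentParser()
    p.add_argument("reads_dir", type=Path, help="directory of *_1.fastq.gz / *_2.fastq.gz files")
    p.add_argument("--ref", required=True, help="path to a local reference fasta")
    p.add_argument("--outdir", type=Path, default=Path("results/fastq"))
    p.add_argument("--sheet", type=Path, default=Path("config/samples.csv"))
    args = p.parse_args()

    args.outdir.mkdir(parents=True, exist_ok=True)
    args.sheet.parent.mkdir(parents=True, exist_ok=True)

    rows = []
    for r1 in sorted(args.reads_dir.glob("*_1.fastq.gz")):
        sample = r1.name.replace("_1.fastq.gz", "")
        r2 = r1.with_name(f"{sample}_2.fastq.gz")
        # Mimic the directory structure the download step would have created.
        for src in (r1, r2):
            dest = args.outdir / src.name
            if not dest.exists():
                dest.symlink_to(src.resolve())
        rows.append({"sample": sample, "fq1": str(r1), "fq2": str(r2), "refPath": args.ref})

    # Emit a sample sheet the workflow can consume directly.
    with open(args.sheet, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["sample", "fq1", "fq2", "refPath"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    main()
```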

An alternate solution would be to add some if-then logic to the Snakemake pipeline: if the fastq files are specified as a full path instead of an accession, use the files at that path, and if the genome is specified as a full path rather than an accession, use that as the reference.
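A rough sketch of the if-then logic for the reads, using a Snakemake input function. The column names ("fq1", "fq2", "Run") and the download rule's output paths are assumptions for illustration, not the pipeline's actual layout.

```python
import os
import pandas as pd

# Assumed sample sheet with optional local-path columns alongside the accession.
samples = pd.read_csv("config/samples.csv", index_col="sample")

def get_fastq(wildcards):
    row = samples.loc[wildcards.sample]
    # If the sheet gives a local path that exists, use it directly...
    if isinstance(row.get("fq1"), str) and os.path.exists(row["fq1"]):
        return {"r1": row["fq1"], "r2": row["fq2"]}
    # ...otherwise fall back to the files the SRA download rule would produce.
    return {
        "r1": f"results/fastq/{row['Run']}_1.fastq.gz",
        "r2": f"results/fastq/{row['Run']}_2.fastq.gz",
    }

rule trim_reads:
    input:
        unpack(get_fastq)
    output:
        r1="results/trimmed/{sample}_1.fastq.gz",
        r2="results/trimmed/{sample}_2.fastq.gz",
    shell:
        "fastp -i {input.r1} -I {input.r2} -o {output.r1} -O {output.r2}"
```

The same pattern would apply to the reference: an input function that returns the user-supplied fasta path when present and the downloaded genome otherwise.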

Not sure which would be easier for the user and/or easier to code and maintain.