Closed by cademirch 2 years ago
Hi Cade,
This is excellent, thanks for the hard work! I will review in more detail over the next couple of days, but a few quick comments right now:
1) The freebayes workflow is more or less deprecated, and I don't think it should be a high priority. While it would ultimately be sensible to support alternate SNP callers, the workhorse for all the analysis we intend is the GATK pipeline, so we should focus on that. This will also make it easier to run fastq -> vcf, since we won't need to handle the differences between GATK and FreeBayes outputs and setup in one rule.
2) @sjswuitchik is also working on testing so you two should collaborate so as to avoid duplicating effort. It might make sense to work on this in a branch of this repo instead of a fork, as that will make collaborating a bit more straightforward.
3) For the fastqs / gzip rule, part of the logic here is that the fastqs won't always be deleted. For example, someone running on local data (not from SRA) won't want their fastqs deleted. Ideally the same setup should support both local and SRA runs, with the only requirement being that local files are arranged in the directory structure Snakemake expects, so it sees everything as already present. But this means we need to be really careful about deleting files in the pipeline itself.
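One way to express this distinction is a small helper that checks whether a sample's fastqs already exist on disk before the pipeline decides anything about them. This is just a sketch of the idea; the function name, the `fastq_dir` default, and the `{sample}_{1,2}.fastq.gz` layout are all illustrative assumptions, not the pipeline's actual structure:

```python
from pathlib import Path

def fastq_is_local(sample: str, fastq_dir: str = "data/fastq") -> bool:
    """Return True if both read files for `sample` are already on disk.

    Fastqs that already exist are treated as user-provided and must never
    be deleted; only fastqs the pipeline downloads itself (e.g. via
    fasterq-dump) would be safe to mark as temporary.
    """
    d = Path(fastq_dir)
    return all((d / f"{sample}_{r}.fastq.gz").exists() for r in ("1", "2"))
```

In a Snakemake rule this check could gate whether the download rule's outputs are wrapped in `temp(...)`, so that only SRA-fetched files are ever cleaned up.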
More soon
That said, I am going to merge this into the restruct_dirs branch of the main repo so we can do some local testing too, and fix any merge conflicts with @sjswuitchik's latest bugfixes. I think @cademirch you have push access to this repo, so if you find additional bugs you should be able to push fixes directly to this branch.
Hey everyone,
I've made some pretty big changes to the overall organization of the repo, mainly trying to follow the Snakemake best practices.
The actual workflow is still more or less the same. There is now one central `Snakefile` in `workflows/`, but it currently only produces the fastq2bam files. It seems relatively easy to make this Snakefile go all the way from fastq -> vcf, and I am currently working on that. I have only updated the bam2vcf_gatk workflow to the new organization; I am currently working on the freebayes workflow. All of the workflows can handle multiple organisms and reference genomes.

There is a new file, `rules/common.smk`, that contains some input functions for the big aggregation rules. This seems to be the Snakemake-recommended way. I have updated the various functions in `workflow/helperFun.py` to output files with the correct paths and filenames for the reorganized directory structure. I also added a `tmp_dir` field in `config.yml` to specify the temp directory for fasterq-dump.

Still todo:
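To illustrate what the input functions in `rules/common.smk` look like in general: an aggregation rule (e.g. joint genotyping across samples) takes a function of `wildcards` that expands to every per-sample file. The sample table and the path template below are assumptions for illustration only; the real workflow derives these from its own config and helpers:

```python
# Hypothetical sample table; the real workflow builds this from config.
SAMPLES_BY_ORGANISM = {
    "org1": ["s1", "s2"],
    "org2": ["s3"],
}

def get_gvcfs(wildcards):
    """Input function for an aggregation rule: return all per-sample
    GVCF paths for the organism currently being processed."""
    return [
        f"results/{wildcards.organism}/gvcfs/{s}.g.vcf.gz"
        for s in SAMPLES_BY_ORGANISM[wildcards.organism]
    ]
```

In the Snakefile, a rule would then use `input: get_gvcfs`, and Snakemake calls the function with the rule's resolved wildcards.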
I have also been testing this workflow on different datasets on our own local servers and have run into some issues that will be helpful to document for future users:
I think this covers everything, let me know if you have any questions or concerns. Would appreciate testing and/or feedback.