Closed by cademirch 2 years ago
Hi Cade,
This is excellent, thanks for the hard work! I will review in more detail over the next couple of days, but a few quick comments right now:
1) The freebayes workflow is more or less deprecated, and I don't think it should be a high priority. While it would ultimately be sensible to support alternate SNP callers, the workhorse for all the analysis we intend is the GATK pipeline, so we should focus on that. This will also make it easier to run fastq -> vcf, since we won't need to handle the differences between GATK and FreeBayes outputs and setup in one rule.
2) @sjswuitchik is also working on testing so you two should collaborate so as to avoid duplicating effort. It might make sense to work on this in a branch of this repo instead of a fork, as that will make collaborating a bit more straightforward.
3) For the fastqs / gzip rule, part of the logic here is that the fastqs won't always be deleted. For example, someone running on local data (not from SRA) won't want their fastqs deleted. Ideally the same setup should support both local and SRA runs, with the only requirement being that local files are arranged in the directory structure Snakemake expects, so it sees everything as already present. But this means we need to be really careful about deleting files in the pipeline itself.
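One way to express this distinction is a small helper that checks whether a sample's fastqs already exist on disk before the pipeline decides anything about them. This is just a sketch of the idea; the function name, the `fastq_dir` default, and the `{sample}_{1,2}.fastq.gz` layout are all illustrative assumptions, not the pipeline's actual structure:

```python
from pathlib import Path

def fastq_is_local(sample: str, fastq_dir: str = "data/fastq") -> bool:
    """Return True if both read files for `sample` are already on disk.

    Fastqs that already exist are treated as user-provided and must never
    be deleted; only fastqs the pipeline downloads itself (e.g. via
    fasterq-dump) would be safe to mark as temporary.
    """
    d = Path(fastq_dir)
    return all((d / f"{sample}_{r}.fastq.gz").exists() for r in ("1", "2"))
```

In a Snakemake rule this check could gate whether the download rule's outputs are wrapped in `temp(...)`, so that only SRA-fetched files are ever cleaned up.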
More soon
That said, I am going to merge this into the restruct_dirs branch of the main repo so we can do some local testing too, and fix any merge conflicts with @sjswuitchik's latest bugfixes. I think @cademirch you have push access to this repo, so if you find additional bugs you should be able to push fixes directly to this branch.
Hey everyone,
I've made some pretty big changes to the overall organization of the repo, mainly trying to follow the Snakemake best practices.
The actual workflow is still more or less the same. There is now one central `Snakefile` in `workflows/`, but it currently only produces the fastq2bam files. It seems relatively easy to make this Snakefile go all the way from fastq -> vcf, and I am currently working on that. I have only updated the bam2vcf_gatk workflow to the new organization; I am currently working on the freebayes workflow. All of the workflows can handle multiple organisms and reference genomes.

There is a new file, `rules/common.smk`, that contains some input functions for the big aggregation rules. This seems to be the Snakemake-recommended way. I have updated the various functions in `workflow/helperFun.py` to output files with the correct paths and filenames for the reorganized directory structure. I also added a `tmp_dir` field in `config.yml` to specify the temp directory for fasterq-dump.

Still todo:
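To illustrate what the input functions in `rules/common.smk` look like in general: an aggregation rule (e.g. joint genotyping across samples) takes a function of `wildcards` that expands to every per-sample file. The sample table and the path template below are assumptions for illustration only; the real workflow derives these from its own config and helpers:

```python
# Hypothetical sample table; the real workflow builds this from config.
SAMPLES_BY_ORGANISM = {
    "org1": ["s1", "s2"],
    "org2": ["s3"],
}

def get_gvcfs(wildcards):
    """Input function for an aggregation rule: return all per-sample
    GVCF paths for the organism currently being processed."""
    return [
        f"results/{wildcards.organism}/gvcfs/{s}.g.vcf.gz"
        for s in SAMPLES_BY_ORGANISM[wildcards.organism]
    ]
```

In the Snakefile, a rule would then use `input: get_gvcfs`, and Snakemake calls the function with the rule's resolved wildcards.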
I have also been testing this workflow on different datasets on our own local servers and have run into some issues that will be helpful to document for future users:
I think this covers everything, let me know if you have any questions or concerns. Would appreciate testing and/or feedback.