harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
63 stars 30 forks source link

Support Snakemake>8.0 #170

Closed cademirch closed 2 months ago

cademirch commented 2 months ago

This PR includes changes to support Snakemake > 8. This will make snpArcher no longer compatible with Snakemake <8 Edit: I've added backwards compatibility with snakemake 7. This also includes some refactoring of how the sample sheet is parsed, addition of refgenome config options, removal of the resources config in favor of profiles, and finally documentation updates.

Reference genome config option:

cademirch commented 2 months ago

I can't seem to get pyd4 to build in the github actions. I spent awhile making my own snakemake docker with python 3.11 to no avail. Will probably just turn off cov filter for the tests. It seems pyd4 is not actively maintained anymore. @tsackton

tsackton commented 2 months ago

I think we should probably look at dumping the pyd4 dependency as it is the cause of a lot of trouble.

cademirch commented 2 months ago

I think we should probably look at dumping the pyd4 dependency as it is the cause of a lot of trouble.

Yeah, I agree. What would it take to do so? If I remember correctly, the create_cov_bed script didn't always use d4tools? We can also look into using mosdepth for callable bed: . https://github.com/brentp/mosdepth?tab=readme-ov-file#callable-regions-example

tsackton commented 2 months ago

I think that we probably would want to rethink our coverage calculations to drop pyd4. Currently we compute per-base coverage for each sample, and then average across samples to get an average coverage for each base, and then filter based on that. The "compute the average coverage across samples" step is very slow using bedgraphs or bigwigs, but was quite fast with pyd4.

But, we might be able to get rid of that step. One simple option would be to merge bams and then compute per-base coverage from the merged bam, and just divide by the number of samples to get mean coverage for each position. There are probably other options as well, just need to think on it a bit.

cademirch commented 2 months ago

Update on pyd4. The issue is in building the d4 rust crate, specifically their build_htslib step which is missing a -L flag in their curl command. See this PR for more info. To get around this, I made a .curlrc file in my homedir and added the option -L to it. This makes curl use -L by default, which then allows the build to complete.

Not sure if/when they will merge my PR, so I'll think about how to incorporate the .curlrc workaround into snpArcher. It could also be a documentation thing.

erikenbody commented 2 months ago

if we make that rule a shell command, could we just append it to .curlrc? e.g. something like

echo -L >> ~/.curlrc
pip install pyd4
pyrhon3 create_bed.py
cademirch commented 2 months ago

I think I'll just add it as python code to common.smk so that it runs when the snakefile is run. Not a fan since it will mess with user's curl command, but oh well. Could also set an alias. The thing is we have to make sure that the -L option is set before the cov_filter env is built, which is before any rules are executed

cademirch commented 2 months ago

This should be ready to merge. I've just added backwards compatibility with snakemake 7. @erikenbody @tsackton