snakemake / snakemake-wrappers

This is the development home of the Snakemake wrapper repository, see
https://snakemake-wrappers.readthedocs.io
216 stars 186 forks source link

wrappers for hap.py and dipcall #294

Open nate-d-olson opened 3 years ago

nate-d-olson commented 3 years ago

Is your feature request related to a problem? Please describe. Difficulty developing an assembly benchmarking pipeline due to the dependencies associated with two key bioinformatic tools use in the pipeline hap.py and dipcall. I know there is a snakemake wrapper for the hap.py pre.py command but we are running into a regex issue when running the hap.py bioconda package in the snakemake/snakemake docker container.

Describe the solution you'd like We would like to have snakemake wrappers for hap.py, https://github.com/Illumina/hap.py, and dipcall, https://github.com/lh3/dipcall.

Describe alternatives you've considered Tried using docker containers to handle dependencies but we were unable to get the pipeline to work on both macOS and Linux. Tried using the hap.py bioconda package but ran into a regex error when run in the snakemake/snakemake container. There is an issue on the hap.py github repo related to the error, https://github.com/Illumina/hap.py/issues/66, which seems to be related to specific gcc versions.

Additional context As part of the Genome in a Bottle project, https://www.nist.gov/programs-projects/genome-bottle we are working to improve the usability and interpretability of our methods for benchmarking variant calls. Hap.py is used for small variant benchmarking and is the reference implementation of the GA4GH best practices for small variant benchmarking. dipcall is used to generate variant calls from a diploid assembly. We developed a proof of concept pipeline using snakemake, https://github.com/usnistgov/giab-asm-benchmarking, but the pipeline only works on macOS due to a hack we used to handle the hap.py and dipcall dependencies.

mbargull commented 3 years ago

Hi @nate-d-olson,

Tried using the hap.py bioconda package but ran into a regex error when run in the snakemake/snakemake container. There is an issue on the hap.py github repo related to the error, Illumina/hap.py#66, which seems to be related to specific gcc versions.

It looks like version 0.3.12 of hap.py fixed the regex issue but we only got that version into Bioconda a couple of months after its release and Snakemake wrapper creation. I'll relay the need for the updated wrapper right away. And please feel free to let us know of any further issue you might encounter after we get the updated wrapper up!

Cheers, Marcel

johanneskoester commented 3 years ago

There is another fix needed upstream, see https://github.com/Illumina/hap.py/pull/113

johanneskoester commented 3 years ago

295

johanneskoester commented 3 years ago

Ok, first PR is merged, hap.py should now work fine for you.

johanneskoester commented 3 years ago

@mbargull has a look at dipcall now.

nate-d-olson commented 3 years ago

Thanks for your responsiveness and for working to update the hap.py bioconda package and hap.py/pre.py wrapper. We use the main hap.py function, which does the comparison and calculates performance metrics, in our pipeline. I tried to run the main hap.py command, hap.py {TRUTH}.vcf.gz {QUERY}.vcf.gz -f {TRUTH REGIONS}.bed -r {REF}.fa --engine=vcfeval --stratification {STRATIFICATIONS}.tsv -o {OUTDIR}, with the new conda package and the snakemake/snakemake docker container and get the following error, likely introduced with the recent update to hap.py (https://github.com/Illumina/hap.py/pull/113).

2021-03-03 16:37:26,577 WARNING  No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
Hap.py 
2021-03-03 16:37:26,700 ERROR    'Namespace' object has no attribute 'convert_gvcf_truth'
2021-03-03 16:37:26,700 ERROR    Traceback (most recent call last):
2021-03-03 16:37:26,701 ERROR      File "/opt/conda/envs/happy/bin/hap.py", line 529, in <module>
2021-03-03 16:37:26,701 ERROR        main()
2021-03-03 16:37:26,701 ERROR      File "/opt/conda/envs/happy/bin/hap.py", line 286, in main
2021-03-03 16:37:26,701 ERROR        if args.convert_gvcf_truth:
2021-03-03 16:37:26,701 ERROR    AttributeError: 'Namespace' object has no attribute 'convert_gvcf_truth'
johanneskoester commented 3 years ago

They should really have a test framework. Yes, the main has to be adapted as well.

nate-d-olson commented 3 years ago

Here is my first attempt at making a wrapper for the main hap.py command. https://github.com/nate-d-olson/snakemake-wrappers/commit/34e7de271a9cc14546f51436d9f8986d04935813

I had a few questions regarding recommendations/ best practices for creating wrappers.

  1. Defining default values for a variable
  2. Recommendations for when the output is a directory, or should all expected output files be explicitly defined.
  3. Recommendation for when input is a directory (here a directory of bed files is used to define the stratified inputs)
  4. Suggested file size for test data.
johanneskoester commented 3 years ago

Thanks for asking, I've added comments on your commit.

nate-d-olson commented 3 years ago

@jafors and @johanneskoester thank you for your help with the hap.py bioconda packages and snakemake-wrapper. Now that we have a dipcall bioconda package I wanted to start working on the dipcall snakemake-wrapper. Running dipcall is a two-step process; first run-dipcall generates a makefile with the dipcall pipeline, then make is used to execute the pipeline. See example command from the dipcall README below, https://github.com/lh3/dipcall. Should the dipcall wrapper perform both steps, have separate wrappers for both step with a meta-wrapper combining the two, or something else?

dipcall.kit/run-dipcall prefix hs38.fa pat.fa.gz mat.fa.gz > prefix.mak
# or for male, requiring PAR regions in BED
# dipcall.kit/run-dipcall -x dipcall.kit/hs38.PAR.bed prefix hs38.fa pat.fa.gz mat.fa.gz > prefix.mak
make -j2 -f prefix.mak
johanneskoester commented 3 years ago

So, I'd say, if the makefile generation is super quick, just put both in one wrapper. If it takes time, while being less parallel than the second step, use two wrappers, and ideally add a meta-wrapper for their combination.

github-actions[bot] commented 1 week ago

This issue was marked as stale because it has been open for 6 months with no activity.