Closed tomwhite closed 3 weeks ago
I'd be very happy to move it into this repo, but maybe it is actually something of more general use? How much trouble would packaging it separately be?
Here's a branch that uses hypothesis-vcf to generate VCFs: https://github.com/sgkit-dev/bio2zarr/compare/main...tomwhite:bio2zarr:hypothesis-vcf-tests
It's been passing for ~1000s of generated examples, which gives me confidence that vcf2zarr is handling lots of edge cases. But I just ran it again and it found a failing VCF which needs looking into. Perhaps we should run it as a separate GitHub Action workflow, or maybe even manually for the moment.
I've also had to modify the VCF generating code (currently in sgkit), so that's probably not quite ready to release separately yet either.
I'd be very happy to move it into this repo, but maybe it is actually something of more general use? How much trouble would packaging it separately be?
I think it would be useful generally, and could be listed on https://hypothesis.readthedocs.io/en/latest/strategies.html. It would need minimal packaging and just a README for documentation I think.
I've moved the hypothesis-vcf code into its own repository at https://github.com/tomwhite/hypothesis-vcf.
if that looks OK, I'd like to move it under https://github.com/sgkit-dev.
LGTM - I think it would be a great addition to sgkit-dev
The hypothesis-vcf code is now in https://github.com/sgkit-dev/hypothesis-vcf.
Thanks for fixing #251 @jeromekelleher. I've rebased and rerun the code in my branch at https://github.com/tomwhite/hypothesis-vcf and it hasn't found any more problems.
What do you think the next step is? Have a CI job that runs just the hypothesis tests once a day?
What do you think the next step is? Have a CI job that runs just the hypothesis tests once a day?
I'm not sure this would do anything different to just adding a hypothesis job as part of normal CI. If we tune it to run for < 30 seconds and it runs with a different seed each time, it shouldn't get in the way and give us good coverage. We're not expecting it to break, so shouldn't lead to noise for contributors.
I just had a quick look at the timings, and each call to vcf2zarr.convert
is taking just over 1 second - even for these tiny generated files with just a handful of variants and samples. So for the default Hypothesis setting of 200 examples it takes two or three minutes to run the test. We could lower the number of examples it generated, but do you think there's scope for reducing the conversion time?
Is it much better with worker_processes=0
?
Is it much better with
worker_processes=0
?
Yes! It takes around 30 seconds with the default number of examples. So we could probably just use that.
Should that be the default if you're not using the CLI?
The issue is that it's using a home-grown syncronous exector to do it (https://github.com/sgkit-dev/bio2zarr/blob/d1920541bf36101d5e3aa7656094901182c0e98b/bio2zarr/core.py#L82 ) which seems to work perfectly well, but is doing stuff the Python docs are explicit about saying "you should only use this for testing". I'm sure it's probably fine, I just wanted to be conservative for now.
Makes sense.
It would be good to use the Hypothesis strategy for generating VCF files that's in sgkit against vcf2zarr, to check for corner cases in conversion.
I wonder if we should move the Hypothesis VCF code to this repo, or release as a separate package (it may be of general interest)?