sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0
23 stars 5 forks source link

Hypothesis testing for vcf2zarr #249

Closed tomwhite closed 3 weeks ago

tomwhite commented 1 month ago

It would be good to use the Hypothesis strategy for generating VCF files that's in sgkit against vcf2zarr, to check for corner cases in conversion.

I wonder if we should move the Hypothesis VCF code to this repo, or release as a separate package (it may be of general interest)?

jeromekelleher commented 1 month ago

I'd be very happy to move it into this repo, but maybe it is actually something of more general use? How much trouble would packaging it separately be?

tomwhite commented 1 month ago

Here's a branch that uses hypothesis-vcf to generate VCFs: https://github.com/sgkit-dev/bio2zarr/compare/main...tomwhite:bio2zarr:hypothesis-vcf-tests

It's been passing for ~1000s of generated examples, which gives me confidence that vcf2zarr is handling lots of edge cases. But I just ran it again and it found a failing VCF which needs looking into. Perhaps we should run it as a separate GitHub Action workflow, or maybe even manually for the moment.

I've also had to modify the VCF generating code (currently in sgkit), so that's probably not quite ready to release separately yet either.

tomwhite commented 1 month ago

I'd be very happy to move it into this repo, but maybe it is actually something of more general use? How much trouble would packaging it separately be?

I think it would be useful generally, and could be listed on https://hypothesis.readthedocs.io/en/latest/strategies.html. It would need minimal packaging and just a README for documentation I think.

tomwhite commented 1 month ago

I've moved the hypothesis-vcf code into its own repository at https://github.com/tomwhite/hypothesis-vcf.

if that looks OK, I'd like to move it under https://github.com/sgkit-dev.

jeromekelleher commented 1 month ago

LGTM - I think it would be a great addition to sgkit-dev

tomwhite commented 4 weeks ago

The hypothesis-vcf code is now in https://github.com/sgkit-dev/hypothesis-vcf.

Thanks for fixing #251 @jeromekelleher. I've rebased and rerun the code in my branch at https://github.com/tomwhite/hypothesis-vcf and it hasn't found any more problems.

What do you think the next step is? Have a CI job that runs just the hypothesis tests once a day?

jeromekelleher commented 4 weeks ago

What do you think the next step is? Have a CI job that runs just the hypothesis tests once a day?

I'm not sure this would do anything different to just adding a hypothesis job as part of normal CI. If we tune it to run for < 30 seconds and it runs with a different seed each time, it shouldn't get in the way and give us good coverage. We're not expecting it to break, so shouldn't lead to noise for contributors.

tomwhite commented 4 weeks ago

I just had a quick look at the timings, and each call to vcf2zarr.convert is taking just over 1 second - even for these tiny generated files with just a handful of variants and samples. So for the default Hypothesis setting of 200 examples it takes two or three minutes to run the test. We could lower the number of examples it generated, but do you think there's scope for reducing the conversion time?

jeromekelleher commented 4 weeks ago

Is it much better with worker_processes=0?

tomwhite commented 4 weeks ago

Is it much better with worker_processes=0?

Yes! It takes around 30 seconds with the default number of examples. So we could probably just use that.

tomwhite commented 4 weeks ago

Should that be the default if you're not using the CLI?

jeromekelleher commented 4 weeks ago

The issue is that it's using a home-grown syncronous exector to do it (https://github.com/sgkit-dev/bio2zarr/blob/d1920541bf36101d5e3aa7656094901182c0e98b/bio2zarr/core.py#L82 ) which seems to work perfectly well, but is doing stuff the Python docs are explicit about saying "you should only use this for testing". I'm sure it's probably fine, I just wanted to be conservative for now.

tomwhite commented 4 weeks ago

Makes sense.