Closed GvandeSteeg closed 2 years ago
@alimanfoo I'm not able to directly assign reviewers, would you be able to have a look?
Hi @GvandeSteeg, very sorry for the slow response. IIUC this PR currently discovers the contigs by making a full scan through the VCF. This will be pretty slow, probably take longer than the VCF to zarr conversion, and you'll be repeating it for every single sample even though all samples have the same set of contigs.
The sample VCFs should have the contigs in the header, you could discover them by reading just the headers?
cc @amakunin
Hi @alimanfoo, I agree with your concern regarding the samples. Unfortunately, it doesn't appear that the contigs are in the headers of the vcfs we generate from the sorted fasta files, as they look like this:
##fileformat=VCFv4.2
##reference=/lustre/scratch118/infgen/pathdev/kp11/MALARIAGEN/INPUT_TEMPLATE_CREATION/idAnoFuneDA_1.curated_primary.fa
#CHROM POS ID REF ALT QUAL FILTER INFO
Perhaps I've been approaching this wrong. I think we still have to loop over the vcf, unfortunately, but I'll move it to a separate script so that we only have to run it once per vcf, instead of per sample
edit: This might be on us actually. Looks like we're not generating the header currently. I'll fix that asap
I've updated our internal script that generates the VCFs from fasta files here. The test data at least already matches the VCF headers proposed, so either should work with this new solution
@alimanfoo - can you review this? The proposed changes look sensible to me
Hi all, again sorry for being slow. This looks fine, although it won't work on gzipped or bgzipped files. Do you want to support that?
Btw seeking back to the beginning of file shouldnt be needed, as file is closed after exiting context manager block.
Hi @GvandeSteeg, the implementation looks good to me. There is a test module test_sample_vcf_to_zarr.py
which would be worth running. It would be possible to add a test function to check behaviour when no --contig
is provided, but up to you if you want to do that or would rather merge so you can move forward with other things, as you prefer.
Alright, thanks for that! I've modified the test to test this new functionality and made the necessary changes to the fixtures as needed.
Docker logs:
docker run -it alistair python -- test*
[vcf_to_zarr] 2 rows in 0.00s; chunk in 0.00s (423 rows/s)
[vcf_to_zarr] all done (86 rows/s)
[vcf_to_zarr] 6 rows in 0.00s; chunk in 0.00s (1902 rows/s)
[vcf_to_zarr] all done (307 rows/s)
.[vcf_to_zarr] 2 rows in 0.01s; chunk in 0.01s (374 rows/s)
[vcf_to_zarr] all done (80 rows/s)
[vcf_to_zarr] 6 rows in 0.00s; chunk in 0.00s (1850 rows/s)
[vcf_to_zarr] all done (291 rows/s)
[vcf_to_zarr] all done (0 rows/s)
.[vcf_to_zarr] 2 rows in 0.00s; chunk in 0.00s (459 rows/s)
[vcf_to_zarr] all done (85 rows/s)
[vcf_to_zarr] 6 rows in 0.00s; chunk in 0.00s (1799 rows/s)
[vcf_to_zarr] all done (273 rows/s)
.
----------------------------------------------------------------------
Ran 3 tests in 5.153s
OK
If you would be so kind to officially Approve
then we can close this PR :)
Brilliant, thanks so much @GvandeSteeg!
Allow contigs to be read from VCF by default, preserving backwards compatibility with existing scripts