malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.
MIT License
14 stars 13 forks source link

Update sample_vcf_to_zarr.py #94

Closed GvandeSteeg closed 2 years ago

GvandeSteeg commented 3 years ago

Allow contigs to be read from VCF by default, preserving backwards compatibility with existing scripts

GvandeSteeg commented 3 years ago

@alimanfoo I'm not able to directly assign reviewers, would you be able to have a look?

alimanfoo commented 3 years ago

Hi @GvandeSteeg, very sorry for the slow response. IIUC this PR currently discovers the contigs by making a full scan through the VCF. This will be pretty slow, probably take longer than the VCF to zarr conversion, and you'll be repeating it for every single sample even though all samples have the same set of contigs.

The sample VCFs should have the contigs in the header, you could discover them by reading just the headers?

alimanfoo commented 3 years ago

cc @amakunin

GvandeSteeg commented 3 years ago

Hi @alimanfoo, I agree with your concern regarding the samples. Unfortunately, it doesn't appear that the contigs are in the headers of the vcfs we generate from the sorted fasta files, as they look like this:

##fileformat=VCFv4.2
##reference=/lustre/scratch118/infgen/pathdev/kp11/MALARIAGEN/INPUT_TEMPLATE_CREATION/idAnoFuneDA_1.curated_primary.fa
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

Perhaps I've been approaching this wrong. I think we still have to loop over the vcf, unfortunately, but I'll move it to a separate script so that we only have to run it once per vcf, instead of per sample

edit: This might be on us actually. Looks like we're not generating the header currently. I'll fix that asap

GvandeSteeg commented 3 years ago

I've updated our internal script that generates the VCFs from fasta files here. The test data at least already matches the VCF headers proposed, so either should work with this new solution

amakunin commented 3 years ago

@alimanfoo - can you review this? The proposed changes look sensible to me

alimanfoo commented 3 years ago

Hi all, again sorry for being slow. This looks fine, although it won't work on gzipped or bgzipped files. Do you want to support that?

Btw seeking back to the beginning of file shouldnt be needed, as file is closed after exiting context manager block.

alimanfoo commented 2 years ago

Hi @GvandeSteeg, the implementation looks good to me. There is a test module test_sample_vcf_to_zarr.py which would be worth running. It would be possible to add a test function to check behaviour when no --contig is provided, but up to you if you want to do that or would rather merge so you can move forward with other things, as you prefer.

GvandeSteeg commented 2 years ago

Alright, thanks for that! I've modified the test to test this new functionality and made the necessary changes to the fixtures as needed.

Docker logs:

docker run -it alistair python -- test*
[vcf_to_zarr] 2 rows in 0.00s; chunk in 0.00s (423 rows/s)
[vcf_to_zarr] all done (86 rows/s)
[vcf_to_zarr] 6 rows in 0.00s; chunk in 0.00s (1902 rows/s)
[vcf_to_zarr] all done (307 rows/s)

.[vcf_to_zarr] 2 rows in 0.01s; chunk in 0.01s (374 rows/s)
[vcf_to_zarr] all done (80 rows/s)
[vcf_to_zarr] 6 rows in 0.00s; chunk in 0.00s (1850 rows/s)
[vcf_to_zarr] all done (291 rows/s)
[vcf_to_zarr] all done (0 rows/s)

.[vcf_to_zarr] 2 rows in 0.00s; chunk in 0.00s (459 rows/s)
[vcf_to_zarr] all done (85 rows/s)
[vcf_to_zarr] 6 rows in 0.00s; chunk in 0.00s (1799 rows/s)
[vcf_to_zarr] all done (273 rows/s)

.
----------------------------------------------------------------------
Ran 3 tests in 5.153s

OK

If you would be so kind to officially Approve then we can close this PR :)

alimanfoo commented 2 years ago

Brilliant, thanks so much @GvandeSteeg!