malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.
MIT License

Cohort VCF to Zarr #72

Closed alimanfoo closed 3 years ago

alimanfoo commented 3 years ago

This PR adds a cohort_vcf_to_zarr script for converting a multi-sample VCF file to Zarr. Example usage with a VCF from the output of the phasing pipeline:

$ python cohort_vcf_to_zarr.py \
       --input fixture/phased.vcf \
       --output output/phased.zarr \
       --contig 2R \
       --field variants/POS \
       --field variants/REF:S1 \
       --field variants/ALT:S1 \
       --field variants/AC \
       --field variants/AF \
       --field variants/CM \
       --field calldata/GT \
       --alt-number 1
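For readers unfamiliar with the `--field NAME[:DTYPE]` convention in the command above, here is a minimal sketch of how such flags could be parsed. It is not the script from this PR: `parse_field_specs` and `build_parser` are hypothetical helpers, and treating the suffix after the colon (e.g. `S1`) as a dtype override is an assumption based on the example.

```python
import argparse


def parse_field_specs(specs):
    """Split 'name[:dtype]' specs into a field list and a dtype mapping.

    Hypothetical helper: assumes the text after ':' is a dtype override.
    """
    fields = []
    types = {}
    for spec in specs:
        name, sep, dtype = spec.partition(":")
        fields.append(name)
        if sep:  # a dtype suffix such as ':S1' was supplied
            types[name] = dtype
    return fields, types


def build_parser():
    """Build a parser mirroring the flags shown in the usage example."""
    parser = argparse.ArgumentParser(description="Cohort VCF to Zarr (sketch).")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--contig", required=True)
    # --field may be repeated; each occurrence is appended to a list.
    parser.add_argument("--field", dest="fields", action="append", default=[])
    parser.add_argument("--alt-number", type=int, required=True)
    return parser


example_args = build_parser().parse_args([
    "--input", "fixture/phased.vcf",
    "--output", "output/phased.zarr",
    "--contig", "2R",
    "--field", "variants/POS",
    "--field", "variants/REF:S1",
    "--field", "variants/ALT:S1",
    "--field", "variants/AC",
    "--field", "variants/AF",
    "--field", "variants/CM",
    "--field", "calldata/GT",
    "--alt-number", "1",
])
fields, types = parse_field_specs(example_args.fields)
print(fields)
print(types)
```

If the conversion is delegated to scikit-allel (an assumption, the thread does not say), the resulting `fields`, `types` and `alt_number` values would map naturally onto the same-named keyword arguments of `allel.vcf_to_zarr`.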

This PR also moves the sample_vcf_to_zarr script into its own directory, for consistency of file organisation.

Work towards #44.

gbggrant commented 3 years ago

Could you add an option to zip up the zarr? I think it's easier to output single (zip) files from a cromwell workflow than globbing the zarr tree structure.
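To make the request concrete, a single-file output could be produced roughly as follows. This is only a sketch using the Python standard library, not an option of the script in this PR; `zip_zarr_tree` is a hypothetical helper. Entries are stored without zip compression on the assumption that the Zarr chunks are already compressed, and paths are written relative to the Zarr root so the archive mirrors the directory tree.

```python
import zipfile
from pathlib import Path


def zip_zarr_tree(zarr_dir, zip_path):
    """Pack every file under zarr_dir into one uncompressed zip archive.

    Hypothetical helper. zip_path should live outside zarr_dir so the
    archive does not try to include itself.
    """
    zarr_dir = Path(zarr_dir)
    with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(zarr_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the Zarr root, with '/' separators.
                zf.write(path, arcname=path.relative_to(zarr_dir).as_posix())
    return zip_path
```

For example, `zip_zarr_tree("output/phased.zarr", "output/phased.zarr.zip")` would give a workflow a single file to declare as its output, at the cost of an extra unpacking step before the callset can be used as a directory tree.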

alimanfoo commented 3 years ago

> Could you add an option to zip up the zarr? I think it's easier to output single (zip) files from a cromwell workflow than globbing the zarr tree structure.

Hi @gbggrant, we would actually rather not zip up the Zarr. A larger multi-sample callset like this is much easier to work with unzipped, and we will likely copy it straight up to GCS as-is. Is there a workaround to make Cromwell happy with this kind of output?

P.S., season's greetings.

gbggrant commented 3 years ago

@alimanfoo I'll have another look at this. I'm doing something wrong right now, so I wasn't getting anything with my glob.

Have a peaceful holiday!