malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.

MIT License

Optimise GCP VcfToZarr and generalise to handle other vector species with different contigs #100

Closed (alimanfoo closed this 2 years ago)

alimanfoo commented 2 years ago

While running the SNP genotyping pipeline on Terra, I noticed that the VcfToZarr task is taking much longer than expected.

Looking at the logs, parsing the first chunk takes a long time for some contigs. This would be expected if no index file is being used, because all preceding rows then have to be scanned to find the first row for a given contig.
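To illustrate the suspected cause, here is a toy sketch in Python of the difference between scanning for a contig's first row and jumping to it via an index. It is not the tabix index format and not the pipeline's code; the row data, contig names, and function names are all made up for illustration.

```python
import io

# Toy VCF body: rows sorted by contig, as in a coordinate-sorted VCF.
# Contig names and positions are illustrative only.
ROWS = [("2L", 100), ("2L", 200), ("2R", 50), ("3L", 10), ("X", 7)]
TEXT = "".join(f"{contig}\t{pos}\t.\tA\tT\n" for contig, pos in ROWS)


def first_row_by_scan(text, contig):
    """Without an index: read rows from the start until the contig appears."""
    rows_read = 0
    for line in io.StringIO(text):
        rows_read += 1
        if line.split("\t", 1)[0] == contig:
            return line, rows_read
    return None, rows_read


def build_index(text):
    """Record the character offset of the first row of each contig."""
    index = {}
    offset = 0
    for line in io.StringIO(text):
        index.setdefault(line.split("\t", 1)[0], offset)
        offset += len(line)
    return index


def first_row_by_index(text, index, contig):
    """With an index: seek straight to the contig's first row."""
    handle = io.StringIO(text)
    handle.seek(index[contig])
    return handle.readline(), 1


INDEX = build_index(TEXT)
scan_line, scan_rows = first_row_by_scan(TEXT, "X")
index_line, index_rows = first_row_by_index(TEXT, INDEX, "X")
```

In this toy, the scan has to read every preceding row before it reaches the last contig, whereas the indexed lookup reads one row regardless of where the contig sits in the file. With a real VCF the number of skipped rows is far larger, which would be consistent with the slow first chunk seen in the logs.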

This PR modifies the VcfToZarr task to include the VCF index file as an input, and to use the bgzipped VCF file.

Also, the contig options were previously hard-coded, which only works for An. gambiae. I removed the contig parameters from the VcfToZarr task, so contigs should now be discovered from the input VCF.
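As a rough sketch of what discovering contigs from the input could look like, the Python below reads contig IDs from the `##contig=<ID=...>` header lines defined by the VCF format. This is only one possible approach and not necessarily how the VcfToZarr task does it; the function name and the example header (including its contig names and lengths) are hypothetical.

```python
import re

# Matches VCF header lines of the form ##contig=<ID=...,length=...>
CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")


def contigs_from_header(lines):
    """Return contig IDs in header order, stopping at the first data row.

    Simplification: assumes no quoted values containing commas.
    """
    contigs = []
    for line in lines:
        if not line.startswith("#"):
            break  # header is over, data rows begin
        match = CONTIG_RE.match(line)
        if match is None:
            continue
        fields = dict(
            item.split("=", 1) for item in match.group(1).split(",") if "=" in item
        )
        if "ID" in fields:
            contigs.append(fields["ID"])
    return contigs


# Hypothetical header for a species whose contigs differ from An. gambiae.
EXAMPLE_HEADER = [
    "##fileformat=VCFv4.2\n",
    "##contig=<ID=2RL,length=1000>\n",
    "##contig=<ID=3RL,length=2000>\n",
    "##contig=<ID=X,length=500>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "2RL\t1\t.\tA\tT\t.\t.\t.\n",
]

discovered = contigs_from_header(EXAMPLE_HEADER)
```

Because the contig list comes from the file itself, the same code works for any species without a hard-coded list.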

alimanfoo commented 2 years ago

Just to add, I've run this on Terra successfully and it does go a lot faster, from 58 minutes down to 7.

alimanfoo commented 2 years ago

@gbggrant I'm getting a load of WDL validation errors from CI here which don't look related to any of the changes in this PR. Do you know what's happening by any chance?

gbggrant commented 2 years ago

@alimanfoo you added the new parameter input_vcf_index to VcfToZarr. That task is also used in the ReadBackedPhasing pipeline, so I've added the parameter there too (though I don't think it's actually used, since we don't take that code path). This has fixed the miniwdl error (but dang, it does spew a lot of warnings). I'm going to ask @nikellepetrillo to look at this PR too, just so that she is aware of the change to Phasing/ReadBackedPhasing.

alimanfoo commented 2 years ago

Thanks a lot @gbggrant for taking a look and fixing up the parameters in the phasing pipeline. I think you're right that we don't actually take that code path currently; rather, we provide the SNP genotypes in zipped zarr format as input. Still, it's good to leave the VCF option open.

@nikellepetrillo how is this looking to you?

nikellepetrillo commented 2 years ago

@alimanfoo this all looks good to me. Ty @gbggrant for adding this new input to the phasing workflow!

alimanfoo commented 2 years ago

Thanks @gbggrant and @nikellepetrillo :pray: