Closed by alimanfoo 2 years ago
Just to add, I've run this on Terra with success and it does go a lot faster: from 58 minutes down to 7.
@gbggrant I'm getting a load of WDL validation errors from CI here which don't look related to any of the changes in this PR. Do you know what's happening by any chance?
@alimanfoo you added the new parameter input_vcf_index to VcfToZarr. That method is also used in the ReadBackedPhasing pipeline - I've included the parameter there (where I don't think it's used since we don't take that code path). This has fixed the miniwdl error (but dang, it does spew a lot of warnings). I'm going to ask @nikellepetrillo to look at this PR too - just so that she is aware of the change to Phasing/ReadBackedPhasing.
Thanks a lot @gbggrant for taking a look and fixing up the parameters in the phasing pipeline. I think you're right that we don't actually take that code path currently; rather, we provide the SNP genotypes in zipped zarr format as input. Still, it's good to leave the VCF option open.
@nikellepetrillo how is this looking to you?
@alimanfoo this all looks good to me. Ty @gbggrant for adding this new input to the phasing workflow!
Thanks @gbggrant and @nikellepetrillo :pray:
Running the SNP genotyping pipeline on Terra, I noticed that the VcfToZarr task is taking much longer than expected.
Looking at the logs, parsing the first chunk takes a long time for some contigs. This would be expected if no index file is being used, because all preceding rows then have to be scanned to find the first row for a given contig.
This PR modifies the VcfToZarr task to include the VCF index file as an input, and to use the bgzipped VCF file.
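To illustrate why a missing index is so costly, here is a toy Python sketch. It is not the tabix index format and not the code in the VcfToZarr task: the contig names, row counts and helper functions (`build_index`, `first_row_by_scan`, `first_row_by_index`) are all made up for illustration. It only shows the general idea that, without an index, reaching a later contig means reading every row before it, whereas with an index the reader can seek straight to the first row.

```python
import io


def build_index(handle):
    """Toy index: byte offset of the first data row of each contig."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#"):
            continue  # skip header lines
        contig = line.split(b"\t", 1)[0].decode()
        index.setdefault(contig, offset)
    return index


def first_row_by_scan(handle, contig):
    """No index: read data rows from the top until the contig appears."""
    handle.seek(0)
    rows_read = 0
    for line in handle:
        if line.startswith(b"#"):
            continue
        rows_read += 1
        if line.split(b"\t", 1)[0].decode() == contig:
            return line, rows_read
    return None, rows_read


def first_row_by_index(handle, index, contig):
    """With an index: seek directly to the contig's first row."""
    handle.seek(index[contig])
    return handle.readline(), 1


# Hypothetical file: three made-up contigs with 1000 rows each.
lines = [b"##fileformat=VCFv4.2\n", b"#CHROM\tPOS\n"]
for name in ("contigA", "contigB", "contigC"):
    for pos in range(1, 1001):
        lines.append(f"{name}\t{pos}\n".encode())
data = io.BytesIO(b"".join(lines))

index = build_index(data)
scan_row, scan_rows = first_row_by_scan(data, "contigC")
index_row, index_rows = first_row_by_index(data, index, "contigC")
print(scan_rows, index_rows)  # 2001 1
```

Both lookups return the same row, but the scan has to read through all of the rows for the two earlier contigs first, which matches the slow first chunk seen in the logs for some contigs.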
Also, the contig options were previously hard-coded, which only works for An. gambiae. I removed the contig parameters from the VcfToZarr task, so contigs should now be discovered from the input VCF.
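As a rough sketch of what discovering contigs from the input VCF could look like, the Python snippet below reads the `##contig=<ID=...>` header lines of a gzip-compressed VCF. This is an assumption about one possible approach, not the logic the VcfToZarr task actually uses. The function name `contigs_from_header`, the file name and the contig names are hypothetical. Bgzipped files are readable with Python's standard `gzip` module, so the toy file here is written with plain gzip for simplicity.

```python
import gzip
import os
import re
import tempfile

# Matches the ID field inside a ##contig=<...> header line.
CONTIG_ID = re.compile(r"[<,]ID=([^,>]+)")


def contigs_from_header(path):
    """Return contig IDs declared in the VCF header, in file order."""
    contigs = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # reached the #CHROM line, so the meta header is over
            if line.startswith("##contig="):
                match = CONTIG_ID.search(line)
                if match:
                    contigs.append(match.group(1))
    return contigs


# Made-up toy VCF with two declared contigs.
toy_vcf = (
    "##fileformat=VCFv4.2\n"
    "##contig=<ID=contigA,length=1000>\n"
    "##contig=<ID=contigB,length=2000>\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "contigA\t1\t.\tA\tT\t.\t.\t.\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.vcf.gz")
    with gzip.open(path, "wt") as out:
        out.write(toy_vcf)
    discovered = contigs_from_header(path)

print(discovered)  # ['contigA', 'contigB']
```

Note that `##contig` header lines are optional in VCF, so a header-based approach like this one only works when the input file declares its contigs.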