sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0

SVCR Local alleles #185

Open jeromekelleher opened 4 months ago

jeromekelleher commented 4 months ago

We need to implement the "local alleles" approach to deal with overblown PL fields.

See discussion: https://github.com/sgkit-dev/vcf-zarr-publication/issues/5

Will-Tyler commented 2 months ago

I am interested in working on this task! I looked into this some, and the local-alleles approach makes sense to me. For each variant site, we store the list of global alleles: all the alleles seen across the samples at that site. For each variant and sample, we add a field called local_alleles that indicates which of the global alleles were found in that sample. We can then store the local PL values for just the genotypes formed by the local alleles.

Some of the datasets in the unit tests in test_vcf_examples.py have PL values, so I can use those as examples.

I assume the VCF Zarr spec should be updated if this approach is implemented.
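The idea above can be sketched in a few lines (a toy illustration with made-up data; none of these names are bio2zarr APIs):

```python
# Toy sketch of the local-alleles idea: for each sample, keep only the
# indices of the site-level (global) alleles that the sample carries.
# Names and data here are hypothetical, not bio2zarr code.

alleles = ["A", "C", "T"]            # global alleles at one site
genotypes = {"S1": [1, 1], "S2": [0, 2]}

local = {
    sample: sorted(set(gt))          # indices into the global allele list
    for sample, gt in genotypes.items()
}
print(local)  # {'S1': [1], 'S2': [0, 2]}
```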

jeromekelleher commented 2 months ago

That would be AMAZING @Will-Tyler!

I think the best first step would be to work up an example based on Figure 3b and 3c in the preprint. We would just want the three sites in the original VCF, and use the local alleles values (and LPL) from that example. Maybe a good start here would be to write out the values for the last variant (POS=3350) explicitly here, so we can understand what's going on, and then use these as the basis for a test case afterwards?

Will-Tyler commented 2 months ago

Thanks for the suggestion on how to get started. The pVCF representation in the example (3b) is:

```
#CHROM  POS   ID  REF  ALT  QUAL  FILTER  INFO  FORMAT          01                                       02
chr1    3331  .   T    G    .     .       .     GT:DP:GQ:AD:PL  0/1:10:99:6,4:100,0,105                  0/0:9:99:9,0:0,100,200
chr1    3335  .   G    C    .     .       .     GT:DP:GQ:AD:PL  0/0:11:99:11,0:0,100,200                 1/1:7:22:0,7:154,22,0
chr1    3350  .   A    C,T  .     .       .     GT:DP:GQ:AD:PL  1/1:15:55:0,15,0:1002,55,1002,0,55,1002  0/2:12:99:7,0,5:154,154,0,154,102,102
```

The last variant site in the SVCR representation (3c) is:

```
#CHROM  POS   ID  REF  ALT   QUAL  FILTER  INFO  FORMAT                    01                              02
chr1    3350  .   A    C,T,  .     .       .     LA:LGT:DP:GQ:LAD:LPL:END  0,1:1/1:15:55:0,15:1002,55,0:.  0,2:0/1:12:99:7,5:154,0,102:.
```

Using just the LA and LPL values (and not using the reference blocks), the representation becomes:

```
#CHROM  POS   ID  REF  ALT  QUAL  FILTER  INFO  FORMAT              01                              02
chr1    3350  .   A    C,T  .     .       .     LA:GT:DP:GQ:AD:LPL  0,1:1/1:15:55:0,15,0:1002,55,0  0,2:0/2:12:99:7,0,5:154,0,102
```
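The PL-to-LPL remapping can be sketched as follows. This is a minimal illustration using the VCF genotype-ordering formula for unphased diploid calls (the index of genotype j/k with j ≤ k is k(k+1)/2 + j); the helper names and the PL values are made up for clarity, not taken from the figure or from bio2zarr:

```python
# Hypothetical helpers (not bio2zarr APIs): derive local PL values from a
# global PL vector, given the local-allele indices, for a diploid call.

def genotype_index(j, k):
    """VCF ordering of unphased diploid genotypes: index of j/k, j <= k."""
    return k * (k + 1) // 2 + j

def local_pl(pl, la):
    """Remap a global PL vector onto the genotypes of the local alleles."""
    out = []
    for b in range(len(la)):
        for a in range(b + 1):
            out.append(pl[genotype_index(la[a], la[b])])
    return out

# Synthetic site with alleles A,C,T (0,1,2); global PL is ordered
# 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
pl = [50, 30, 0, 60, 40, 90]
print(local_pl(pl, la=[0, 1]))  # genotypes 0/0, 0/1, 1/1 -> [50, 30, 0]
print(local_pl(pl, la=[0, 2]))  # genotypes 0/0, 0/2, 2/2 -> [50, 60, 90]
```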

My understanding is that the LA field holds the indices of the alleles with nonzero AD values, which will usually be at most two alleles for a diploid sample but could technically be more. In other words, the source of truth for the LA field is the AD field.

If the source of truth for the LA field were the GT field, which contains at most two alleles, then the LA field would seem to be a redundant encoding of the GT field.

Will-Tyler commented 2 months ago

The local alleles approach was incorporated into the draft VCF 4.5 specification in this pull request. The draft specification mandates that if any local-allele fields are present, then the LAA field must be present. The LAA field is a list of the one-based indices of the alternate alleles observed in the sample.

It seems to me that a good next step would be to add functionality so that bio2zarr infers which alternate alleles are observed in each sample, based on that sample's GT and AD fields, and creates an LAA field during the explode step.
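A sketch of that inference for a single call (`laa_from_call` is a hypothetical name, and the exact GT/AD handling is an assumption based on the discussion above, not existing bio2zarr behaviour):

```python
# Hypothetical sketch: infer LAA (one-based alternate-allele indices) for
# one sample from its GT and AD values. Not an existing bio2zarr function.

def laa_from_call(gt, ad):
    """Alternate alleles called in GT or with nonzero depth in AD."""
    observed = {a for a in gt if a > 0}
    observed.update(i for i, depth in enumerate(ad) if i > 0 and depth > 0)
    return sorted(observed)

# REF A, ALT C,T; call 0/2 with AD 7,0,5 -> only alt allele 2 is observed.
print(laa_from_call(gt=[0, 2], ad=[7, 0, 5]))  # -> [2]
```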

Will-Tyler commented 2 months ago

Note that bcftools contains a tag2tag plugin that lets the user interconvert between global fields and local-allele fields in VCF files.

With tag2tag available, is it still desirable to implement a local-alleles approach in bio2zarr? Or would it suffice to convert the input VCF file to local-allele fields using bcftools before running bio2zarr? If this functionality is needed in bio2zarr, should it be its own command (for example, bio2zarr localize)? Does the conversion need to happen in the input VCF files, in the output VCF Zarr files, or do both need to be supported?

jeromekelleher commented 2 months ago

This is super helpful, thanks @Will-Tyler! Here's how I think it should work.

By default, if we see a PL field in the VCF (and perhaps, later, any other Number=G field), we create call_LA and call_LPL fields instead of call_PL. We skip this if a no-local-alleles argument is passed to vcf2zarr encode or vcf2zarr mkschema. I think we can do all the actual translation work at encode time: we have a specialised method for dealing with the call_LA and call_LPL fields, just like we do for call_genotype. For simplicity, we can read the genotypes in again from the ICF (rather than complicating the existing method for encoding genotypes).

The reason for doing it this way is that call_PL fields can get very big and consume a lot of memory, so we want to avoid ever creating them if we don't have to. Since LA/LPL are now part of the VCF spec, I think doing the localisation by default is the correct thing (few people use PLs anyway, so there's no point wasting space on a particularly inefficient encoding of them). A similar argument holds for doing the conversion in bio2zarr rather than relying on the tag2tag plugin: running tag2tag on, e.g., the GEL chr2 dataset would take weeks and consume hundreds of terabytes of extra space.
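For context on why PL gets so big: a diploid Number=G field stores one value per possible genotype, and the genotype count grows quadratically with the number of alleles at a site. A quick back-of-the-envelope check:

```python
# Unordered diploid genotypes over n alleles: n * (n + 1) / 2, so a
# Number=G field like PL grows quadratically in the allele count.

def n_genotypes(n_alleles):
    return n_alleles * (n_alleles + 1) // 2

for n in (2, 10, 100, 1000):
    print(n, n_genotypes(n))
# 1000 alleles at a site means 500500 PL values per sample, most of
# which are irrelevant to any one sample's two local alleles.
```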

For testing, the results of running our code on the original VCF should be identical to the results of running vcf2zarr convert on the output of tag2tag applied to a PL-containing file.

Does this make sense?

Will-Tyler commented 2 months ago

Overall, I think this makes sense and gives me good direction. When I was looking into this though, I thought it would be ideal to implement the global-to-local conversion during the explode step to save space in storing the ICF data. Glancing at it, it seems that the ICF writer could compute the LAA and LPL values in the process_partition method. Can I ask why it would be better to do the transformation during the encode step?

The testing approach is a good idea—thanks for this suggestion!

jeromekelleher commented 2 months ago

Actually you're right, it should be done during explode. Are you fairly clear on next steps or would you like some pointers? You've picked up the code base very quickly, I'm impressed!

Will-Tyler commented 2 months ago

Thanks! 😄

I think I should be able to make progress from here. My plan is to start by creating the LAA field during the explode step. I'll probably create a pull request for this part. From there, I don't think it will be too difficult to compute the LPL field using the LAA and the PL fields.