sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0

LPL no smaller than PL #277

Open jeromekelleher opened 1 month ago

jeromekelleher commented 1 month ago

After running conversion on 1000 Genomes chr2 data, I'm not seeing any reduction in the size of PL fields. Here's the ICF:

$ vcf2zarr inspect 1kg_chr2_lpl.icf/ | grep PL
FORMAT/PL                 Integer      6906  430.58 GiB  60.6 GiB           28  0          3.2e+05
FORMAT/LPL                Integer      6906  430.58 GiB  60.57 GiB          28  -1         3.2e+05
$ vcf2zarr inspect 1kg_chr2_lpl.icf/ | grep 'LAA'
FORMAT/LAA                Integer      4534  282.07 GiB  315.18 MiB          6  -2         6

and on the VCZ (converted for a small number of variants using --max-variant-chunks):

$ vcf2zarr inspect 1kg_chr2_lpl.vcz/ | grep PL
/call_LPL                     int32    8.27 MiB    342.01 MiB   41             40  8.55 MiB      211.78 KiB          (1000, 3202, 28)  (100, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_PL                      int32    8.27 MiB    342.01 MiB   41             40  8.55 MiB      211.78 KiB          (1000, 3202, 28)  (100, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
$ vcf2zarr inspect 1kg_chr2_lpl.vcz/ | grep LAA
/call_LAA                     int8     102.69 KiB  18.32 MiB   180             40  469.04 KiB    2.57 KiB            (1000, 3202, 6)   (100, 1000, 6)   Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None

So, there's no reduction in the maximum dimension and the storage sizes are essentially identical.

Have you any ideas what might be going on here @Will-Tyler?

I think it would be really worthwhile getting some truth data for LPL that we could compare against. It does seem that getting Hail running is the only way to do this, so it's probably worth the effort.

Will-Tyler commented 1 month ago

Thanks for sharing your findings. My suspicion is that a relatively small subset of entries use a large number of alternate alleles, forcing all of the entries to have a large number of values, most of which would be fill values. I would be curious to know which samples use a lot of alternate alleles and why. If those entries have a lot of positive AD values but not a lot of positive PL values, then we could consider ignoring the AD field when creating the LAA field. If instead those entries have a lot of positive PL values, then we may want to consider a lossy approach, such as only considering the alleles listed in the genotype field when creating the LAA field.

Is the dataset that you are using public? I was having trouble finding it on the IGSR website. If you can provide a link to it, I would like to see if I can download it and evaluate it. In the meantime, I will try to find another small VCF dataset, experiment with it, and try using Hail to create some local-allele fields.

Will-Tyler commented 1 month ago

I was able to download chrY from this 1000 Genomes folder and filter for variants with many alleles. I was then able to run explode and encode, and to load the dataset using sgkit.

I found that there are many entries where the PL field is full of positive values. vcf2zarr explode's current implementation therefore counts all of the alternate alleles as observed in those entries, and then puts all of the PL values into the LPL field for them. My guess is that something similar is happening in your dataset.

Because the PL values are full of positive values, I think vcf2zarr needs to use a lossy approach to shrink the size of the PL field. One option is to just include the PL values corresponding to the alleles referenced in the GT field. This is easy to implement, and I verified that doing this results in an LPL field with shape (1164, 2504, 6) compared with the PL field's shape of (1164, 2504, 28) in my example dataset. Please let me know what you think. I would be happy to implement this change.
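For concreteness, here is a minimal numpy sketch of that lossy GT-based approach (my own illustration, not bio2zarr's implementation), using the VCF ordering of diploid genotypes, where genotype (j, k) with j <= k sits at index `k * (k + 1) // 2 + j`:

```python
import numpy as np

def genotype_index(j, k):
    # VCF ordering of diploid genotypes: (j, k) with j <= k sits at this index.
    return k * (k + 1) // 2 + j

def gt_based_local(gt, pl):
    """gt: called alleles for one sample, e.g. [0, 2]; pl: the full PL array."""
    local = sorted(set(gt) | {0})          # REF plus any allele appearing in GT
    laa = [a for a in local if a > 0]      # LAA lists the local ALT alleles
    # LPL keeps only the PLs of genotypes built from local alleles.
    lpl = [
        pl[genotype_index(local[j], local[k])]
        for k in range(len(local))
        for j in range(k + 1)
    ]
    return laa, lpl

# Example: 3 ALT alleles give 10 PL values; a 0/2 call keeps just 3 of them.
print(gt_based_local([0, 2], np.arange(10)))  # ([2], [0, 3, 5])
```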


I also loaded the chrY VCF into Hail and converted it to a Variant Dataset using transform_gvcf. The resulting VDS appears to keep all of the LPL values for most of the entries; the entries I see with values have all 28 of them. For example,

+------------------------------------------------------------------------------+
| 'HG00096'.LPL                                                                |
+------------------------------------------------------------------------------+
| array<int32>                                                                 |
+------------------------------------------------------------------------------+
| [0,36,540,36,540,540,36,540,540,540,36,540,540,540,540,36,540,540,540,540... |
| [0,18,270,18,270,270,18,270,270,270,18,270,270,270,270,18,270,270,270,270... |
| [349,349,349,24,24,0,349,349,24,349,349,349,24,349,349,349,349,24,349,349... |
| [0,33,382,33,382,382,33,382,382,382,33,382,382,382,382,33,382,382,382,382... |
| [0,0,242,0,242,242,0,242,242,242,0,242,242,242,242,0,242,242,242,242,242,... |
| [0,0,119,0,119,119,0,119,119,119,0,119,119,119,119,0,119,119,119,119,119,... |
| [0,28,610,28,610,610,28,610,610,610,28,610,610,610,610,28,610,610,610,610... |
| [0,0,20,0,20,20,0,20,20,20,0,20,20,20,20,0,20,20,20,20,20,0,20,20,20,20,2... |
| [0,0,117,0,117,117,0,117,117,117,0,117,117,117,117,0,117,117,117,117,117,... |
| [330,330,330,330,330,330,330,330,330,330,30,30,30,30,0,330,330,330,330,30... |
+------------------------------------------------------------------------------+

The Hail code that creates the local-allele fields does not appear to be doing anything sophisticated. For example, for the LA field, it just creates a range of numbers based on the allele count instead of determining which alleles are observed in the entry.
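In other words (a paraphrase of the behaviour described above, not Hail's actual code):

```python
# Every allele is treated as local, so LA is just 0..n_alleles-1
# for each entry, with no per-entry filtering.
def make_la(n_alleles):
    return list(range(n_alleles))
```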

jeromekelleher commented 1 month ago

That's excellent @Will-Tyler, well done! If all 28 values are non-zero, then there's not a great deal we can do, as you say. I don't know how useful these values are, or whether anyone really uses them, but that's a different story.

I'm surprised that this approach is so ineffective, when the SVCR paper pushes it so hard (and people have gone to so much trouble to get it into the VCF standard).

I've been looking at data from the same pipeline as you (1000 Genomes, 2020), so maybe it's a property of the variant caller used there? Maybe different programs that output PLs are more discriminating? I've had a quick look and it seems that 1000 Genomes phase 1 and phase 3 didn't use PL values.

So, maybe we need to figure out which variant callers emit PLs and see if we can track down a dataset that has them?

Will-Tyler commented 1 month ago

From the VCF file, the variant caller appears to be GATK. I thought more about this and did some reading on variant calling today. First, after reading this GATK article on PL calculation, it is no longer surprising to me that the PL values are mostly positive with some zeroes. PL is $-10 \log_{10} P(\text{genotype} \mid \text{sequencing data})$, normalized by subtracting the smallest PL value from all of the values so that the smallest is always zero. Genotypes with a PL of zero are the most likely genotypes in the entry, and the positive PL values correspond to less likely genotypes (the higher the PL, the less likely the corresponding genotype). Thus, the called genotype usually has a PL value of zero, while the other, less likely genotypes have positive PL values.
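To make the normalization concrete, here is a toy example (the probabilities are made up, not from any real caller):

```python
import numpy as np

# Assumed genotype probabilities for 0/0, 0/1 and 1/1 at one entry.
probs = np.array([0.9, 1e-4, 1e-6])
raw = -10 * np.log10(probs)                  # Phred scale: [0.46, 40.0, 60.0]
pl = np.round(raw - raw.min()).astype(int)   # normalise so the best genotype is 0
print(pl)  # [ 0 40 60]
```

The called genotype (0/0 here) ends up with PL 0, and every other genotype gets a positive value.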

Perhaps vcf2zarr can only consider PL values above (less likely than) a certain threshold (such as 60, which corresponds to a probability of 1 in a million) when creating the LAA and LPL fields. Alternatively, vcf2zarr could rely on other fields such as AD to determine which alleles should be considered local. It seems reasonable to me to ignore PL values for genotypes that have an allele with zero statistically-significant reads (according to the AD field).
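A sketch of how both of those selection rules might look (illustrative only: the function, the diploid-only assumption, and the threshold default are all mine, not bio2zarr's):

```python
def local_alt_alleles(ad, pl, n_alt, pl_threshold=60):
    """Return the ALT alleles selected by an AD rule and by a PL rule."""
    # Rule 1: an ALT allele is local if it has at least one supporting read.
    by_ad = {a for a in range(1, n_alt + 1) if ad[a] > 0}
    # Rule 2: an ALT allele is local if at least one genotype containing it
    # has a PL below the threshold, i.e. is not wildly unlikely.
    by_pl = set()
    idx = 0
    for k in range(n_alt + 1):          # VCF diploid genotype ordering:
        for j in range(k + 1):          # (0,0), (0,1), (1,1), (0,2), ...
            if pl[idx] < pl_threshold:
                by_pl |= {j, k}
            idx += 1
    by_pl.discard(0)                    # REF is implicitly always local
    return sorted(by_ad), sorted(by_pl)

# Example: two ALT alleles; reads support only ALT 1, and every genotype
# involving ALT 2 is very unlikely -- both rules keep just ALT 1.
print(local_alt_alleles(ad=[10, 5, 0], pl=[50, 0, 90, 120, 130, 200], n_alt=2))
# ([1], [1])
```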


jeromekelleher commented 1 month ago

Thanks for doing the background reading @Will-Tyler, that's very helpful.

It seems reasonable to me to ignore PL values for genotypes that have an allele with zero statistically-significant reads (according to the AD field).

I agree. I guess a first pass would be to only count alleles with AD>0, and see if this gets us any gains?

tomwhite commented 1 month ago

Is it worth making --no-local-alleles the default for the moment?

Will-Tyler commented 1 month ago

I guess a first pass would be to only count alleles with AD>0, and see if this gets us any gains?

Good idea, let me try this. There are 346 entries where all of the AD values are positive, so the LAA field and the LPL field have the same shapes as before ((1164, 2504, 6) and (1164, 2504, 28), respectively). But the storage size is smaller in ICF:

vcf2zarr inspect data/1kg_chrY_limited.icf | grep PL
FORMAT/PL                 Integer         7  311.46 MiB  20.11 MiB          28  0          4.6e+04
FORMAT/LPL                Integer         4  75.81 MiB   10.33 MiB          28  -1         4.6e+04

And smaller in VCF Zarr as well:

vcf2zarr inspect data/1kg_chrY_limited.vcz | grep PL
/call_PL                      int32    21.22 MiB   311.32 MiB   15              3  103.77 MiB    7.07 MiB            (1164, 2504, 28)  (10000, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_LPL                     int32    13.17 MiB   311.32 MiB   24              3  103.77 MiB    4.39 MiB            (1164, 2504, 28)  (10000, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None

Presumably, fewer alleles are being counted as local, so fewer of the PL values are copied over and the fill value is used instead.

jeromekelleher commented 1 month ago

That is a reasonable saving all right. Hardly seems worth all the effort though, if I'm honest, as it's the dimensions of the PL field that cause problems more than the overall storage size.

Truncating to a max of 127 seems entirely reasonable anyway, and since the values then fit in int8 rather than int32, that'll reduce our storage 4-fold here. Spending all that space just to encode precisely how unlikely the variant caller thinks a combination of alleles with no reads is seems pretty silly.
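The truncation itself would be a one-liner along these lines (a sketch; the array names are hypothetical):

```python
import numpy as np

call_PL = np.array([0, 36, 540, 36, 540], dtype=np.int32)
# Clip at 127 so the values fit in int8: 1 byte per value instead of 4.
call_PL_small = np.minimum(call_PL, 127).astype(np.int8)  # [0, 36, 127, 36, 127]
```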

What happens if we truncate to just keep zero and the next-most likely value?

I must do some reading to figure out how people use PLs...

tomwhite commented 1 month ago

Not sure if you are aware of this, but I found this page, which mentions that bcftools has some support for local alleles, namely in the merge command and the tag2tag plugin.

Will-Tyler commented 1 month ago

Thank you for sharing this @tomwhite! The tag2tag plugin does not support localization yet (see here), but the merge command has some interesting functionality that I had not considered before. I played around with the merge command using the --force-single and --local-alleles options.

The user tells the merge command to use a certain number of local alleles. The merge command then decides which alleles to include by calculating, for each alternate allele, the sum of the probabilities of the genotypes that reference that allele. It uses the alleles with the highest genotype-probability sums as the local alleles. (Source: init_local_alleles.)
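A hedged paraphrase of that scoring rule (see init_local_alleles for the real implementation; the function below is my own diploid-only sketch):

```python
import numpy as np

def top_local_alleles(pl, n_alt, n_local):
    """Rank ALT alleles by the summed probabilities of genotypes using them."""
    probs = 10.0 ** (-np.asarray(pl, dtype=float) / 10)  # undo Phred scaling
    scores = np.zeros(n_alt + 1)
    idx = 0
    for k in range(n_alt + 1):          # VCF genotype ordering, as before
        for j in range(k + 1):
            scores[j] += probs[idx]
            if j != k:
                scores[k] += probs[idx]
            idx += 1
    # Keep the n_local ALT alleles with the highest scores.
    alts = sorted(range(1, n_alt + 1), key=lambda a: -scores[a])
    return sorted(alts[:n_local])

# Example: two ALT alleles; the genotypes involving ALT 2 are far likelier.
print(top_local_alleles([90, 60, 90, 0, 30, 10], n_alt=2, n_local=1))  # [2]
```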

The merge command's approach only uses the PL field and the user's input to determine which alleles to make local. This approach also controls the shape of the output data. If we decide to implement the same approach, we can test against the merge command's implementation. If we want to adopt this approach in bio2zarr, I would certainly be happy to work on it.