jeromekelleher opened 1 month ago
Thanks for sharing your findings. My suspicion is that a relatively small subset of entries uses a large number of alternate alleles, forcing all of the entries to carry a large number of values, most of which would be fill values. I would be curious to know which samples are using a lot of alternate alleles and why. If those entries have many positive AD values but few positive PL values, then we could consider ignoring the AD field when creating the LAA field. If instead those entries have many positive PL values, then we may want to consider a lossy approach, such as only considering the alleles listed in the genotype field when creating the LAA field.
Is the dataset that you are using public? I was having trouble finding it on the IGSR website. If you can provide a link to it, I would like to see if I can download it and evaluate it. In the meantime, I will try to find another small VCF dataset, experiment with it, and try using Hail to create some local-allele fields.
I was able to download chrY from this 1000 Genomes folder and filter for variants with many alleles. I was then able to run `vcf2zarr explode` and `vcf2zarr encode` and load the resulting dataset using sgkit.
I found that there are many entries where the PL value is full of positive values. Thus, vcf2zarr explode's current implementation counts all the alternate alleles as being observed in those entries, and then puts all of the PL values in the LPL field for those entries. My guess is that something similar is happening in your dataset.
Because the PL values are full of positive values, I think vcf2zarr needs to use a lossy approach to shrink the size of the PL field. One option is to include only the PL values corresponding to the alleles referenced in the GT field. This is easy to implement, and I verified that doing this results in an LPL field with shape `(1164, 2504, 6)`, compared with the PL field's shape of `(1164, 2504, 28)`, in my example dataset. Please let me know what you think. I would be happy to implement this change.
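To make the idea concrete, here is a minimal sketch (not the actual vcf2zarr implementation) of restricting PL to GT-referenced alleles. It uses the standard VCF ordering, where the diploid genotype (j, k) with j ≤ k maps to index k(k+1)/2 + j in the flattened PL array; with 7 alleles that gives 28 PL values, and a GT referencing the reference plus up to two alternates keeps at most 3 local alleles and 6 local PL values, matching the shapes above.

```python
# Sketch of GT-based PL localization; `pl_index` and `local_pl` are
# illustrative helpers, not functions from vcf2zarr.

def pl_index(j, k):
    """Index of diploid genotype (j, k) in the flattened PL array."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j

def local_pl(pl, gt):
    """Keep only the PL entries for alleles in GT (plus the reference)."""
    local = sorted(set(gt) | {0})  # always include the reference allele
    lpl = []
    for b in range(len(local)):
        for a in range(b + 1):
            lpl.append(pl[pl_index(local[a], local[b])])
    return local, lpl

# 7 alleles -> 7*8//2 = 28 PL values; GT references alleles 0 and 3.
pl = list(range(28))
local, lpl = local_pl(pl, gt=(0, 3))
print(local)  # [0, 3]
print(lpl)    # [0, 6, 9]
```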
I also loaded the chrY VCF into Hail and converted it to a Variant Dataset using `transform_gvcf`. The resulting VDS appears to retain all of the LPL values for most of the entries. Every entry I am seeing with values has all 28 values. For example,
+------------------------------------------------------------------------------+
| 'HG00096'.LPL |
+------------------------------------------------------------------------------+
| array<int32> |
+------------------------------------------------------------------------------+
| [0,36,540,36,540,540,36,540,540,540,36,540,540,540,540,36,540,540,540,540... |
| [0,18,270,18,270,270,18,270,270,270,18,270,270,270,270,18,270,270,270,270... |
| [349,349,349,24,24,0,349,349,24,349,349,349,24,349,349,349,349,24,349,349... |
| [0,33,382,33,382,382,33,382,382,382,33,382,382,382,382,33,382,382,382,382... |
| [0,0,242,0,242,242,0,242,242,242,0,242,242,242,242,0,242,242,242,242,242,... |
| [0,0,119,0,119,119,0,119,119,119,0,119,119,119,119,0,119,119,119,119,119,... |
| [0,28,610,28,610,610,28,610,610,610,28,610,610,610,610,28,610,610,610,610... |
| [0,0,20,0,20,20,0,20,20,20,0,20,20,20,20,0,20,20,20,20,20,0,20,20,20,20,2... |
| [0,0,117,0,117,117,0,117,117,117,0,117,117,117,117,0,117,117,117,117,117,... |
| [330,330,330,330,330,330,330,330,330,330,30,30,30,30,0,330,330,330,330,30... |
+------------------------------------------------------------------------------+
The Hail code that creates the local-allele fields does not appear to be doing anything sophisticated. For example, for the LA field, it just creates a range of numbers based on the allele count instead of determining which alleles are observed in the entry.
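A minimal sketch of the contrast described above (both helpers are hypothetical, written for illustration only):

```python
# What Hail appears to do vs. an observed-allele alternative.

def la_range(n_alt):
    # Range-based LA: every allele at the site stays "local".
    return list(range(n_alt + 1))

def la_observed(gt):
    # Alternative: only alleles actually observed in this entry's GT,
    # plus the reference allele.
    return sorted(set(gt) | {0})

print(la_range(6))          # [0, 1, 2, 3, 4, 5, 6]
print(la_observed((0, 3)))  # [0, 3]
```

With the range-based approach, no per-entry reduction happens at all, which is consistent with every entry keeping 28 LPL values.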
That's excellent @Will-Tyler, well done! If all 28 values are non-zero, then there's not a great deal we can do, as you say. I don't know how useful these values are, or whether anyone really uses them, but that's a different story.
I'm surprised that this approach is so ineffective, when the SVCR paper pushes it so hard (and people have gone to so much trouble to get it into the VCF standard).
I've been looking at data from the same pipeline as you (1000 Genomes, 2020), so maybe it's a property of the variant caller used there? Maybe different programs that output PLs are more discriminating? I've had a quick look and it seems that 1000 Genomes phase 1 and phase 3 didn't use PL values.
So, maybe we need to figure out which variant callers emit PLs and see if we can track down a dataset that has them?
From the VCF file, the variant caller appears to be GATK. I thought more about this and did some reading on variant calling today. First, after reading this GATK article on PL calculation, it is no longer surprising to me that the PL values are mostly positive with some zeroes. PL is $$-10 \log_{10}(P(\text{genotype} | \text{sequencing data}))$$, which is then normalized by subtracting the smallest PL value from the other PL values so that the smallest PL value is always zero. Genotypes with a PL of zero are the most-likely genotypes in the entry, and the other positive PL values correspond to less likely genotypes (the higher the PL, the less likely the corresponding genotype). Thus, the called genotype usually has a PL value of zero, while the other less-likely genotypes have positive PL values.
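The normalization step is easy to demonstrate with made-up genotype likelihoods (the probabilities below are invented for illustration):

```python
import math

# Sketch of the PL calculation described in the GATK article:
# PL = -10 * log10 P(genotype | data), then normalized by subtracting
# the smallest value so the most likely genotype gets PL = 0.
def pls_from_likelihoods(probs):
    raw = [-10 * math.log10(p) for p in probs]
    m = min(raw)
    return [round(x - m) for x in raw]

# Hypothetical likelihoods for genotypes 0/0, 0/1, 1/1:
print(pls_from_likelihoods([0.9, 0.09, 0.01]))  # [0, 10, 20]
```

So every entry necessarily has at least one zero, and every other genotype gets a positive PL, however unlikely, which is exactly the "mostly positive with some zeroes" pattern in the data.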
Perhaps vcf2zarr can only consider PL values ~below~ above (less-likely than) a certain threshold (such as 60, which corresponds to 1 in a million) for the LAA and LPL fields. Alternatively, vcf2zarr could rely on other fields such as AD for determining which alleles should be considered local. It seems reasonable to me to ignore PL values for genotypes that have an allele with zero statistically-significant reads (according to the AD field).
Thanks for doing the background reading @Will-Tyler, that's very helpful.
It seems reasonable to me to ignore PL values for genotypes that have an allele with zero statistically-significant reads (according to the AD field).
I agree. I guess a first pass would be to only count alleles with AD>0, and see if this gets us any gains?
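The AD>0 rule could be sketched like this (a hypothetical helper, not the bio2zarr implementation; LAA indices are 1-based alternate-allele indices as in the VCF spec):

```python
# Sketch: an alternate allele only counts as local if its AD
# (allelic depth) entry is positive.

def laa_from_ad(ad):
    """Return 1-based indices of alternate alleles with AD > 0."""
    return [i for i, depth in enumerate(ad[1:], start=1) if depth > 0]

# AD for ref + 6 alts; only alt 2 has supporting reads.
print(laa_from_ad([30, 0, 12, 0, 0, 0, 0]))  # [2]
```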
Is it worth making `--no-local-alleles` the default for the moment?
I guess a first pass would be to only count alleles with AD>0, and see if this gets us any gains?
Good idea, let me try this. There are 346 entries where all of the AD values are positive, so the LAA field and the LPL field have the same shapes as before (`(1164, 2504, 6)` and `(1164, 2504, 28)`, respectively). But the storage size is smaller in ICF:
```
vcf2zarr inspect data/1kg_chrY_limited.icf | grep PL
FORMAT/PL    Integer  7  311.46 MiB  20.11 MiB  28   0  4.6e+04
FORMAT/LPL   Integer  4   75.81 MiB  10.33 MiB  28  -1  4.6e+04
```
And smaller in VCF Zarr as well:
```
vcf2zarr inspect data/1kg_chrY_limited.vcz | grep PL
/call_PL   int32  21.22 MiB  311.32 MiB  15  3  103.77 MiB  7.07 MiB  (1164, 2504, 28)  (10000, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)  None
/call_LPL  int32  13.17 MiB  311.32 MiB  24  3  103.77 MiB  4.39 MiB  (1164, 2504, 28)  (10000, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)  None
```
Presumably, fewer alleles are being counted as local, so fewer of the PL values are copied over and the fill value is used instead.
That is a reasonable saving all right. Hardly seems worth all the effort though, if I'm honest, as it's the dimensions of the PL field that cause problems more than the overall storage size.
Truncating to a max of 127 seems entirely reasonable anyway, so that'll reduce our storage 4-fold here (int32 to int8). Spending all that space just to encode precisely how unlikely the variant caller thinks a combination of alleles with no supporting reads is seems pretty silly.
What happens if we truncate to just keep zero and the next-most likely value?
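Both truncation ideas are easy to sketch (hypothetical helpers, assuming a cap of 127 so values fit in an int8; the sentinel choice is an assumption):

```python
# Sketch of two lossy PL reductions discussed above.

def truncate_pl(pl, cap=127):
    # Cap every PL value so the field fits in an int8.
    return [min(v, cap) for v in pl]

def keep_two_tiers(pl, cap=127):
    # Keep zero and the next-most-likely value; map everything else
    # to the cap as an "effectively impossible" sentinel.
    tiers = sorted(set(pl))[:2]
    return [v if v in tiers else cap for v in pl]

pl = [0, 36, 540, 36, 540, 540]
print(truncate_pl(pl))     # [0, 36, 127, 36, 127, 127]
print(keep_two_tiers(pl))  # [0, 36, 127, 36, 127, 127]
```

On values like the Hail output above (one zero tier, one mid tier, one large tier), the two approaches coincide; they differ when an entry has three or more distinct PL values below the cap.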
I must do some reading to figure out how people use PLs...
Thank you for sharing this @tomwhite! The tag2tag plugin does not support localization yet (see here), but the merge command has some interesting functionality that I had not considered before. I played around with the merge command using the `--force-single` and `--local-alleles` options.
The user tells the merge command to use a certain number of local alleles. The merge command then decides which alleles to include by calculating, for each alternate allele, the sum of the probabilities of the genotypes that reference the allele. The merge command uses the alleles with the highest genotype-probability sums as the local alleles. (Source: `init_local_alleles`.)
The merge command's approach only uses the PL field and the user's input to determine which alleles to make local. This approach also controls the shape of the output data. If we decide to implement the same approach, we can test against the merge command's implementation. If we want to adopt this approach in bio2zarr, I would certainly be happy to work on it.
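A minimal sketch of that scoring scheme, modeled on the description above rather than translated from the bcftools source (function and variable names here are invented):

```python
# Score each allele by summing the probabilities (recovered from PL)
# of every genotype that references it, then keep the top alternates.

def allele_scores(pl, n_alleles):
    scores = [0.0] * n_alleles
    idx = 0
    for k in range(n_alleles):
        for j in range(k + 1):
            p = 10 ** (-pl[idx] / 10)  # invert PL back to a probability
            scores[j] += p
            if k != j:
                scores[k] += p
            idx += 1
    return scores

def pick_local(pl, n_alleles, n_local):
    scores = allele_scores(pl, n_alleles)
    # Rank alternate alleles (index >= 1) by score, highest first.
    alts = sorted(range(1, n_alleles), key=lambda a: -scores[a])
    return sorted(alts[:n_local])

# 3 alleles -> 6 PL values in order 0/0, 0/1, 1/1, 0/2, 1/2, 2/2;
# allele 1 is well supported, allele 2 is not.
print(pick_local([50, 0, 10, 200, 200, 200], n_alleles=3, n_local=1))  # [1]
```

One nice property, as noted above, is that the output shape is fixed by the user's `n_local` choice rather than by the worst-case entry.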
After running conversion on 1000 Genomes chr2 data, I'm not seeing any reduction in the size of PL fields. Here's the ICF:
and on the VCZ (for a small number of variants, using `--max-variant-chunks`):
So, there's no reduction in the maximum dimension and the storage sizes are essentially identical.
Have you any ideas what might be going on here @Will-Tyler?
I think it would be really worthwhile getting some truth data for LPL that we could compare with. It does seem that running Hail is the only way to do this, so it's probably worth the effort.