vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

Bad handling of exploded multiallelic and/or overlapping variants when adding allele paths and building a gPBWT #447

Open adamnovak opened 8 years ago

adamnovak commented 8 years ago

The 1000 Genomes chr22 VCF ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz contains records with the same rsID and different alts.

The allele path mechanism in vg construct expects rsIDs to be unique, because it uses them to name its allele paths, so it will produce broken allele paths in these cases. Duplicate rsIDs ought to be an error for vg construct -a.

Even if it didn't do that, when you take allele paths that partially overlap or go backwards, such as you get from overlapping variants or from multiallelic variants expressed as multiple records (which are overlapping variants), and try to make a gPBWT, it just sort of blindly strings the allele paths together. When it notices that you can't go from the end of the deletion to the start of the nested SNP (because there's no edge in the graph), it will break the gPBWT haplotype at that position.

It might be nice to interpret "0" alleles specially when the same sample is also genotyped for an overlapping site. A 0|1 on a deletion and 1|0 on a different partially overlapping deletion should be read as denoting 0 copies of the deleted part common to both deletions. However, this reeks of ill-definedness.

What we should probably do is catch these sorts of overlaps and complain to the user. We could maybe even refuse to make the gPBWT until they normalize their VCF and merge overlapping sites.

ekg commented 8 years ago

What we should probably do is catch these sorts of overlaps and complain to the user. We could maybe even refuse to make the gPBWT until they normalize their VCF and merge overlapping sites.

One problem is that normalization may require going back to the source data used to construct the VCF, which many people (us included, for the 1000G case) may not be able to complete easily.

It might be nice to interpret "0" alleles specially when the same sample is also genotyped for an overlapping site. A 0|1 on a deletion and 1|0 on a different partially overlapping deletion should be read as denoting 0 copies of the deleted part common to both deletions. However, this reeks of ill-definedness.

We could take a more limited approach, and only break the specific haplotype thread where the overlap occurs, considering overlapping reference alleles as being in agreement. This is a bit tricky, but I think it is justifiable. It exploits the fact that VCF treats the REF allele as special and constant so that we can ignore REF (or "0") cases where there are overlapping reports.

To keep working now, we can use vcfgeno2haplo from vcflib to detect the errors and filter them out. This could be done by removing the genotype calls where there are inconsistencies. Alternatively, we could filter out all but the first site where inconsistencies occur. I don't like this option but it's worth putting on the table. We could similarly remove all sites where there are overlaps. I also think this is unnecessarily destructive.

I've recently patched vcfgeno2haplo to output the samples where the invalid overlaps in the haplotypes occur. We can use this to drive the filtering. Alternatively, we can embed the same logic in vg construct, with appropriate warnings. It would be best if this wasn't an issue, but it seems very unlikely that we will escape it as VCF provides no guarantee that we're going to get coherent descriptions of the variation.