tskit-dev / tskit

Population-scale genomics
MIT License
157 stars 73 forks source link

VCF ALT fields are non-conforming, and thus misinterpreted by bcftools #2746

Open grahamgower opened 1 year ago

grahamgower commented 1 year ago

It seems that the ALT field being produce by write_vcf() is an integer. The VCF v4.2 spec indicates that the ALT field should be a nucleotide, a star, or a symbolic allele in angle brackets. The VCF v4.3 spec helpfully provides a regex: ^([ACGTNacgtn]+|\*|\.)$.

This non-conformance becomes a problem when using bcftools filters. As a concrete example, I'm using the exclusion filter with TYPE!='snp' (bcftools view -e "TYPE!='snp'"), which misinterprets the integers in the ALT column and thus excludes almost all sites.

Ping @MleGhis.

jeromekelleher commented 1 year ago

Good to know, thanks @grahamgower. Do you have an MWE to reproduce?

grahamgower commented 1 year ago

This issue arose from a tree sequence produced by SLiM, so I've used the SLiMMutationModel below. I guess other mutation models could also be problematic?

# a.py
import sys
import msprime

ts = msprime.sim_ancestry(
    10, population_size=10000, sequence_length=100_000, random_seed=1,
)
ts = msprime.sim_mutations(
    ts, model=msprime.SLiMMutationModel(0), rate=1e-8, random_seed=2,
)
ts.write_vcf(sys.stdout)

Filtering the vcf to exclude TYPE!=snp will exclude most of the sites. (Note that -H just omits the vcf header in the output)

$ python a.py > a.vcf
$ bcftools view -H a.vcf | wc -l
128
$ bcftools view -H -e "TYPE!='snp'" a.vcf | wc -l
10
jeromekelleher commented 1 year ago

Righhht, I can see how the SLiM mutations would cause a problem. I'm not sure what we can do here - we'd have to remap the alleles to some arbitrary strings of ACTG, which might be confusing in itself. I guess this would be OK most of the time, and we could provide an option to turn off the feature (for people who want to see the original alleles)?

grahamgower commented 1 year ago

Maybe the correct thing to do is to enclose the alleles in angle brackets? So, <1> instead of 1? This would still cause problems for my filtering (I guess), but at least it would be more in line with the VCF spec.

jeromekelleher commented 1 year ago

Would that keep the regex happy? It's pretty specific isn't it?

^([ACGTNacgtn]+|*|.)$

grahamgower commented 1 year ago

Oh, that regex seems incomplete! This is the text from the VCFv4.3 spec:

ALT — alternate base(s): Comma-separated list of alternate non-reference alleles. These alleles do not have
to be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N (case insensitive)
or the ‘*’ symbol (allele missing due to overlapping deletion) or a MISSING value ‘.’ (no variant) or an
angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in Section 5.4. If there
are no alternative alleles, then the MISSING value must be used. In other words, the ALT field must be a
symbolic allele, or a breakend replacement string, or match the regular expression ^([ACGTNacgtn]+|\*|\.)$.
Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case
sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)

And from the VCF4.2 spec:

ALT - alternate base(s): Comma separated list of alternate non-reference alleles. These alleles do not have to
be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive)
or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on
breakends. The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion. If there
are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required
to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas,
or angle-brackets are permitted in the ID String itself)
jeromekelleher commented 1 year ago

Ah, I see - we'd create IDs for the alleles. I think that's a reasonable thing to do?

petrelharp commented 1 year ago

Ack, sorry, I should have noticed this issue earlier. This is the reason for pyslim.convert_alleles, preceded by pyslim.generate_nucleotides if what you have simulated in SLiM is not a nucleotide model.

So, doing

pyslim.convert_alleles(pyslim.generate_nucleotides(ts)).write_vcf(...)

should give you valid VCF from a SLiM simulation.