Open grahamgower opened 1 year ago
Good to know, thanks @grahamgower. Do you have an MWE to reproduce?
This issue arose from a tree sequence produced by SLiM, so I've used the SLiMMutationModel below. I guess other mutation models could also be problematic?
# a.py
import sys
import msprime
ts = msprime.sim_ancestry(
10, population_size=10000, sequence_length=100_000, random_seed=1,
)
ts = msprime.sim_mutations(
ts, model=msprime.SLiMMutationModel(0), rate=1e-8, random_seed=2,
)
ts.write_vcf(sys.stdout)
Filtering the vcf to exclude TYPE!=snp will exclude most of the sites. (Note that -H
just omits the vcf header in the output)
$ python a.py > a.vcf
$ bcftools view -H a.vcf | wc -l
128
$ bcftools view -H -e "TYPE!='snp'" a.vcf | wc -l
10
Righhht, I can see how the SLiM mutations would cause a problem. I'm not sure what we can do here - we'd have to remap the alleles to some arbitrary strings of ACTG, which might be confusing in itself. I guess this would be OK most of the time, and we could provide an option to turn off the feature (for people who want to see the original alleles)?
Maybe the correct thing to do is to enclose the alleles in angle brackets? So, <1>
instead of 1
? This would still cause problems for my filtering (I guess), but at least it would be more in line with the VCF spec.
Would that keep the regex happy? It's pretty specific isn't it?
^([ACGTNacgtn]+|*|.)$
Oh, that regex seems incomplete! This is the text from the VCFv4.3 spec:
ALT — alternate base(s): Comma-separated list of alternate non-reference alleles. These alleles do not have
to be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N (case insensitive)
or the ‘*’ symbol (allele missing due to overlapping deletion) or a MISSING value ‘.’ (no variant) or an
angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in Section 5.4. If there
are no alternative alleles, then the MISSING value must be used. In other words, the ALT field must be a
symbolic allele, or a breakend replacement string, or match the regular expression ^([ACGTNacgtn]+|\*|\.)$.
Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case
sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
And from the VCF4.2 spec:
ALT - alternate base(s): Comma separated list of alternate non-reference alleles. These alleles do not have to
be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive)
or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on
breakends. The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion. If there
are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required
to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas,
or angle-brackets are permitted in the ID String itself)
Ah, I see - we'd create IDs for the alleles. I think that's a reasonable thing to do?
Ack, sorry, I should have noticed this issue earlier. This is the reason for pyslim.convert_alleles
, preceded by pyslim.generate_nucleotides
if what you have simulated in SLiM is not a nucleotide model.
So, doing
pyslim.convert_alleles(pyslim.generate_nucleotides(ts)).write_vcf(...)
should give you valid VCF from a SLiM simulation.
It seems that the ALT field being produce by
write_vcf()
is an integer. The VCF v4.2 spec indicates that the ALT field should be a nucleotide, a star, or a symbolic allele in angle brackets. The VCF v4.3 spec helpfully provides a regex:^([ACGTNacgtn]+|\*|\.)$
.This non-conformance becomes a problem when using
bcftools
filters. As a concrete example, I'm using the exclusion filter with TYPE!='snp' (bcftools view -e "TYPE!='snp'"
), which misinterprets the integers in the ALT column and thus excludes almost all sites.Ping @MleGhis.