Closed reubwn closed 1 year ago
Aha, think I've discovered what is going on...
By removing the TYPE="snp" invocation and just manually filtering out sites with no ALT allele in the subsample, I can see that the sites that are mysteriously disappearing all have an * in the (unedited) ALT column, indicating that a spanning/overlapping deletion exists somewhere else in the callset.
E.g. the calls at positions 100315, 100325, 100330 and 100369 below, which are all "missing" from scenario 2 above.
>> bcftools view -S Pf7_samples/fws_95/Zomba.txt -Ou Pf3D7_core.vsnps.bcf | bcftools query -f '%CHROM %POS %REF %ALT %AN %AC\n' | perl -lane 'use List::Util qw /sum/; @c=split(",",$F[-1]); print if sum(@c)>0' | head
Pf3D7_01_v3 98868 A G,T 28 18,0
Pf3D7_01_v3 100315 G C,* 28 8,0
Pf3D7_01_v3 100325 C A,T,* 28 6,0,0
Pf3D7_01_v3 100330 A G,* 28 16,0
Pf3D7_01_v3 100369 G T,* 28 2,0
Pf3D7_01_v3 100461 CGTAGAAGAACCAACTGTTGCTGAAGAACAT CGTAGAAGAACCAACTGTTGCTGAAGAACATGTAGAAGAACCAACTGTTGCTGAAGAACAT,CGTAGAAGAACCAACTGTTGCTGATGAACACGTAGAAGAACCAACTGTTGCTGAAGAACATGTAGAAGAACCAACTGTTGCTGAAGAACAT,TGTAGAAGAACCAACTGTTGCTGAAGAACAT,*,C,CGTAGAAGAACCAACTGTTGCTGAAGAACATGTAGAAGAACCAACTGTTGCTGAAGAACATGTAGAAGAACCAACTGTTGCTGAAGAACAT 28 2,0,0,0,0,0
Pf3D7_01_v3 100491 TGTAGAAGAACCAACTGTTGCTGAAGAACAC TGTAGAAGAACCAACTGTTGCTGAAGAACACGTAGAAGAACCAACTGTTGCTGAAGAACAC,CGTAGAAGAACCAACTGTTGCTGAAGAACAC,T,*,TGTAGAAGAACCAACTGTTGCTGAAGAACATGTAGAAGAACCAACTGTTGCTGAAGAACACGTAGAAGAACCAACTGTTGCTGAAGAACAC,TGTAGAAGAACCAACTGTTGCTGAAGAACACGTAGAAGAACCAACTGTTGCTGAAGAACACGTAGAAGAACCAACTGTTGCTGAAGAACAC 28 2,0,0,0,0,0
Pf3D7_01_v3 100608 A G 28 10
Pf3D7_01_v3 101269 G T 28 28
Pf3D7_01_v3 101705 CAAATGTAGAACATGATGCTGAAGA C,*,AAAATGTAGAACATGATGCTGAAGA,CAAATGTAGAACATGATGCTGAAGAAAATGTAGAACATGATGCTGAAGA,TAAATGTAGAACATGATGCTGAAGA,CATGATGCTGAAGAAAATGTAGAACATGATGCTGAAGA 26 0,0,2,0,0,0
The TYPE="snp" filter looks at the whole ALT column and only passes sites where every ALT allele is a SNP. It therefore removes these sites because it finds something that isn't a SNP (the *) in the ALT column, even though that allele doesn't occur in the subsample. By trimming unused ALTs prior to filtering for SNPs, the * alleles are dropped and the site passes. Doh, silly me!
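For anyone who hits the same thing, here is a toy model of the ordering effect in plain awk, run on rows in the same %CHROM %POS %REF %ALT %AN %AC layout as the listing above. It is only a sketch under stated assumptions: snp_only and trim_unused_alts are hypothetical helper names, and the "every ALT must be a single base" rule is a simplified stand-in for the TYPE="snp" behaviour described here, not how bcftools actually implements it.

```shell
# Toy model only: works on `bcftools query` text output, not on VCF/BCF files.
# Columns: CHROM POS REF ALT AN AC (same layout as the listing above).

# Keep a row only if REF and every ALT allele are a single A/C/G/T base.
# Assumed stand-in for the TYPE="snp" filter: a "*" ALT makes the row fail.
snp_only() {
  awk '{
    ok = ($3 ~ /^[ACGT]$/)
    n = split($4, alt, ",")
    for (i = 1; i <= n; i++) if (alt[i] !~ /^[ACGT]$/) ok = 0
    if (ok) print
  }'
}

# Drop ALT alleles whose AC is 0 in the subsample, and drop rows left with no ALT.
# Assumed stand-in for the -a -c1 step.
trim_unused_alts() {
  awk '{
    n = split($4, alt, ","); split($6, ac, ",")
    kept_alt = ""; kept_ac = ""
    for (i = 1; i <= n; i++) {
      if (ac[i] + 0 > 0) {
        sep = (kept_alt == "") ? "" : ","
        kept_alt = kept_alt sep alt[i]
        kept_ac = kept_ac sep ac[i]
      }
    }
    if (kept_alt != "") print $1, $2, $3, kept_alt, $5, kept_ac
  }'
}

# Three rows copied from the listing above.
rows='Pf3D7_01_v3 98868 A G,T 28 18,0
Pf3D7_01_v3 100315 G C,* 28 8,0
Pf3D7_01_v3 100608 A G 28 10'

# Trim first, then filter: position 100315 survives.
printf '%s\n' "$rows" | trim_unused_alts | snp_only

# Filter first, then trim: position 100315 is removed before trimming can help.
printf '%s\n' "$rows" | snp_only | trim_unused_alts
```

The two pipelines differ only in the order of the steps, which is the same difference as between the two scenarios in the original question.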
Still, I wonder if the default behaviour for -s/-S should be to only show sites/ALTs present in the sample? I.e., automatically switch on the behaviour of -a -c1, so that the user has to switch it off if they want to look at ALTs present in the callset as a whole but not in the subset, for some reason.
Hi all,
I'm wondering if someone might be able to explain the following observation.
Scenario 1: with -a -c1 before -i 'TYPE="snp"', the first 20 entries of my VCF are:
Scenario 2: when -a -c1 comes after -i 'TYPE="snp"', the first 20 entries are:
From the bcftools documentation I understand that the order in which filters are applied can make a big difference to the result, especially when applying a subset command as here (hence splitting the filtering steps into separate commands to make the order explicit).
But I don't understand why the placement of the -a option (which drops unused alleles from the ALT column, useful for viewing after subsetting) and the -c1 option (which removes sites not found in the subsample; again useful for viewing) would affect the final result. I would have expected the two steps to be independent of one another, since TYPE="snp" should remain true regardless of the superfluous information being removed (and both come after the initial subset command).
For example, why has the SNP at position 100315, which seems a perfectly good call (with 8 instances of the ALT allele among the subsampled genotypes), been lost in scenario 2?
I'm sure there's a simple explanation but I'm at a loss to understand it.
Thanks!
P.S. I'm using bcftools 1.16 and htslib 1.16.