samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
674 stars 240 forks source link

Understanding order of operations when filtering #1971

Closed reubwn closed 1 year ago

reubwn commented 1 year ago

Hi all,

I'm wondering if someone might be able to explain the following observation.

  1. When applying -a -c1 before -i 'TYPE="snp"', the first 20 entries of my VCF is:
    >> bcftools view -S samples.txt mydata.bcf -Ou | bcftools view -a -c1 -Ou | bcftools view -i 'TYPE="snp"' -Ou | bcftools query -f '%CHROM %POS %REF %ALT %AN %AC\n' | head -20
    Pf3D7_01_v3 98868 A G 28 18
    Pf3D7_01_v3 100315 G C 28 8
    Pf3D7_01_v3 100325 C A 28 6
    Pf3D7_01_v3 100330 A G 28 16
    Pf3D7_01_v3 100369 G T 28 2
    Pf3D7_01_v3 100608 A G 28 10
    Pf3D7_01_v3 101269 G T 28 28
    Pf3D7_01_v3 101705 CAAATGTAGAACATGATGCTGAAGA AAAATGTAGAACATGATGCTGAAGA 26 2
    Pf3D7_01_v3 101832 AGAAGAAAATGTAGAAGAAAATGTAGAAGAAAATGTTGAAGAAAATGTAGAAGAAAATGTTGAAGAAAATGTAGAAGAAAATGTAGAAGAAAATGTTGAAGAATATGATGAAGAAAATGTTGAAGAAGTAGAAGAAAATGTTGAAGAATATGAT TGAAGAAAATGTAGAAGAAAATGTAGAAGAAAATGTTGAAGAAAATGTAGAAGAAAATGTTGAAGAAAATGTAGAAGAAAATGTAGAAGAAAATGTTGAAGAATATGATGAAGAAAATGTTGAAGAAGTAGAAGAAAATGTTGAAGAATATGAT 26 1
    Pf3D7_01_v3 104775 G T 28 2
    Pf3D7_01_v3 105189 G C 28 4
    Pf3D7_01_v3 111503 C A,T 28 8,6
    Pf3D7_01_v3 111635 C T 28 2
    Pf3D7_01_v3 111774 G T 28 2
    Pf3D7_01_v3 112172 T G 28 2
    Pf3D7_01_v3 112619 G A 28 2
    Pf3D7_01_v3 113105 CTTCTTCTTG GTTCTTCTTG 28 4
    Pf3D7_01_v3 113889 A T 28 2
    Pf3D7_01_v3 113892 T A 28 2
    Pf3D7_01_v3 114473 T G 28 18
  2. But when -a -c1 comes after -i 'TYPE="snp"', the first 20 entries are:
    >> bcftools view -S samples.txt mydata.bcf -Ou | bcftools view -i 'TYPE="snp"' -Ou | bcftools view -a -c1 -Ou | bcftools query -f '%CHROM %POS %REF %ALT %AN %AC\n' | head -20
    Pf3D7_01_v3 98868 A G 28 18
    Pf3D7_01_v3 100608 A G 28 10
    Pf3D7_01_v3 101269 G T 28 28
    Pf3D7_01_v3 104775 G T 28 2
    Pf3D7_01_v3 105189 G C 28 4
    Pf3D7_01_v3 112619 G A 28 2
    Pf3D7_01_v3 113889 A T 28 2
    Pf3D7_01_v3 113892 T A 28 2
    Pf3D7_01_v3 114473 T G 28 18
    Pf3D7_01_v3 114804 C T 28 2
    Pf3D7_01_v3 115610 C T 28 2
    Pf3D7_01_v3 119966 A G 28 2
    Pf3D7_01_v3 119985 T C 28 2
    Pf3D7_01_v3 120731 C A 28 6
    Pf3D7_01_v3 120736 T A 28 2
    Pf3D7_01_v3 120746 C T 28 2
    Pf3D7_01_v3 125303 C T 28 12
    Pf3D7_01_v3 125305 A T 28 2
    Pf3D7_01_v3 130339 C T 28 22
    Pf3D7_01_v3 130506 T C 28 2

    From the bcftools documentation I understand that the order that filters are applied can obviously make a big difference to the result, especially when applying a subset command as here (hence splitting the filtering steps into separate commands to make the order explicit).

But I don't understand why the placement of the -a (which drops unused alleles from the ALT column, useful for viewing after subsetting) and -c1 command (removing sites not found in the subsample; again useful for viewing) would affect the final result? I would have expected the two commands to be independent of one another, since TYPE="snp" should remain true regardless of the superfluous information being removed (and both come after the initial subset command).

For example, why has the SNP at position 100315, which seems a perfectly good call (with 8 instances of the ALT allele among the subsampled genotypes), been lost in scenario 2?

I'm sure there's a simple explanation but I'm at a loss to understand it.

Thanks!

Ps I'm using bcftools 1.16, htslib 1.16

reubwn commented 1 year ago

Aha, think I've discovered what is going on...

By removing the TYPE="snp" invocation and just manually filtering out sites with no ALT allele in the subsample I can see that the sites that are mysteriously disappearing all have an * in the (unedited) ALT column, indicating that a spanning/overlapping deletion exists somewhere else in the callset.

E.g. the calls at positions 100315, 100325, 100330 and 100369 below, which are all "missing" from scenario 2 above.

>> bcftools view -S Pf7_samples/fws_95/Zomba.txt -Ou Pf3D7_core.vsnps.bcf | bcftools query -f '%CHROM %POS %REF %ALT %AN %AC\n' | perl -lane 'use List::Util qw /sum/; @c=split(",",$F[-1]); print if sum(@c)>0' | head 
Pf3D7_01_v3 98868 A G,T 28 18,0
Pf3D7_01_v3 100315 G C,* 28 8,0
Pf3D7_01_v3 100325 C A,T,* 28 6,0,0
Pf3D7_01_v3 100330 A G,* 28 16,0
Pf3D7_01_v3 100369 G T,* 28 2,0
Pf3D7_01_v3 100461 CGTAGAAGAACCAACTGTTGCTGAAGAACAT CGTAGAAGAACCAACTGTTGCTGAAGAACATGTAGAAGAACCAACTGTTGCTGAAGAACAT,CGTAGAAGAACCAACTGTTGCTGATGAACACGTAGAAGAACCAACTGTTGCTGAAGAACATGTAGAAGAACCAACTGTTGCTGAAGAACAT,TGTAGAAGAACCAACTGTTGCTGAAGAACAT,*,C,CGTAGAAGAACCAACTGTTGCTGAAGAACATGTAGAAGAACCAACTGTTGCTGAAGAACATGTAGAAGAACCAACTGTTGCTGAAGAACAT 28 2,0,0,0,0,0
Pf3D7_01_v3 100491 TGTAGAAGAACCAACTGTTGCTGAAGAACAC TGTAGAAGAACCAACTGTTGCTGAAGAACACGTAGAAGAACCAACTGTTGCTGAAGAACAC,CGTAGAAGAACCAACTGTTGCTGAAGAACAC,T,*,TGTAGAAGAACCAACTGTTGCTGAAGAACATGTAGAAGAACCAACTGTTGCTGAAGAACACGTAGAAGAACCAACTGTTGCTGAAGAACAC,TGTAGAAGAACCAACTGTTGCTGAAGAACACGTAGAAGAACCAACTGTTGCTGAAGAACACGTAGAAGAACCAACTGTTGCTGAAGAACAC 28 2,0,0,0,0,0
Pf3D7_01_v3 100608 A G 28 10
Pf3D7_01_v3 101269 G T 28 28
Pf3D7_01_v3 101705 CAAATGTAGAACATGATGCTGAAGA C,*,AAAATGTAGAACATGATGCTGAAGA,CAAATGTAGAACATGATGCTGAAGAAAATGTAGAACATGATGCTGAAGA,TAAATGTAGAACATGATGCTGAAGA,CATGATGCTGAAGAAAATGTAGAACATGATGCTGAAGA 26 0,0,2,0,0,0

The TYPE="snp" filter looks at the ALT column and specifies that only SNPs should pass, and so filters these sites because it finds something that isn't a SNP in the ALT column, even though it doesn't exist in the subsample. By trimming unused ALTs prior to filtering for SNPs, the * are dropped and the site passes. Doh, silly me!

Still, I wonder if the default behaviour for -s/-S should be to only show sites/ALTs present in the sample? i.e., automatically switch on the behaviour caused by -a -c1, forcing the user to switch it off if you wanted to look at ALTs present in the callset as a whole but not in the subset, for some reason.