samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
663 stars 240 forks source link

Settings to reduce reference bias? #2093

Closed elsemikk closed 7 months ago

elsemikk commented 8 months ago

Hello,

I was wondering if there are any settings in bcftools that can avoid favouring the reference allele over the alternate allele during genotype calling? I am studying speciation genomics/phylogenomics in birds have found that reference bias can lead to some incorrect downstream inferences when looking for subtle signatures of gene flow etc. between species when dealing with low depth samples.

I am looking for a way to call genotypes without using the reference genome base to inform which genotype is more likely. For example, consider two heterozygous samples sequenced to 5x depth. By chance, one has 1 ref and 4 alt alleles, and the other has 4 ref and 1 alt alleles. With single sample calling, the first will be called as heterozygous but the second will be called as homozygous reference (I assume because the 1 alt allele can't be distinguished from a sequencing error). These are probably the best calls to make in this situation, but it means that the low depth samples are biased against having their derived alleles called. This can lead to some incorrect results with analyses that are sensitive to the number of derived alleles in a sample (for example, ABBA BABA tests and other widely used tests for hybridization/gene flow), especially when combining samples with different levels of sequencing depth (and therefore influenced to different degrees by this reference bias). In an ideal world, all samples would be sequenced to very high depth where this would no longer be a major issue, but sometimes that is just not possible with rare samples. I understand that using the reference genome to inform the genotyping can lead to greater accuracy, but in this case I would prefer more missing data or data that is less accurate rather than data that is more accurate in a biased direction. Is there any way to do this within bcftools?

I was considering replacing variable sites with ambiguity codes or N's in my reference genome, but I'm not sure what effect that would have on the genotype calling, and it requires some circularity in first needing to call genotypes to identify which sites are variable before they can be replaced with ambiguity codes.

here is an illustration in case it helps:

At a 150 bp region, one heterozygote has 4 A's and 1 T allele at the first position: het_4r1a.sam

@HD VN:1.6  SO:coordinate
@SQ SN:scaffold1    LN:150
1   99  scaffold1   1   60  150M    =   1   150 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:0  MD:Z:150    AS:i:80 MQ:i:60
2   99  scaffold1   1   60  150M    =   1   150 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:0  MD:Z:150    AS:i:80 MQ:i:60
3   99  scaffold1   1   60  150M    =   1   150 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:0  MD:Z:150    AS:i:80 MQ:i:60
4   99  scaffold1   1   60  150M    =   1   150 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:0  MD:Z:150    AS:i:80 MQ:i:60
5   99  scaffold1   1   60  150M    =   1   150 TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:1  MD:Z:A149   AS:i:80 MQ:i:60

Another heterozgous sample by chance has 1 A and 4 T alleles: het_1r4a.sam

@HD VN:1.6  SO:coordinate
@SQ SN:scaffold1    LN:150
1   99  scaffold1   1   60  150M    =   1   150 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:0  MD:Z:150    AS:i:80 MQ:i:60
2   99  scaffold1   1   60  150M    =   1   150 TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:1  MD:Z:A149   AS:i:80 MQ:i:60
3   99  scaffold1   1   60  150M    =   1   150 TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:1  MD:Z:A149   AS:i:80 MQ:i:60
4   99  scaffold1   1   60  150M    =   1   150 TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:1  MD:Z:A149   AS:i:80 MQ:i:60
5   99  scaffold1   1   60  150M    =   1   150 TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:1  MD:Z:A149   AS:i:80 MQ:i:60

The reference genome allele is A: ref.fa

>scaffold1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

We call their genotypes:

bcftools-1.19/bcftools mpileup -B -a AD,DP,SP,QS -Ou -f ref.fa het_4r1a.sam | bcftools-1.19/bcftools call -f GQ,GP -m > het_4r1a.vcf

bcftools-1.19/bcftools mpileup -B -a AD,DP,SP,QS -Ou -f ref.fa het_1r4a.sam | bcftools-1.19/bcftools call -f GQ,GP -m > het_1r4a.vcf

The first sample is called as homozygous reference:

scaffold1       1       .       A       .       9.08516 .       DP=5;SGB=-0.379885;RPBZ=0;MQBZ=0;BQBZ=0;SCBZ=0;MQ0F=0;AN=2;DP4=4,0,1,0;MQ=60    GT:DP:SP:AD:QS  0/0:5:0:4:164

The second sample is called as heterozygous:

scaffold1       1       .       A       T       77.522  .       DP=5;VDB=0.0058656;SGB=-0.556411;RPBZ=0;MQBZ=0;BQBZ=0;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=1,0,4,0;MQ=60 GT:PL:DP:SP:AD:QS:GP:GQ 0/1:112,0,26:5:0:1,4:41,164:7.84754e-13,0.995001,0.00499866:23

It would have much less effect on my downstream analyses if the two situations could instead be treated the same, for either of three possibilities:

Even though each of those possibilities would make the data less accurate overall, I think that the reduction in bias would lead to more accuracy in the downstream interpretation of results for certain analyses. Any chance something like this is possible with bcftools?

pd3 commented 8 months ago

There is the bcftools call --prior option which tells the program what is the expected mutation rate. By increasing it, one can increase the sensitivity. At the cost of reduced specificity, of course.