samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
666 stars 240 forks source link

bcftools roh fails for samples used to generate the reference genome #1909

Open A-J-F-Mackintosh opened 1 year ago

A-J-F-Mackintosh commented 1 year ago

Hi,

I have been using bcftools roh to estimate runs of homozygosity. I noticed that the individual used to generate the reference genome sequence had no roh calls, despite the fact that I have visualised a roh in this individual previously.

This individual will almost always have the genotypes 0/0 or 0/1 at a given site, as 1/1 would suggest an error in the genome assembly or variant calling.

The bcftools roh HMM calculates the likelihood of the hidden states (N = non-roh, H = roh) as follows:

P(Di|Xi=H)=(1–fi)P(Di|RR)+fiP(Di|AA)

P(Di|Xi=N)=(1–fi)2P(Di|RR)+2fi(1–fi)P(Di|RA)+fi2P(Di|AA)

My problem, I think, is that I have almost zero 'AA' genotypes (i.e. 1/1) when using the reference individual, so these probabilities are not meaningful.

Do you have any suggestions to get around this problem? One option would be to make a pseudo-vcf where alleles in this individual are flipped at a frequency of 0.4 (the default AF value).

Any advice would be very welcome,

Alex

pd3 commented 1 year ago

That's a correct observation. Hopefully a different solution might work. As you've shown above, one of the key inputs of the algorithm are population allele frequencies. If the AF tag indicates the reference allele is the rare allele, it should still correctly predict a stretch of homref genotypes as a RoH region. For this, alternate allele must be present in the ALT column, even when unused, because monoallelic sites are skipped.