jonassibbesen / rpvg

Method for inferring path posterior probabilities and abundances from pangenome graph read alignments
MIT License
47 stars 6 forks source link

Haplotype probability and read count #57

Open agolicz opened 1 year ago

agolicz commented 1 year ago

Hi, We've been trying out rpvg on a crop plant and it looks very promising! I have a question regarding inferring haplotype probability when read counts are 0. Index was built using autoindex and quantification was done in --inference-model haplotype-transcripts mode.

I have cases where both single and joint haplotype probability is 1, but read count is zero. For example:

fgrep -f key.txt 843A.mpmap.gamp.rpvg_joint.txt
Name_1  Name_2  ClusterID       HaplotypingProbability  ReadCount_1     TPM_1   ReadCount_2     TPM_2
C03p009100.1_BnaEXP_H1  C03p009100.1_BnaEXP_H1  54565   1       0       0       0       0

Since they all seemed to be _H1 (just from visual inspection of results), I thought that maybe if there is only one haplotype, it is automatically assigned probability of 1.

But when I then looked at marginal haplotype probabilities they could be either 1 or 0.

fgrep -f key2.txt ../843A.mpmap.gamp.rpvg.txt
Name    ClusterID       Length  EffectiveLength HaplotypeProbability    ReadCount       TPM
C03p009100.1_BnaEXP_H1  54565   573     407.61853       1       0       0
fgrep -f key2.txt ../845A.mpmap.gamp.rpvg.txt
Name    ClusterID       Length  EffectiveLength HaplotypeProbability    ReadCount       TPM
C03p009100.1_BnaEXP_H1  116249  573     407.63127       0       0       0

How are haplotype probabilities derived if there are no reads mapped? Could it be that there were some reads supporting assignment but they did not make it into count?

All the best, Agnieszka

jonassibbesen commented 1 year ago

Hi Agnieszka,

Thank you for writing. The diplotypes are inferred using all haplotype-specific transcripts (HST) in a cluster so it is possible to have individual HSTs with zero inferred read count and a haplotyping probability of 1. Another situation where this can also happen is if there is only one haplotype-specific version of a transcript. What is happening in your specific case I am not sure. I am happy to look more into it if you are able to share the results for both 843A and 845A. I would also need the --path-info input. You can share it using jonassibbesen@gmail.com.

agolicz commented 1 year ago

Thank you! I've sent you an email. My main concern is making sure that the genes we look at are always present, so we don't mix up cases of abolished expression due to regulatory changes and complete gene deletion. In plants gene presence/absence variation is a common occurrence. Agnieszka