hringbauer / hapROH

Software to call ROHs (runs of homozygosity)
GNU General Public License v3.0

Posterior probabilities (posterior0.csv) change when rerunning individuals with same parameters #13

Closed · ngaens closed this issue 4 months ago

ngaens commented 4 months ago

Hi Harald,

When I run hapROH with the pseudo-haploid emission model multiple times on the same set of imputed individuals, using identical parameters, I obtain very different ROH calls across all block lengths (2-20 cM). The generated hap.csv, map.csv and pos.csv files are identical between runs, but the posterior probabilities and the total log likelihood differ markedly. I cannot figure out where this difference comes from. As I am quite new to these analyses, it is probably a user error on my side, but I would very much like to understand the nature of this difference.
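
To pin down which outputs actually change between reruns, I compared the files directly. A quick sketch of what I ran (the run1/run2 folder names are placeholders, and I am assuming the CSVs are plain numeric tables without header rows):

```python
import numpy as np
import pandas as pd

# Placeholder paths: output folders from two reruns with identical parameters.
for name in ["hap.csv", "map.csv", "pos.csv", "posterior0.csv"]:
    a = pd.read_csv(f"run1/{name}", header=None)
    b = pd.read_csv(f"run2/{name}", header=None)
    if a.equals(b):
        print(f"{name}: identical")
    else:
        print(f"{name}: differs, max abs diff = {np.abs(a.values - b.values).max()}")
```

Only posterior0.csv reports a difference; the other three files come out identical.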

I followed the parameters of the notebook, but am happy to share my input and output files if needed.

Best, Noah

hringbauer commented 4 months ago

Hi @ngaens, while the hapROH algorithm itself is fully deterministic (it uses an HMM that gives the same output for the same input), you are likely doing random allele sampling from your diploid input:

The hapROH parameter random_allele=True means that at each heterozygous site, one of the two alleles is sampled at random by hapROH.

You could either:

1. Pseudo-haploidize your data before running hapROH (e.g. pick one read at random yourself and encode the individual as homozygous for that allele, as is typical in aDNA).
2. Set a fixed random seed in NumPy (see the sketch below).
3. Run the diploid mode of hapROH on your diploid data.
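
To make option 2 concrete: the reruns differ only because of that one random step, so fixing NumPy's seed at the top of your script (e.g. np.random.seed(42), assuming your hapROH version draws from NumPy's global RNG) makes the output reproducible. Here is a toy sketch of the sampling step itself, for illustration only (this is not hapROH's code):

```python
import numpy as np

def pseudo_haploidize(diploid_gts, seed=None):
    """Toy version of what random_allele=True does: pick one of the
    two alleles at random per site (only heterozygotes can change)."""
    rng = np.random.default_rng(seed)
    gts = np.asarray(diploid_gts)               # shape (n_sites, 2)
    picks = rng.integers(0, 2, size=len(gts))   # which allele to keep
    return gts[np.arange(len(gts)), picks]

gts = [(0, 0), (0, 1), (1, 1), (0, 1)]
print(pseudo_haploidize(gts, seed=42))  # seeded: same calls on every rerun
print(pseudo_haploidize(gts, seed=42))  # identical to the line above
print(pseudo_haploidize(gts))           # unseeded: heterozygous sites can flip
```

Homozygous sites always yield the same call; only the heterozygotes move between unseeded runs, which is exactly why your posteriors and ROH calls shift.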

ngaens commented 4 months ago

Hi Harald,

Thank you very much for your reply.

Indeed, I am using random sampling. However, wouldn't the hap.csv files then differ when hapROH samples random alleles? At the moment the contents of these files are identical across runs that produce different ROH calls (I am probably misinterpreting the hap.csv file).

I ran the diploid mode as well, but it seems that longer ROH blocks are broken up by gaps in that mode. I suppose this mode might be less tolerant of imputation errors. May I ask whether you would recommend pseudo-haploidizing the imputed data or running it in diploid mode?

Thank you again and all the best, Noah

hringbauer commented 4 months ago

The default hapROH mode, which is well tested, is designed for pseudo-haploid data. The diploid mode is intended for high-quality diploid data (such as SNP arrays or high-coverage WGS), as it does not take genotype probabilities (which might be low) into account.

So I strongly recommend pseudo-haploidizing the unimputed data, as is widely done in aDNA, and running hapROH in its default mode.

There is no need to impute outside hapROH (and doing so could introduce bias), as hapROH runs an imputation HMM itself. Because it only tries to copy from one reference haplotype at a time, it also works at lower coverages (where diploid imputation fails).
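
To make the determinism point concrete, here is a heavily simplified sketch of the kind of computation behind the posterior files: a two-state HMM (non-ROH vs. ROH) with forward-backward smoothing. This is not hapROH's actual model (which copies from a full panel of reference haplotypes), but it shows that for fixed per-site emission probabilities the posteriors come out identical on every run; all the run-to-run variation you saw enters upstream, in the allele sampling.

```python
import numpy as np

def forward_backward(emissions, trans, start):
    """Posterior state probabilities of an HMM, shape (n_sites, n_states)."""
    n, k = emissions.shape
    fwd = np.zeros((n, k))
    fwd[0] = start * emissions[0]
    fwd[0] /= fwd[0].sum()
    for t in range(1, n):                      # forward pass, rescaled
        fwd[t] = (fwd[t - 1] @ trans) * emissions[t]
        fwd[t] /= fwd[t].sum()
    bwd = np.ones((n, k))
    for t in range(n - 2, -1, -1):             # backward pass, rescaled
        bwd[t] = trans @ (emissions[t + 1] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

# Toy emissions: state 1 (ROH) explains a homozygous stretch much better.
emis = np.array([[0.5, 0.1]] * 3 + [[0.1, 0.9]] * 5 + [[0.5, 0.1]] * 3)
trans = np.array([[0.95, 0.05],                # non-ROH -> {non-ROH, ROH}
                  [0.05, 0.95]])               # ROH -> {non-ROH, ROH}
post = forward_backward(emis, trans, start=np.array([0.9, 0.1]))
print(np.round(post[:, 1], 3))  # P(ROH) per site; deterministic given input
```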

hringbauer commented 4 months ago

Regarding the detailed question (see the general suggestion above): hap.csv is saved just before the random allele sampling step, which is why it looks identical across runs. I will update this in a future release to make it more intuitive (saving the data exactly as it is used)!

ngaens commented 4 months ago

Hi Harald,

Thank you very much for your swift reply and recommendation. This certainly resolves any uncertainties I was having. Please accept my apologies for any overly simplistic questions. I will close this issue now.

Best wishes, Noah