Non-arbitrary selection of rohmu parameter?

jordanbemmels commented 3 years ago

Hello,

I was wondering if there is any additional advice for selecting an appropriate rohmu parameter in a way that is non-arbitrary. As I understand it, rohmu is the inferred heterozygosity rate that can be tolerated in an ROH. The presence of "heterozygous" sites in a true ROH could be due to de novo mutations, sequencing error, etc.

I find that with an empirical dataset, I can vary the rohmu parameter and then get very different proportions of the genome that are inferred to be in ROHs. It is unclear to me how to tell which parameter setting is most appropriate, and this has a major effect on my central question: is this population experiencing much inbreeding or not?

The website tutorial suggests examining output plots and if it visually looks like there are regions of reduced heterozygosity that did not get flagged as ROH, then try increasing rohmu. This makes sense in principle, but is difficult to know whether I am looking at regions that are true ROHs, or they are a low-diversity for other reasons (genomic architecture, selective sweep, etc.). It's also difficult to do this eyeballing with a scaffold-level assemblies (hundreds of tiny plots) rather than chromosome-level assemblies.

I noticed that publications recently citing ROHan do not usually report how they determined their rohmu setting.

Some ideas I thought of:

1) Keep increasing rohmu until stopping at the highest possible value that still results in zero ROHs identified in at least one individual in my population. This would suggest that rohmu cannot yet be too high that it has mistakenly started identifying low-diversity regions as ROHs. However, if all individuals are highly inbred, then this method would result in a rohmu value that is too low and I would miss ROHs.

2) Examine a histogram of heterozygosity estimates in windows across the genome, and see if it is clearly bimodal with one peak likely corresponding to ROH regions and another peak corresponding to non-ROH regions. This didn't work so well in my species as the distribution for different individuals was often unimodal, or it was bimodal but the peaks overlapped greatly. Perhaps my expectations were also not correct.

3) Calculate an empirically expected background heterozygosity in true ROHs based on mutation rate, recombination rate, sequencing error... but I don't know of such a formula.

Thanks! I'd gladly welcome suggestions.

Jordan

grenaud commented 3 years ago

Dear Jordan, Yes this is indeed a weakness of the program. I am currently on paternity leave and I use my 1h or so every evening to answer emails and I can do very little work these days. Could you re-ping during the summer when my paternity leave will be over?

Thank you for your message!

jordanbemmels commented 3 years ago

Wow, exciting!! Congratulations!!! 🎉

Don't worry about replying before summer, but I have been thinking about it a bit more and I thought I'd post what I came up with in case it is helpful for other users to see.

I am thinking of a modification of #2 above. In my situation I have 55 individuals distributed in several species of the same genus of bird, but different species have quite different genetic diversity levels. I want to set rohmu high enough so that it correctly identifies true ROHs, but still low enough so that it doesn't incorrectly identify false ROHs in low-diversity regions of the lowest-diversity species.

I first calculated the average heterozygosity in all windows across all 55 individuals, and took the bottom 5th percentile, to identify genomic windows (filtered for a minimum number of sites with data) that are consistently low genetic diversity across species.

Then, I plotted a histogram of heterozygosity in these windows, only for the 5 individuals with the lowest overall genetic diversity in my data. This was to get an idea of the distribution of heterozygosity in the consistently low-diversity windows in the lowest-diversity species. I saw that this distribution is strongly bimodal (attached pdf). I inferred that the lower peak is likely regions that are ROH due to inbreeding, and the higher peak is likely non-ROH regions. Then, I chose a value of 5e-5 (red vertical line) that is close to but a little bit less than the minimum value of the second (higher) peak, because I assume that the left tail of this second peak represent the lowest diversity I might expect to observe in my dataset in my lowest-diversity species in non-ROH regions. For the higher-diversity species I presume a higher cutoff may be more reasonable, but I can't push rohmu any higher or else I will start incorrectly identifying non-ROH regions as ROH in this lowest-diversity species.

Anyway, it's a bit less arbitrary. Perhaps this only applies to my experimental setup but I shared here in case someone finds it helpful. (Criticisms welcome!).

06_rohmu_He_plot.pdf

sagitaninta commented 3 years ago

Hi Jordan,

I am facing the same problem with ROHan and currently resorted to number 3. My way is to identify all male samples from the ratio of mapped reads between X chromosome and autosomes and then calculate the heterozygosity of X chromosome of male samples as background heterozygosity for --rohmu. Of course, this only works if you are working on the same diploid taxa with similar sexual chromosome system.

Number 2 is actually a good idea and I am inclined to confirm my results with that. I can point you also to some papers that I think have been testing that in their supplementary material. Happy to chat further about this while Gabriel is busy! :p Feel free to e-mail me at s.g.aninta[at]qmul.ac.uk.

Best, Sabhrina

janawold1 commented 9 months ago

Hello! Just pinging this thread again to see if there is a 'prefered' method for inferring --rohmu?

grenaud commented 9 months ago

@janawold1 I have not had time so far to find a more clever way of estimating rohmu. My apologies! Currently what I recommend is: 1) run ROHan with a default --rohmu 2) look at a PDF with the different states of the hidden markov model 3) do you see clear runs of homozygosity that don't get flagged as ROHs? Then you might need to increase rohmu. 4) do you see regions that are clearly not ROH that get flagged as ROH? If so, you need to decrease rohmu

Here's something very important: you do not need to rerun the whole theta estimation for the HMM decoding use the "--hmm".

Alternatively, you can run once with default parameters and default --rohmu and rerun with --hmm with various values of --rohmu and see what works best.

janawold1 commented 8 months ago

Fantastic! Thank you for that clear explanation. :)

janawold1 commented 8 months ago

Sorry to drop by again, I'm hoping to make sure I have a full understanding of the --rohmu parameter. As I understand it, this is the amount of heterozygosity within a ROH you're willing to tolerate. So within a 50kb window, a rohmu of 5e-5 should equate to 2.5 heterozygous sites within that window (50,000 * 5e-5)?

Apologies if that's completely off the mark!

grenaud commented 8 months ago

Hi Jana, You are absolutely right, it is the number of " heterozygous sites" in an ROH you are willing to tolerate. There are many reasons for those, 1) incorrect call due to mismapping 2) somatic mutations 3) bona fide de novo mutations etc...

Let me know if I can assist you further or if you need help interpreting output. My advice is 1) run the program with an initial value of rohmu 2) check the hmm decoding 3) rerun without re-running the estimate of heterozygosity with either a higher or lower ROH.

Please note that I have seen certain examples but a heterozysity so low that there's no way that you can distinguish an ROH from a non-ROH.

zsdxgl commented 2 months ago

Hi Jana, You are absolutely right, it is the number of " heterozygous sites" in an ROH you are willing to tolerate. There are many reasons for those, 1) incorrect call due to mismapping 2) somatic mutations 3) bona fide de novo mutations etc...

Let me know if I can assist you further or if you need help interpreting output. My advice is

run the program with an initial value of rohmu

check the hmm decoding

rerun without re-running the estimate of heterozygosity with either a higher or lower ROH.

Please note that I have seen certain examples but a heterozysity so low that there's no way that you can distinguish an ROH from a non-ROH.

Could you give the examples in detail and is it published?

grenaud commented 2 months ago

Hi! For 1) I would simply run the default values For 2) there are het and hmmp pdfs being generated, have a look manually to see if you can make out ROHs that do not get flagged. For 3), rerun with a --hmm to skip the computation of the heterozygosity rate. Set rohmu to something higher if you see clear ROHs that do not get flagged.

Yes: Gabriel Renaud, Kristian Hanghøj, Thorfinn Sand Korneliussen, Eske Willerslev, Ludovic Orlando, Joint Estimates of Heterozygosity and Runs of Homozygosity for Modern and Ancient Samples, Genetics, Volume 212, Issue 3, 1 July 2019, Pages 587–614, https://doi.org/10.1534/genetics.119.302057

`@article{10.1534/genetics.119.302057, author = {Renaud, Gabriel and Hanghøj, Kristian and Korneliussen, Thorfinn Sand and Willerslev, Eske and Orlando, Ludovic}, title = "{{Joint Estimates of Heterozygosity and Runs of Homozygosity for Modern and Ancient Samples}}", journal = {Genetics}, volume = {212}, number = {3}, pages = {587-614}, year = {2019}, month = {05}, abstract = "{Both the total amount and the distribution of heterozygous sites within individual genomes are informative about the genetic diversity of the population they belong to. Detecting true heterozygous sites in ancient genomes is complicated by the generally limited coverage achieved and the presence of post-mortem damage inflating sequencing errors. Additionally, large runs of homozygosity found in the genomes of particularly inbred individuals and of domestic animals can skew estimates of genome-wide heterozygosity rates. Current computational tools aimed at estimating runs of homozygosity and genome-wide heterozygosity levels are generally sensitive to such limitations. Here, we introduce ROHan, a probabilistic method which substantially improves the estimate of heterozygosity rates both genome-wide and for genomic local windows. It combines a local Bayesian model and a Hidden Markov Model at the genome-wide level and can work both on modern and ancient samples. We show that our algorithm outperforms currently available methods for predicting heterozygosity rates for ancient samples. Specifically, ROHan can delineate large runs of homozygosity (at megabase scales) and produce a reliable confidence interval for the genome-wide rate of heterozygosity outside of such regions from modern genomes with a depth of coverage as low as 5–6× and down to 7–8× for ancient samples showing moderate DNA damage. We apply ROHan to a series of modern and ancient genomes previously published and revise available estimates of heterozygosity for humans, chimpanzees and horses.}", issn = {1943-2631}, doi = {10.1534/genetics.119.302057}, url = {https://doi.org/10.1534/genetics.119.302057}, eprint = {https://academic.oup.com/genetics/article-pdf/212/3/587/42104856/genetics0587.pdf}, }

`

grenaud / ROHan

Non-arbitrary selection of rohmu parameter? #12