Is it possible/appropriate to determine the WGD events from the genome coverages?

zengxiaofei commented 7 years ago

In many whole-genome de novo sequencing projects, the scaffolds and contigs are not assembled into pseudomolecules. As a result, it is very difficult to determine the exact ratio between the subject genome and the reference genome from a synteny map.

According to your description in README.rst, quota_align.py can calculate the genome coverages for a specified ratio, and the coverage will be too low if the wrong ratio is specified. So my question is: is it possible or appropriate to determine the exact ratio between two genomes from these coverages?

Here are two real examples

Example 1:

sp1: the species we studied (unknown)
Cca: Coffea canephora (no WGD event after γ)
Vvi: Vitis vinifera (no WGD event after γ)
Sly: Solanum lycopersicum (genome triplicated after γ)

sp1 vs Cca

--quota	genome X coverage (sp1)	genome Y coverage (Cca)
1:1	55.6%	95.6%
2:1	85.4%	97.0%
3:1	95.6%	96.9%
4:1	95.7%	96.9%
6:1	95.7%	96.9%

sp1 vs Vvi

--quota	genome X coverage (sp1)	genome Y coverage (Vvi)
1:1	58.7%	93.3%
2:1	84.4%	94.9%
3:1	93.3%	94.5%
4:1	93.5%	94.2%
6:1	93.5%	94.2%

sp1 vs Sly

--quota	genome X coverage (sp1)	genome Y coverage (Sly)
1:3	72.5%	95.5%
2:3	92.6%	99.5%
3:3	99.6%	99.6%

Question:

Can I infer that the sp1 genome underwent a whole-genome triplication after γ?

Example 2:

sp2: the species we studied (unknown)
Cca: Coffea canephora (no WGD event after γ)

sp2 vs Cca

--quota	genome X coverage (sp2)	genome Y coverage (Cca)
1:1	37.5%	95.0%
2:1	60.5%	97.1%
3:1	74.3%	95.9%
4:1	83.4%	95.9%
6:1	89.6%	95.6%
8:1	90.3%	95.5%

Question:

Can I infer that the sp2 genome underwent one round of whole-genome triplication and one round of whole-genome duplication (3 * 2 = 6) after γ?

I tested this method on Arabidopsis vs grape, Arabidopsis vs Brassica rapa, and poplar vs peach, and it seemed to work well.

Thanks for your attention!
Xiaofei Zeng

tanghaibao commented 7 years ago

@zengxiaofei I have been struggling to find an objective method for calling WGD ploidies over the past few years. The method you described might work, although the cutoff (beyond which the coverage saturates) is sometimes a bit difficult to call. What BLAST filtering option did you use? For these types of analyses I often prefer to filter the results down to reciprocal best hits only (blast_to_raw.py, with something like --cscore=.99).
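One way to make the saturation cutoff explicit is to look at how much coverage each extra unit of quota buys. The snippet below is only a rough sketch of that idea, not something quota-alignment provides: the function name `saturation_quota` and the `min_gain` cutoff (1 percentage point per quota unit) are assumptions made up for illustration, and the numbers are the genome X coverages from Example 1 above.

```python
def saturation_quota(rows, min_gain=1.0):
    """Return the first quota after which coverage stops rising.

    rows: list of (quota, coverage_percent) pairs, sorted by quota.
    min_gain: hypothetical cutoff, in percentage points of coverage
        gained per extra unit of quota (an assumption, not a tool default).
    Returns None if coverage is still rising at the largest quota tested.
    """
    for (q0, c0), (q1, c1) in zip(rows, rows[1:]):
        # Normalize by the quota step, because the tested quotas
        # are not evenly spaced (e.g. 4:1 is followed by 6:1).
        if (c1 - c0) / (q1 - q0) < min_gain:
            return q0
    return None


# Genome X coverage (sp1) at quota N:1, copied from Example 1.
sp1_vs_cca = [(1, 55.6), (2, 85.4), (3, 95.6), (4, 95.7), (6, 95.7)]
sp1_vs_vvi = [(1, 58.7), (2, 84.4), (3, 93.3), (4, 93.5), (6, 93.5)]

for name, rows in [("sp1 vs Cca", sp1_vs_cca), ("sp1 vs Vvi", sp1_vs_vvi)]:
    print(f"{name}: coverage saturates at {saturation_quota(rows)}:1")
```

With this cutoff both comparisons point to 3:1, but the answer depends on the chosen `min_gain`, so it is a heuristic rather than an objective call.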

zengxiaofei commented 7 years ago

@tanghaibao Thank you for your reply!

First of all, please forgive me for deleting the figures and changing the species names in this issue. I used --cscore=.5 for the analyses I posted yesterday. I have now also tried --cscore=.99. Here are the results:

Example 1:

sp1 vs Cca

--quota	genome X coverage (sp1)	genome Y coverage (Cca)
1:1	58.4%	94.3%
2:1	87.7%	96.9%
3:1	95.4%	97.2%
4:1	95.4%	97.2%
6:1	95.4%	97.2%

sp1 vs Vvi

--quota	genome X coverage (sp1)	genome Y coverage (Vvi)
1:1	64.4%	92.7%
2:1	90.7%	94.7%
3:1	93.6%	94.7%
4:1	93.6%	94.6%
6:1	93.6%	94.6%

Does this make the 3:1 guess more reliable?

Example 2:

sp2 vs Cca

--quota	genome X coverage (sp2)	genome Y coverage (Cca)
1:1	47.9%	93.2%
2:1	73.5%	94.7%
3:1	85.5%	93.8%
4:1	89.1%	94.0%
6:1	89.8%	93.8%
8:1	89.8%	93.8%

Can I still infer 6:1? It becomes difficult to distinguish 4:1 from 6:1 when the actual ratio is this high.
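To show how fragile the call is for sp2, here is a small sketch that applies a simple "gain per quota unit" rule at several cutoffs. The helper `saturation_quota` and the cutoff values are hypothetical choices for illustration only, not part of quota-alignment. The coverages are the genome X (sp2) values from the two sp2 vs Cca tables above.

```python
def saturation_quota(rows, min_gain):
    """First quota after which coverage gains less than min_gain
    percentage points per extra quota unit; None if never reached."""
    for (q0, c0), (q1, c1) in zip(rows, rows[1:]):
        if (c1 - c0) / (q1 - q0) < min_gain:
            return q0
    return None


# Genome X coverage (sp2) at quota N:1 under the two BLAST filters.
sp2_filter_050 = [(1, 37.5), (2, 60.5), (3, 74.3), (4, 83.4), (6, 89.6), (8, 90.3)]
sp2_filter_099 = [(1, 47.9), (2, 73.5), (3, 85.5), (4, 89.1), (6, 89.8), (8, 89.8)]

# Sweep a few arbitrary cutoffs to see how much the call moves.
for min_gain in (0.3, 1.0, 4.0):
    call_050 = saturation_quota(sp2_filter_050, min_gain)
    call_099 = saturation_quota(sp2_filter_099, min_gain)
    print(f"min_gain={min_gain}: filter .5 -> {call_050}, filter .99 -> {call_099}")
```

Under these assumptions the inferred ratio ranges from 3:1 to 6:1 (or no saturation at all) depending on the cutoff and the filter, which is the ambiguity described above. The quotas 5:1 and 7:1 were not tested, so the step from 4:1 to 6:1 is averaged over two units.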
