Paleopolyploidy - Githubissues

xiaoyezao commented 2 years ago

Hi John,

I am working on plant genomes for which paleo-polyploidization is quite common. For example, many genomes from the sunflower family have undergone multiple rounds of polyploidization, although they are now diploid because of genome diploidization.

Depending on the age of the WGD/WGT, the duplicated content has been retained to different degrees in the current diploid genomes. How do we take this into account when setting the ploidy parameter?

Best wishes, Tao

jtlovell commented 2 years ago

Hi Tao, This is a very good question. The bioRxiv version of the manuscript doesn't get into this issue at all, but it is one we discuss at length in the forthcoming eLife pub. I'm pasting some text from that article below to give details. But, in short - you need to embrace the phylogenetic context of your study.

In this case, coffee and grape should have ploidy = 1; lettuce = 2; sunflower = 4; Mmicrantha = 8; Ecanadensis = 2; Aannua, C. seticuspe, Cnankingesne = 4. There really isn't any way to get away from this from a genespace perspective ... unless the diploidization has occurred by eliminating an entire subgenome (which isn't super common afaik). You can simplify this run substantially by removing coffee and grape from the run. If there isn't much fractionation, ploidy settings won't have a big impact, since scores will be similar between homeologs and orthogroups will span subgenomes regardless.

Text from the upcoming pub:

Outgroups and the phylogenetic context of orthology inference. OrthoFinder defines orthogroups as the set of genes that are descended from a single gene in the last common ancestor of all the species being considered. As such, the scale of the run matters, often significantly. For example, an orthogroup would not be likely to contain homeologs across the two ancient sub-genomes for a run that included only two maize genomes. Since the coalescence of any two maize genotypes occurred more recently than the ~12 Mya WGD, few homeologs would both be descended from the same common ancestor when considering only maize genotypes. Hence, the within-maize NAM parent run (Figure 3D) excludes homeologs. However, if an outgroup to maize was included in the run, both maize homeologs would be likely to show common ancestry to a single gene in the outgroup, thus connecting the maize homeologs into a single orthogroup. Hence, both maize homeologous regions are present in the across-grasses synteny graph (Figure 3A) despite using identical synteny parameters to the maize NAM parent run. Given the potentially significant role of outgroups on the results of the global run, GENESPACE offers an "outgroup" parameter, which specifies the genomes that should be included in the OrthoFinder run but excluded from all downstream analyses.
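The effect of an outgroup described in that passage can be illustrated with a toy model in Python. This is only a sketch of the definition quoted above (an orthogroup is everything descended from one gene in the last common ancestor of the species in the run); it is not how OrthoFinder actually infers orthogroups. The gene names, the 0.5 Mya maize coalescence, and the 50 Mya outgroup split are hypothetical values chosen for illustration; only the ~12 Mya WGD comes from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    age: float                      # node age in Mya (0 for sampled genes)
    name: str = ""                  # gene name, set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[str]:
    """Return the names of all genes below a node."""
    if not node.children:
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaves(child))
    return names


def orthogroups(node: Node, lca_age: float) -> List[List[str]]:
    """Cut the gene tree at the age of the species' last common ancestor.

    Every maximal subtree whose root is no older than lca_age descends from
    a single gene in that ancestor, so it forms one orthogroup.
    """
    if node.age <= lca_age:
        return [sorted(leaves(node))]
    groups: List[List[str]] = []
    for child in node.children:
        groups.extend(orthogroups(child, lca_age))
    return groups


def maize_wgd_tree() -> Node:
    # Two hypothetical maize genotypes, each carrying both homeologs.
    sub1 = Node(0.5, children=[Node(0, "maizeA_sub1"), Node(0, "maizeB_sub1")])
    sub2 = Node(0.5, children=[Node(0, "maizeA_sub2"), Node(0, "maizeB_sub2")])
    return Node(12.0, children=[sub1, sub2])   # ~12 Mya WGD (from the text)


# Run 1: maize genotypes only. The species LCA (0.5 Mya, hypothetical) is
# younger than the WGD, so the homeologs fall into separate orthogroups.
maize_only = orthogroups(maize_wgd_tree(), lca_age=0.5)

# Run 2: add an outgroup that split off before the WGD (50 Mya, hypothetical).
with_outgroup_tree = Node(50.0, children=[Node(0, "outgroup_gene"), maize_wgd_tree()])
with_outgroup = orthogroups(with_outgroup_tree, lca_age=50.0)

print(maize_only)
print(with_outgroup)
```

In the maize-only run the two subgenome copies end up in different groups, while adding the outgroup pulls all four maize genes and the outgroup gene into a single group. This is the behaviour the passage attributes to the within-maize and across-grasses runs.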

alexvasilikop commented 2 years ago

Hello,

I have a similar dataset (paleotetraploid genomes with each of the 6 haploid chromosomes being homeologous to another - 3 pairs of homeologs per genome). I ran the analyses with ploidy 1 and without any outgroup in the gpar parameters (the polyploidization happened before the common ancestor of all genomes in the dataset). Given your explanation, I would expect no secondary hits in the riparian plot (note: these secondary hits come from the homeologous chromosomes of the different species). What do these secondary hits mean in the riparian plot and how can I remove them? (Each horizontal set is a haploid genome of a different species.)

In addition, I tried to run with ploidy = 2 to visualize all connections (between non-orthologous chromosomes as well), but the graphics look a bit distorted except for the first 2 genomes. What is the reason for this? (See second plot.)

jtlovell commented 2 years ago

So the top is treating the genomes as haploid, the bottom as diploid ... is that right?

If so, it generally looks like it is behaving correctly. Is the issue the three small apparent duplications between the top two genomes? If so, this is probably best solved by increasing the stringency of your synteny parameters (like we did for the cotton run in the paper). Depending on the age of the WGD, some over-retained regions will be in there by chance. Or you may have true regions of over-retained duplicates (like in peanut, the grasses, etc.)

On another note, it looks like you may have an issue with your assembly - most chromosome ends have large regions with few genes (or little synteny) ... it could also be that the assembly is getting confused between your homeologs.

alexvasilikop commented 2 years ago

Yes, this is correct (top is ploidy=1 and bottom is ploidy=2). I was just wondering why it shows these connections/duplications to the homeologs if the ploidy is 1 and no outgroup is used. I thought the aim is to map syntenic orthologous regions uniquely across genomes?

The synteny seems to be conserved only in the middle of the chromosomes in these species. We have observed that the telomeric regions seem to be highly dynamic in these species, possibly also with a lot of reshuffling and foreign genes (but this is not the case for all chromosomes, e.g. orthologs of chrom2 from the bottom, purple color). So it is likely not an issue with the assembly (Hi-C data confirmed the chromosomes when scaffolding) but is due to a lack of synteny. I was also wondering why in the second plot the braids appear faded except for the first two species. It seems to be an issue with the graphics?

jtlovell commented 2 years ago

" just wondering why it shows these connections/duplications to the homeologs if the ploidy is 1 and no outgroup is used" can you point to what you mean? I am not seeing this besides the three minor ones I describe above.

"I was wondering why in the second plot the braids appear faded except for the first two species). it seems an issue with the graphics?" I have no idea ... this looks like a rendering (or maybe copy/paste) issue.

alexvasilikop commented 2 years ago

I think we are referring to the same thing, e.g. this pink braid mapping onto the red region from the homeologous chromosomes at the very top of the first image. Using more stringent criteria helps, but then the synteny fades more and more at the centers of the chromosomes.

jtlovell commented 2 years ago

Then my guess is it's real. See my comments above. See fig 4 where we dealt with a similar complexity.

alexvasilikop commented 2 years ago

Ok thanks but just to be sure I understand: Is it because the homoeologs in these regions are too similar to their orthologous regions and cannot be clearly distinguished (based on the stringency criteria)?

jtlovell commented 2 years ago

That's right - genespace subsets to the top n hits for each gene (with haploid, this means 1), then applies the other synteny parameters. If you cannot exclude a region by increasing the stringency of the other syn params without altering the synteny map itself, then it likely means that the regions have very similar sequences. Make a dotplot like fig 4 to be sure.
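To make the "top n hits for each gene" step concrete, here is a minimal Python sketch. It is an illustration of the idea only, not GENESPACE's code; the function name `top_n_hits`, the gene and chromosome labels, and the scores are all hypothetical.

```python
from collections import defaultdict


def top_n_hits(hits, n):
    """Keep the n highest-scoring hits for each query gene.

    hits: iterable of (query_gene, target_gene, score) tuples.
    """
    by_query = defaultdict(list)
    for query, target, score in hits:
        by_query[query].append((query, target, score))

    kept = []
    for rows in by_query.values():
        rows.sort(key=lambda row: row[2], reverse=True)  # best score first
        kept.extend(rows[:n])
    return kept


# Hypothetical hits of three query genes against a paleotetraploid target
# whose chromosomes chrA and chrB are homeologous.
hits = [
    ("g1", "chrA_g1", 950.0),  # orthologous copy
    ("g1", "chrB_g1", 940.0),  # homeolog with a nearly identical score
    ("g2", "chrA_g2", 900.0),
    ("g2", "chrB_g2", 400.0),  # clearly diverged homeolog
    ("g3", "chrB_g3", 880.0),  # here the homeolog happens to score higher
    ("g3", "chrA_g3", 870.0),
]

haploid = top_n_hits(hits, n=1)   # analogous to treating the genome as haploid
diploid = top_n_hits(hits, n=2)   # keeps both copies for every gene

print([target for _, target, _ in haploid])
```

With n = 1, g3 keeps its chrB copy because that copy scores slightly higher. When homeologs are this similar, a single top hit can land on the homeologous chromosome, which is one way secondary braids can survive a haploid setting until the other synteny parameters remove them.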

jtlovell / GENESPACE

Paleopolyploidy #28