Differentiation between true wgd and artifact

manoharbisht1998 commented 2 months ago

Hi, Thanks for the tool and its proper documentation!

I successfully run the analysis for my paralogs and I got the results attached from wgd ksd, wgd syn. However, I am confused about the peak consideration. As you can see in the file with the paranome ks distribution, I am getting two peaks at 0.04 and 0.5. and I am also attaching the anchor pair file and wgd peak result of these anchor pairs. I think 0.5 ks value is more reliable as compared 0.05, but I am not able to figure this out how should I negiate the peak at 0.05 as false positive as it can also be due to transposon activity and subgenome divergence. Please help! WGD_paralogs.pdf final_true_cds_WH_single.fasta.tsv.ksd.pdf AnchorKs_PeakCI_final_true_cds_WH_single.fasta.tsv.ks.tsv_node_weighted.pdf

Thanks Manohar

heche-psb commented 2 months ago

Hi, how does the dotplot look like? If this species really experiences a very recent WGD, we expect to see a bulk of long collinear segments throughout most of its chromosomes.

manoharbisht1998 commented 2 months ago

Hi, thanks for the suggestion. I am attaching both the dot plot and the ks value dot plot here. However, the genome is not at the chromosomal level. can you able to figure out anything from this? and also, is there any way we can reduce the noise in these dot plots? final_true_cds_WH_single.fasta-vs-final_true_cds_WH_single.fasta.dot.pdf

final_true_cds_WH_single.fasta-vs-final_true_cds_WH_single.fasta_Ks.dot.pdf Thanks Manohar

heche-psb commented 2 months ago

Hi, a quick operational note, that you may set the option --minseglen 100000 and --mingenenum 50 to filter out short collinear segments in terms of the length in base and the number of residing genes, so as to drop those noisy dots, as well as setting --minlen 1000000 to skip those short scaffolds at the very beginning.

manoharbisht1998 commented 2 months ago

Hi, sure, I will use the --mingle and --mingenenum parameters. However --minlen is already set to 10% of the maximum scaffold length, right? so according to that it will be more than 1Mb, in my case. I will revert with new dot plots. Thanks!

heche-psb commented 2 months ago

Yes, the default --minlen is the 10% of the longest scaffold.

manoharbisht1998 commented 2 months ago

So I rerun wgd syn, and I am getting almost a similar plot. Can we identify any wgd event from this? (PS: the species is tetraploid) final_true_cds_WH_single.fasta-vs-final_true_cds_WH_single.fasta.dot.pdf

heche-psb commented 2 months ago

The Ks peak around Ks ~ 0 is often observed in transcriptome and fragmentary genome assemblies. The former is usually casued by the redundancy introduced from alternative splicing while the latter is usually caused by the misassembly that two fragments representing the same scaffold failed to be merged but remained as two highly similar fragments instead. One thing you may check is the location of those anchor pairs with Ks < 0.1. They might all come from small fragments instead of those longer ones, which will be indicative of problematic assembly. A second thing you may check is to annotate the transposable elements (TEs) among anchor pairs, by simply marking those anchor pairs identified as TEs with another color, so as to rule out the possibility of TEs-derived peak. Another practice you may try, is to first divide your input cds file as sub-genome A and sub-genome B, and then construct their respective Ks distributions and inspect the presence of Ks peak ~ 0. If at sub-genome level the Ks peak ~ 0 is gone, you may conclude that the incipient Ks peak ~ 0 actually marked the divergence of sub-genomes.

manoharbisht1998 commented 2 months ago

Thanks for the quick and constructive reply. I will check the origin of Anchor pairs. However, about the sub-genome part, my assembly is not at the chromosome level, but can we divide our input CDS or proteins at the Sub-genome level without dividing the genome, which is not possible for scaffold-level genome assembly?

heche-psb commented 2 months ago

As to the division of sub-genomes, you may refer to the latest paper https://doi.org/10.1016/j.tig.2024.03.008. I think k-mer based methods are possible for your assembly.

manoharbisht1998 commented 2 months ago

Thank you for the reply. However, most of the tool require a chromosome level genome assembly including K-mer based tool subPhaser. Further, to check the reason for the peak at Ks value 0.05, I think that could also be due to the presence of high segmental duplication (collinearity), which is around 60% in my genome using MCscanx and also evident from the anchor pair graph (attached above). And the high colleniarty among the subgenome is quit logical also?

heche-psb commented 2 months ago

Just my opinion, when it comes to tetraploid, if the subgenome can not be properly parted or accommodated, it should be claimed with caution for high segmental duplication observed at Ks peak close to zero, which is for sure a possibility but more a speculation to validate after ruling out other alternative scenarios.

manoharbisht1998 commented 2 months ago

The Ks peak around Ks ~ 0 is often observed in transcriptome and fragmentary genome assemblies. The former is usually casued by the redundancy introduced from alternative splicing while the latter is usually caused by the misassembly that two fragments representing the same scaffold failed to be merged but remained as two highly similar fragments instead. One thing you may check is the location of those anchor pairs with Ks < 0.1. They might all come from small fragments instead of those longer ones, which will be indicative of problematic assembly. A second thing you may check is to annotate the transposable elements (TEs) among anchor pairs, by simply marking those anchor pairs identified as TEs with another color, so as to rule out the possibility of TEs-derived peak. Another practice you may try, is to first divide your input cds file as sub-genome A and sub-genome B, and then construct their respective Ks distributions and inspect the presence of Ks peak ~ 0. If at sub-genome level the Ks peak ~ 0 is gone, you may conclude that the incipient Ks peak ~ 0 actually marked the divergence of sub-genomes.

Thanks for the suggestion, What output file from wgd syn, I should use to check the Ks value < 0.1?

heche-psb commented 2 months ago

Hi, you may use the *.anchors.ks.tsv file to inspect those anchor pairs with Ks < 0.1

manoharbisht1998 commented 2 months ago

Hi, So, I am checking for the mixture model clustering for anchor ks pairs, and I got the following result. I am confused about which component to consider for a true wgd Original_AnchorKs_GMM_Component4_node_weighted_Lognormal.pdf

heche-psb commented 2 months ago

Hi, the component 1 warrants your further inspect for TEs and residing segments while the component 0 is more clearly an evidence of WGD and component 3 is more likely the result of over-parameterization as to the excessive number of components.

manoharbisht1998 commented 2 months ago

Thank you. what would be the best way to inspect TE at component 1?

heche-psb commented 2 months ago

Hi, I suggest that first performing an extra TE annotation on those anchor pairs and second marking those detected TEs distinctly in the Ks distribution such that you could have a basic idea of the TE impact.

manoharbisht1998 commented 2 months ago

But my question was how TE annotation can be done in gene pairs (CDS sequences)?

heche-psb commented 2 months ago

Hi, I think you can resort to softwares that are specifically designed for TE annoation.

heche-psb / wgd

Differentiation between true wgd and artifact #30