jeffersonfparil / compare_genomes

A comparative genomics workflow using Nextflow, conda, Julia and R
GNU General Public License v3.0

RE: 4DTv and WGD #17

Open Malabady opened 5 months ago

Malabady commented 5 months ago

Hello Jefferson,

My analysis includes my species of interest plus six other species. In my "comparisons_4DTv.txt", I listed single species as well as pairs of species, as follows:

    mySpecies
    SpeciesA
    SpeciesB
    SpeciesC
    mySpecies X SpeciesA
    mySpecies X SpeciesB
    mySpecies X SpeciesC
    mySpecies X SpeciesD
    mySpecies X SpeciesE
    mySpecies X SpeciesF


Species A and B are the closest species to mySpecies, in that order. Species C, D, and E are from a different family. Species F is included as a monocot outgroup.

My questions:
1) In the 4DTv density plot, I only see plots for the comparisons, but not for the single species that I listed. Is this expected?

2) All plots have non-zero peaks at different 4DTv accumulation rates. I am a bit unclear about what this means. Does it mean that there was one WGD event, and the different 4DTv peaks simply reflect the divergence time between the two species? From the workflow script, I see that it uses the pairwise paralogs and pairwise orthologs of the compared species. For instance, I assume the density plot for "mySpecies X SpeciesA" includes the 4DTv rates from the pairwise paralogs of mySpecies and SpeciesA, plus the pairwise orthologs between mySpecies and SpeciesA, and so on. That would be why the plots have different peaks while all referring to a single WGD event. Is this an accurate understanding?

3) In the main output directory, there is a "*.4DTv" file for every species in the analysis. When I made a density plot from the 4th column (assumed to be the 4DTv rates) of each file, the plot was totally different from the one generated by the pipeline. The files contain statistics only for pairwise genes, which I assume is what the workflow uses. So why are the plots different? In fact, my density plot of the 4DTv of mySpecies (attached) has only a peak at zero, which differs from what the pipeline produces in the comparisons.

![image](https://github.com/jeffersonfparil/compare_genomes/assets/9359920/ea3ddb6d-0e05-45f8-a9f8-e875ce456e81)

I am sure I am missing many things here, I really appreciate it if you could point them out. 
Many thanks.
jeffersonfparil commented 5 months ago

> 1) In the 4DTv density plot, I only see plots for the comparisons, but not for the single species that I listed. Is this expected?

Very curious results! This is probably a species-name parsing issue. Have you tried running the R script manually from here?

jeffersonfparil commented 5 months ago

> 2) All plots have non-zero peaks at different 4DTv accumulation rates. I am a bit unclear about what this means. Does it mean that there was one WGD event, and the different 4DTv peaks simply reflect the divergence time between the two species? From the workflow script, I see that it uses the pairwise paralogs and pairwise orthologs of the compared species. For instance, I assume the density plot for "mySpecies X SpeciesA" includes the 4DTv rates from the pairwise paralogs of mySpecies and SpeciesA, plus the pairwise orthologs between mySpecies and SpeciesA, and so on. That would be why the plots have different peaks while all referring to a single WGD event. Is this an accurate understanding?

Yes, this is what is being plotted; refer to the same lines in the R script as above.

jeffersonfparil commented 5 months ago

> 3) In the main output directory, there is a "*.4DTv" file for every species in the analysis. When I made a density plot from the 4th column (assumed to be the 4DTv rates) of each file, the plot was totally different from the one generated by the pipeline. The files contain statistics only for pairwise genes, which I assume is what the workflow uses. So why are the plots different? In fact, my density plot of the 4DTv of mySpecies (attached) has only a peak at zero, which differs from what the pipeline produces in the comparisons.

This is unexpected. Let me know how manually running the Rscript goes for you.

Malabady commented 5 months ago

Hello,

Thank you for the replies. As you suspected, parsing of the "comparison" file failed because every single-species line had an extra tab at the end, before the newline character. After fixing this, I now get both the individual-species and the comparison plots. Running the R script directly, I got the following plot:

4DTv_Rplot_legend_2.pdf

As you can see, the individual species have 4DTv peaks at "0", but the comparisons have peaks at various 4DTv rates. I am not sure how to interpret this. Is it expected? What is the difference between the single-species and comparison peaks? I understand the difference in the plotted data, but I am unclear about the interpretation.

Many thanks for your help,

Best,
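
For anyone hitting the same trailing-tab problem, a pre-flight check on the comparisons file may help. The sketch below is only an illustration in Python and is not part of the workflow; the function names are hypothetical, and it assumes one comparison per line with pairs written as `name X name`, as in the listing at the top of this issue.

```python
def lines_with_trailing_whitespace(lines):
    """Return 1-based numbers of lines that end in spaces or tabs.

    Hypothetical helper: only the line terminator is ignored, so a tab
    sitting just before the newline is reported.
    """
    return [
        number
        for number, raw in enumerate(lines, start=1)
        if raw.rstrip("\r\n") != raw.rstrip()
    ]


def parse_comparisons(lines):
    """Parse comparison lines into tuples of species names.

    Assumption: a single species is one name per line and a pair is
    written as 'name X name'. Surrounding whitespace is stripped, so a
    stray trailing tab no longer changes the parsed name.
    """
    comparisons = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        comparisons.append(tuple(part.strip() for part in line.split(" X ")))
    return comparisons


if __name__ == "__main__":
    example = ["mySpecies\t\n", "mySpecies X SpeciesA\n", "\n"]
    print(lines_with_trailing_whitespace(example))
    print(parse_comparisons(example))
```

Running the first helper on the file before launching the workflow points at the offending lines, which can then be cleaned in a text editor.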

jeffersonfparil commented 5 months ago

You can think of the between species-pair 4DTvs as divergence times between their shared genomes. These comparisons are usually visualised/assessed when you believe the two species share a common ancestor, or one is hypothesised to be the parent of another - similar to my discussion point in the paper: "Arabidopsis suecica, an allopolyploid hybrid of A. thaliana and A. arenosa..."
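
To make the plotted quantity concrete: 4DTv is the fraction of fourfold-degenerate third-codon positions at which two aligned coding sequences differ by a transversion. The sketch below computes that raw fraction for one gene pair in Python. It is an illustration only and not the pipeline's code: the function name `raw_4dtv` is hypothetical, it assumes the two sequences are already codon-aligned and gap-free, it uses the standard genetic code, and it applies no correction for multiple substitutions.

```python
# Codon prefixes (first two bases) whose third position is fourfold
# degenerate in the standard genetic code.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}
VALID_BASES = {"A", "C", "G", "T"}


def raw_4dtv(seq_a, seq_b):
    """Return the uncorrected 4DTv for two codon-aligned sequences.

    Hypothetical helper. Returns None when the pair has no comparable
    fourfold-degenerate site.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")
    sites = 0
    transversions = 0
    for start in range(0, len(seq_a), 3):
        codon_a = seq_a[start:start + 3].upper()
        codon_b = seq_b[start:start + 3].upper()
        # Both codons must share the same fourfold-degenerate prefix.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        base_a, base_b = codon_a[2], codon_b[2]
        if base_a not in VALID_BASES or base_b not in VALID_BASES:
            continue  # skip ambiguous bases
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if (base_a in PURINES) != (base_b in PURINES):
            transversions += 1
    return transversions / sites if sites else None


if __name__ == "__main__":
    print(raw_4dtv("GCTGGACTCATG", "GCAGGACTGATG"))
```

Gene pairs with identical third positions at these sites give a value of 0, and the value grows as the two copies accumulate transversions.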