jtlovell / GENESPACE

Interpretation of output files Dotplots-Riparian plots #24

Closed alexvasilikop closed 2 years ago

alexvasilikop commented 2 years ago

Hello John,

I have a question regarding the interpretation of the output dotplot files of GENESPACE. I see that plotting the riparian hits also writes 2 dotplots to the riparian directory for each genome pair. What do these dotplots show? Do you think you could add this to the documentation (vignette)?

I assume that each dot is a pair of orthologs, and that if they are colored then they are syntenic (i.e. found on the same chromosome)? Also, what is the gene rank order that is given on the x and y axes?

In addition, what do the lines in the riparian plots actually represent? Would you say they are blocks of collinear ortholog pairs among the compared species, depending on the parameters specified (e.g. min number of genes per block (blocksize) = 5 and max number of gaps allowed within a block = 25)?

One of my genomes is not included in the riparian plot, and I assume this is due to lack of synteny. Could I use, for example, the first dotplot (non-synteny-constrained OGs) to show that there is no synteny between the compared genomes?

Many thanks

jtlovell commented 2 years ago

The dotplots have changed a bit from version to version - the first plot is (depending on the version) all hits where both the query and target are in the same orthogroup. The second is all syntenic hits. The goal of the dotplots is to give a quick visual of what synteny looks like between any pair of genomes. If you want to make a more visually appealing dotplot, I'd recommend reading in the corresponding .synHits.txt.gz file, plotting the positions using ggplot2, and faceting by chromosomes.
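A minimal sketch of the first half of that suggestion, written in Python rather than R: read a gzipped, tab-delimited hits file and group the hit positions by chromosome pair, which is the grouping a faceted dotplot would draw as one panel each. The column names (chr1, chr2, ord1, ord2) are assumptions for illustration and may not match the header of the .synHits.txt.gz file in your GENESPACE version, so check the header first.

```python
import csv
import gzip
from collections import defaultdict


def hits_by_chrom_pair(path, chr1="chr1", chr2="chr2", x="ord1", y="ord2"):
    """Group (x, y) hit positions by (chromosome 1, chromosome 2).

    The default column names are hypothetical; pass the names found in
    the header of your own file.
    """
    panels = defaultdict(list)
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            key = (row[chr1], row[chr2])
            panels[key].append((float(row[x]), float(row[y])))
    return dict(panels)
```

Each key of the returned dictionary corresponds to one facet, and its list of points is what would be scattered inside that facet.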

The next release will have a better dotplot ... it's just been tricky to get good-looking dotplots without making really big files. Thanks for the recommendation - I'll update the documentation for the next release.

The default riparian plot shows the physical or gene-rank order positions of syntenic blocks, colored by chromosomes of a chosen reference genome. The coordinates are derived from the positions of "anchor" hits, which, depending on your parameter specifications, may or may not be restricted to hits where both the query and target genes are in the same orthogroup (the default, onlyOgAnchors = TRUE). Anchor hits are defined by MCScanX as you describe, then clustered into blocks using dbscan.
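To make the clustering step concrete, here is a small pure-Python sketch of the generic DBSCAN algorithm applied to anchor hits in gene-rank coordinates. It is not GENESPACE's code (which calls an existing dbscan implementation), and the eps and min_pts values in the usage note are hypothetical, not GENESPACE defaults.

```python
import math
from collections import deque

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each (x, y) point with a cluster id, or NOISE."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (including i itself).
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reach)
        cluster += 1
    return labels
```

Called on two separate diagonal runs of anchors plus one stray hit, with for example eps=5 and min_pts=5, this gives each run its own block id and marks the stray hit as noise.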

The current version of GENESPACE should error out if any of your pairs of genomes do not have synteny. So, I'm not sure what to make of that ... could it be that genome2 --> genome1 is in there? synteny works only on unique combinations (not reciprocal).

alexvasilikop commented 2 years ago

Hi John,

Thanks for your reply. The version I am using is 0.9.4 (pre-release), so it should be the one you are referring to (the latest). Actually, yes, you are right that GENESPACE initially throws an error if there is not enough synteny, but if you relax the synteny criteria (blksize=2, n_gaps=60) the script terminates normally. However, the genome that does not share much synteny with the others is not drawn in the riparian plot (although the resulting dotplot files are written). See also here: https://github.com/jtlovell/GENESPACE/issues/21

#############################################
Output:
Aric: 4294 genes in 1876 collinear arrays
Asp: 6649 genes in 2923 collinear arrays
Avaga: 2820 genes in 1295 collinear arrays
Bc: 6967 genes in 2387 collinear arrays
Pulling synteny for 10 unique pairwise combinations of genomes
Running 10 chunks of up to 1 combinations each:
Chunk 1 / 10 (11:55:03) ... Done!
    Asp-Ar: 524897 (tot), 50105/46 (reg), 24356/1896 (blk)
Chunk 2 / 10 (11:56:27) ... Done!
    Asp-Av: 506457 (tot), 52676/42 (reg), 25446/2619 (blk)
Chunk 3 / 10 (11:57:53) ... Done!
    Ar-Av: 419189 (tot), 44323/35 (reg), 22888/2336 (blk)
Chunk 4 / 10 (11:59:19) ... Done!
    Asp-Bc: 159146 (tot), 5169/52 (reg), 849/542 (blk)
Chunk 5 / 10 (11:59:35) ... Done!
    Ar-Bc: 152947 (tot), 5949/60 (reg), 955/630 (blk)
Chunk 6 / 10 (11:59:53) ... Done!
    Av-Bc: 149259 (tot), 5118/55 (reg), 828/532 (blk)
Chunk 7 / 10 (12:00:07) ... Done!
    Asp-Asp: 608255 (tot), 53925/9 (reg), 39055/9 (blk)
Chunk 8 / 10 (12:00:28) ... Done!
    Ar-Ar: 453528 (tot), 46260/11 (reg), 33194/11 (blk)
Chunk 9 / 10 (12:00:42) ... Done!
    Av-Av: 403019 (tot), 40889/6 (reg), 31327/6 (blk)
Chunk 10 / 10 (12:00:54) ... Done!
    Bc-Bc: 244440 (tot), 47038/12 (reg), 20916/12 (blk)
Defining synteny-constrained orthogroups ...
Found 63712 synteny-split OGs for 124547 genes
Found 63712 OGs across 124547 genes. gff3-like text file written to: ~/Documents/genespace//results/gffWithOgs.txt.gz
Calculating syntenic block breakpoints ...
Found 17148 blocks. Text file written to: ~/Documents/genespace//results/syntenicBlocks.txt.gz:
#########################################################
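For comparing genome pairs side by side, the per-pair summary lines in a log like this can be pulled into a small table. The sketch below is an illustration only: the thread does not define the fields, so it assumes that in each slash-separated pair the first number is a hit count and the second a count of regions or blocks, and the blk_share ratio is a rough heuristic of my own, not a GENESPACE statistic.

```python
import re

# Matches e.g. "Asp-Bc: 159146 (tot), 5169/52 (reg), 849/542 (blk)"
PAIR_LINE = re.compile(
    r"(\S+): (\d+) \(tot\), (\d+)/(\d+) \(reg\), (\d+)/(\d+) \(blk\)"
)


def parse_pair_summaries(log_text):
    """Return one dict per genome-pair summary found in the log text."""
    rows = []
    for match in PAIR_LINE.finditer(log_text):
        pair = match.group(1)
        tot, reg_a, reg_b, blk_a, blk_b = (int(v) for v in match.groups()[1:])
        rows.append({
            "pair": pair,
            "tot": tot,
            "reg": (reg_a, reg_b),
            "blk": (blk_a, blk_b),
            # Assumption: blk_a and tot are both hit counts, so this is
            # the share of all hits that ended up in blocks.
            "blk_share": blk_a / tot,
        })
    return rows


if __name__ == "__main__":
    log = (
        "Asp-Ar: 524897 (tot), 50105/46 (reg), 24356/1896 (blk)\n"
        "Asp-Bc: 159146 (tot), 5169/52 (reg), 849/542 (blk)\n"
    )
    for row in parse_pair_summaries(log):
        print(row["pair"], row["blk"], round(row["blk_share"], 4))
```

Under that reading, the pairs involving Bc retain a much smaller share of their hits in blocks than the other between-genome pairs.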

The species I am referring to above is Bc, which is not drawn in the riparian plot even though the script terminates without error. However, the first dotplots are generated, and one of them (for Bc and one of the other species) looks like the image below. What are the colored dots in this case? What do you mean by syntenic hits? Pairs of orthologs that are within collinear blocks?

[image: dotplot for Bc and one of the other species]

jtlovell commented 2 years ago

wow - congrats on figuring out how to break the riparian plotter! ;-) ... I think the issue is that the riparian plotter has a parameter that says not to plot any chromosome with fewer than n anchor hits: minGenes in plot_riparian and minGenes2plot in plot_riparianHits. I bet if you set those to 1, you'd get that genome plotting in the riparian. I'll add it to my to-do list to have these parameters respect the blockSize setting in the gsParam object.
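The effect of such a threshold can be illustrated with a few lines of Python. This is only a sketch of the filtering logic described above, not the plotter's actual code; the function name and the threshold of 5 used in the usage note are hypothetical.

```python
from collections import Counter


def chromosomes_to_plot(anchor_chroms, min_genes):
    """Keep chromosomes that carry at least min_genes anchor hits.

    anchor_chroms holds one chromosome name per anchor hit.
    """
    counts = Counter(anchor_chroms)
    return sorted(c for c, n in counts.items() if n >= min_genes)


if __name__ == "__main__":
    # A genome whose chromosomes carry only a handful of anchors each.
    anchors = ["chr1", "chr1", "chr1", "chr2"]
    print(chromosomes_to_plot(anchors, min_genes=5))
    print(chromosomes_to_plot(anchors, min_genes=1))
```

With a threshold of 5, every chromosome of this toy genome is filtered out, so the genome would not appear at all; with a threshold of 1, both chromosomes are kept.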

GENESPACE is meant to work on genomes with good synteny. As synteny degrades, the value of constraining to syntenic regions does so too. In this case, there is hardly any synteny (maybe none) ... I would highly recommend dropping that genome from your analysis - the colored points are what GENESPACE thinks are syntenic regions given your parameters. Obviously not a lot going on there.

alexvasilikop commented 2 years ago

Thanks John. This is a more distantly related species than the others, and one of the questions was to see if there is any identifiable synteny. At least with respect to the provided parameters, it seems that not much synteny can be identified. However, since I am interested in the evolution of chromosome structures among these distantly related species, I should probably use a different approach, with a predefined set of ancestral linkage groups that have deeply conserved synteny, to interrogate chromosomal evolution (e.g. see here: https://www.science.org/doi/10.1126/sciadv.abi5884). In any case, thanks a lot!