tanghaibao / jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics
BSD 2-Clause "Simplified" License

Combining collinear orthologs from multiple species. #376

Closed Zhuxitong closed 2 years ago

Zhuxitong commented 3 years ago

Hello, haibao

I am now working on 8 species that are very closely related. I know that I could first do pairwise comparisons and then select one reference (like species A in A-B and A-C) to combine all collinear genes with the help of the jcvi.formats.base join command. However, the result depends heavily on the chosen reference, and collinear genes that are absent in A but present in B and C will be missing from the final result.

So I am wondering whether there is a way to combine all collinear genes based on the pairwise comparisons. A paper titled "Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza." describes one approach in its methods:

Syntenic relationships between orthologous genes were mapped for all 78 pairwise combinations of the 13 reference assemblies. We used DAGchainer to identify collinear gene pairs within syntenic blocks, with parameters requiring neighboring genes to be no more than ten genes apart and a minimum chain length of five collinear genes. As this method is strict, we additionally identified ‘in-range’ syntenies of orthologs that mapped no more than five genes distant from the expected collinear position. Clustered sets of syntenic genes encompassing all species were identified by single-linkage clustering over the pairwise relationships.

The pairwise comparisons can be done with MCScanX in place of DAGchainer, but I am not sure what the last sentence means. Since your docs also mention single-linkage clustering, I am wondering if you have any ideas?

Any suggestions would be much appreciated, as this has been troubling me a lot.

tanghaibao commented 3 years ago

@Zhuxitong

You can still perform the pairwise comparison between B and C and get all the syntenic genes in the .anchors file. It is just a little difficult to add them to the .blocks file, since it is tricky to order the pairs without a common column. You can manually build your .blocks file by combining A-B-C and B-C in the same file.

a1   b1   c1
a2   b2   c2
.    b3   c3

If all you want is to plot the synteny based on the .blocks file, you don't have to worry about the sorting of the rows. You can just combine the pairwise .anchors that way, without missing any of the signal that you are afraid of losing.
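As a rough illustration of combining A-B-C and B-C into one table, here is a minimal Python sketch. It is not part of jcvi: the function name `combine_rows` and the in-memory lists of gene pairs are hypothetical, and it assumes each gene has at most one partner per species.

```python
def combine_rows(ab_pairs, ac_pairs, bc_pairs):
    """Build three-column rows (A, B, C), using "." for a missing gene.

    Hypothetical helper: assumes one-to-one anchors, so each gene
    appears in at most one pair per comparison.
    """
    b_of = dict(ab_pairs)  # A gene -> B gene
    c_of = dict(ac_pairs)  # A gene -> C gene
    rows = []
    placed = set()

    # Rows anchored on the reference species A.
    for a in sorted(set(b_of) | set(c_of)):
        b = b_of.get(a, ".")
        c = c_of.get(a, ".")
        rows.append((a, b, c))
        placed.update((b, c))

    # B-C pairs whose genes were not already placed through A.
    for b, c in bc_pairs:
        if b not in placed and c not in placed:
            rows.append((".", b, c))
            placed.update((b, c))
    return rows


ab = [("a1", "b1"), ("a2", "b2")]
ac = [("a1", "c1"), ("a2", "c2")]
bc = [("b1", "c1"), ("b2", "c2"), ("b3", "c3")]
rows = combine_rows(ab, ac, bc)
for row in rows:
    print("\t".join(row))
```

With the toy pairs above, the B-C pair b3/c3 ends up in a row of its own with "." in the A column, as in the example table.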

The cited text describes the same underlying method used in the jcvi package: single-linkage clustering, N genes apart, and a minimum chain length. jcvi, MCScanX, and DAGchainer belong to the same family of methods; they differ in their specific criteria for calling a block. However, all blocks are intrinsically "pairwise", and you need extra effort to combine them into the same blocks.
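For the last sentence of the quoted methods ("single-linkage clustering over the pairwise relationships"), one possible reading is: treat every collinear gene pair as an edge, and put any two genes connected through a chain of edges into the same cluster. The sketch below does that with a small union-find. It is an illustration, not the paper's or jcvi's actual code; the function names are hypothetical, and the parser assumes each data line holds two gene names followed by a score, with lines starting with `#` separating blocks.

```python
from collections import defaultdict


def read_anchor_pairs(lines):
    """Yield (gene_a, gene_b) from anchor-style lines.

    Assumption: whitespace-separated columns, gene names in the first
    two columns, and lines starting with "#" are block separators.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        yield fields[0], fields[1]


def single_linkage_clusters(pairs):
    """Group genes that are connected through any chain of pairs."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for gene in parent:
        clusters[find(gene)].add(gene)
    return sorted(sorted(members) for members in clusters.values())


# Toy anchors pooled from the A-B, A-C and B-C comparisons.
toy_lines = [
    "###",
    "a1\tb1\t100",
    "a2\tb2\t100",
    "###",
    "a1\tc1\t100",
    "###",
    "b2\tc2\t100",
    "b3\tc3\t100",
]
clusters = single_linkage_clusters(read_anchor_pairs(toy_lines))
for members in clusters:
    print(members)
```

Note that a2 and c2 land in the same cluster even though no A-C pair links them directly; they are joined through b2. That transitivity is what makes the result independent of the choice of reference, but it also means one spurious pair can merge two otherwise unrelated clusters.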

Zhuxitong commented 3 years ago

Hi haibao,

Thank you very much for your timely reply. I may not implement the single-linkage clustering method, as it is still difficult for me. The approach you mentioned, adding B-C to .blocks manually, will work, I think. But for my 8 species, this may take more time. So I am wondering whether I could select a different reference each time (like A in A-B and A-C, then B in B-A and B-C) and join the anchors for each reference. Finally, I would merge all the anchors, sort them, and remove redundancy like so:

For reference A

a1   b1   c1
a2   b2   c2

For reference B

b1   a1   c1
b2   a2   c2
b3   c3   .

For reference C

c1   a1   b1
c2   a2   b2
c3   b3   .
c4   .    .

Then merge them

a1   b1   c1
a2   b2   c2
b1   a1   c1
b2   a2   c2
b3   c3   .
c1   a1   b1
c2   a2   b2
c3   b3   .
c4   .    .

Remove the redundant rows (those that repeat the genes of an earlier row in a different column order)

a1   b1   c1
a2   b2   c2
b3   c3   .
c4   .    .
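The merge-and-deduplicate step can be sketched in a few lines of Python. This is only an illustration of the idea, with a hypothetical function name `merge_rows` and the per-reference tables held as in-memory lists; it treats two rows as redundant when they contain exactly the same set of genes, ignoring column order and "." placeholders.

```python
def merge_rows(*tables):
    """Concatenate per-reference tables, keeping the first row seen
    for each distinct set of genes ("." placeholders are ignored)."""
    seen = set()
    merged = []
    for table in tables:
        for row in table:
            key = frozenset(gene for gene in row if gene != ".")
            if key and key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


ref_a = [["a1", "b1", "c1"], ["a2", "b2", "c2"]]
ref_b = [["b1", "a1", "c1"], ["b2", "a2", "c2"], ["b3", "c3", "."]]
ref_c = [["c1", "a1", "b1"], ["c2", "a2", "b2"], ["c3", "b3", "."], ["c4", ".", "."]]

merged = merge_rows(ref_a, ref_b, ref_c)
for row in merged:
    print("\t".join(row))
```

One caveat under this exact-match rule: a row whose genes are only a subset of another row (for example b1 c1 . next to a1 b1 c1) is kept as a separate row, and the surviving rows keep the column order of whichever reference produced them first.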

I guess this way I could get all collinear orthologous genes, but at the cost of losing the block information. It also seems this won't affect the plotting of synteny.

If there are any errors, I would appreciate it if you could point them out.