tanghaibao / jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics
BSD 2-Clause "Simplified" License
741 stars 187 forks source link

Why the comparison result of the same genome is less linear than other genomes? #239

Closed Xuelei-Dai closed 4 years ago

Xuelei-Dai commented 4 years ago

Hello, Why A3A align yourself on Inter-genomic comparison synteny is 5418 gene pairs, however, A3A align another genome (V.vinifera) is 10307 gene pairs? 1 2

Xuelei-Dai commented 4 years ago

The logging information as follows: A3A align A3A: 20:18:44 [base] lastdb A3A_1 A3A_1.cds 20:18:53 [base] lastal -u 0 -P 32 -i3G -f BlastTab A3A_1 A3A.cds >A3A.A3A_1.last 20:18:58 [synteny] Assuming --qbed=A3A.bed --sbed=A3A_1.bed 20:18:58 [base] Load fileA3A.bed 20:18:58 [base] Load fileA3A_1.bed 20:18:59 [blastfilter] Load BLAST fileA3A.A3A_1.last(total 200330 lines) 20:18:59 [base] Load fileA3A.A3A_1.last 20:19:01 [blastfilter] running the cscore filter (cscore>=0.99) .. 20:19:01 [blastfilter] after filter (116895->11118) .. 20:19:01 [blastfilter] running the local dups filter (tandem_Nmax=10) .. 20:19:01 [blastfilter] after filter (11118->11078) .. 20:19:01 [synteny] Assuming --qbed=A3A.bed --sbed=A3A_1.bed 20:19:01 [base] Load fileA3A.bed 20:19:02 [base] Load fileA3A_1.bed 20:19:02 [base] Load fileA3A.A3A_1.last.filtered 20:19:02 [synteny] A total of 11078 BLAST imported fromA3A.A3A_1.last.filtered. 20:19:02 [synteny] Chaining distance = 20 20:19:02 [base] Load fileA3A.A3A_1.anchors A total of 5418 (NR:5418) anchors found in 613 clusters. Stats: Min=4 Max=50 N=613 Mean=8.83849918434 SD=6.39165078091 Median=6.0 Sum=5418 NR stats: Min=4 Max=50 N=613 Mean=8.83849918434 SD=6.39165078091 Median=6.0 Sum=5418 20:19:02 [base] Load fileA3A.bed 20:19:03 [base] Load fileA3A_1.bed 20:19:03 [base] Load fileA3A.A3A_1.last 20:19:06 [synteny] A total of 139992 BLAST imported fromA3A.A3A_1.last. 20:19:06 [base] Load fileA3A.A3A_1.anchors 20:19:06 [synteny] A total of 5418 anchors imported. 20:19:07 [synteny] 16935 new pairs found. 20:19:07 [synteny] Removed 0 existing anchors. 20:19:07 [synteny] Corrected scores for 8 anchors. 20:19:07 [synteny] Anchors written toA3A.A3A_1.lifted.anchors. 20:19:07 [base] Load fileA3A.A3A_1.lifted.anchors A total of 22353 (NR:11660) anchors found in 613 clusters. Stats: Min=4 Max=271 N=613 Mean=36.4649265905 SD=42.1239237556 Median=22.0 Sum=22353 NR stats: Min=4 Max=113 N=613 Mean=19.0212071778 SD=14.878119537 Median=15.0 Sum=11660 20:19:08 [synteny] Assuming --qbed=A3A.bed --sbed=A3A_1.bed 20:19:08 [base] Load fileA3A.bed 20:19:08 [base] Load fileA3A_1.bed 20:19:09 [dotplot] xsize=23097 ysize=23097

A3A align Vvinifera: 17:04:22 [base] lastdb Vvinifera Vvinifera.cds 17:04:33 [base] lastal -u 0 -P 32 -i3G -f BlastTab Vvinifera A3A.cds >A3A.Vvinifera.last 17:04:38 [synteny] Assuming --qbed=A3A.bed --sbed=Vvinifera.bed 17:04:38 [base] Load fileA3A.bed 17:04:38 [base] Load fileVvinifera.bed 17:04:39 [blastfilter] Load BLAST fileA3A.Vvinifera.last(total 214735 lines) 17:04:39 [base] Load fileA3A.Vvinifera.last 17:04:41 [blastfilter] running the cscore filter (cscore>=0.99) .. 17:04:42 [blastfilter] after filter (148593->13005) .. 17:04:42 [blastfilter] running the local dups filter (tandem_Nmax=10) .. 17:04:42 [blastfilter] after filter (13005->12826) .. 17:04:42 [synteny] Assuming --qbed=A3A.bed --sbed=Vvinifera.bed 17:04:42 [base] Load fileA3A.bed 17:04:42 [base] Load fileVvinifera.bed 17:04:43 [base] Load fileA3A.Vvinifera.last.filtered 17:04:43 [synteny] A total of 12826 BLAST imported fromA3A.Vvinifera.last.filtered. 17:04:43 [synteny] Chaining distance = 20 17:04:43 [base] Load fileA3A.Vvinifera.anchors A total of 10307 (NR:10266) anchors found in 406 clusters. Stats: Min=4 Max=243 N=406 Mean=25.3866995074 SD=34.7148939579 Median=12.0 Sum=10307 NR stats: Min=4 Max=243 N=406 Mean=25.2857142857 SD=34.6658524057 Median=12.0 Sum=10266 17:04:43 [base] Load fileA3A.bed 17:04:43 [base] Load fileVvinifera.bed 17:04:44 [base] Load fileA3A.Vvinifera.last 17:04:46 [synteny] A total of 148593 BLAST imported fromA3A.Vvinifera.last. 17:04:47 [base] Load fileA3A.Vvinifera.anchors 17:04:47 [synteny] A total of 10307 anchors imported. 17:04:47 [synteny] 9408 new pairs found. 17:04:47 [synteny] Removed 0 existing anchors. 17:04:47 [synteny] Corrected scores for 37 anchors. 17:04:47 [synteny] Anchors written toA3A.Vvinifera.lifted.anchors. 17:04:47 [base] Load fileA3A.Vvinifera.lifted.anchors A total of 19715 (NR:13852) anchors found in 406 clusters. Stats: Min=4 Max=470 N=406 Mean=48.5591133005 SD=68.1665246074 Median=23.0 Sum=19715 NR stats: Min=4 Max=314 N=406 Mean=34.118226601 SD=47.1874034791 Median=16.5 Sum=13852 17:05:04 [synteny] Assuming --qbed=A3A.bed --sbed=Vvinifera.bed 17:05:04 [base] Load fileA3A.bed 17:05:04 [base] Load fileVvinifera.bed 17:05:05 [dotplot] Showing a random subset of 10000 data points (total 10307) for clarity. 17:05:06 [dotplot] xsize=23097 ysize=26346

tanghaibao commented 4 years ago

@fantasticair

First of all, when aligning a genome against itself, I suggest that you can just feed the same genome name (in your case just use A3A, and no need to create a fake A3A_1). Once you use the same genome name, MCscan will use a specialized pipeline to look for WGD blocks. For example:

$ python -m jcvi.compara.catalog ortholog grape grape

image

This is grape-grape self-alignment. Notice that there are not a lot of pairs at ~1800 pairs, but most of these pairs were derived from the genome duplications (aka the gamma event in eudicots). You also notice that most of the diagonal pairs are removed resulted from the MCscan pipeline.

I understand that you hope to see lots of gene pairs on the diagonal. However, the important thing here to remember is that the use case for genome vs. itself is often looking for past WGD events (the blocks off-diagonal in the dot plot). Many of the self hits will get filtered since they would often end in tandem arrays, or low c-scores and get removed by blastfilter module in MCscan.

This is a design choice, since self-hits are just not very informative when you run the MCscan workflow. If you have a use case other than looking for WGD events, then this behavior might not work for you.

Haibao

Xuelei-Dai commented 4 years ago

Thank you very much for your prompt reply.