tanghaibao / jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics
BSD 2-Clause "Simplified" License

Synteny depth pattern and blocks #449

Open cdanmaigona opened 2 years ago

cdanmaigona commented 2 years ago

Hello Haibao,

A follow-up question on this. When I run python -m jcvi.compara.synteny depth --histogram F1.F4.anchors --depthfile=F1.F4.depth, I get this:

Genome F1 depths:
Depth 0: 704 of 16,800 (4.2%)
Depth 1: 15,612 of 16,800 (92.9%)
Depth 2: 173 of 16,800 (1.0%)
Depth 3: 138 of 16,800 (0.8%)
Depth 4: 106 of 16,800 (0.6%)
Depth 5: 57 of 16,800 (0.3%)
Depth 6: 10 of 16,800 (0.1%)

Genome F4 depths:
Depth 0: 2,998 of 19,588 (15.3%)
Depth 1: 16,130 of 19,588 (82.3%)
Depth 2: 340 of 19,588 (1.7%)
Depth 3: 120 of 19,588 (0.6%)

[08:41:07 PM] DEBUG Depth written to F1.F4. synteny.py:1773
F1 vs F4 syntenic depths 1:1 pattern

From the explanation on your wiki, I take this to mean there are up to 6 F4 blocks per F1 gene.

However, when I run python -m jcvi.compara.synteny stats F1.F4.i6.blocks to get statistics on my blocks and the actual duplicate genes, the numbers do not match:

Count 0: 1,450 of 16,800 (8.6%)
Count 1: 15,052 of 16,800 (89.6%)
Count 2: 87 of 16,800 (0.5%)
Count 3: 83 of 16,800 (0.5%)
Count 4: 48 of 16,800 (0.3%)
Count 5: 80 of 16,800 (0.5%)

Total lines with matches: 15,350 of 16,800 (91.4%)
Count 1: 15,052 of 15,350 (98.1%)
Count 2: 87 of 15,350 (0.6%)
Count 3: 83 of 15,350 (0.5%)
Count 4: 48 of 15,350 (0.3%)
Count 5: 80 of 15,350 (0.5%)

The numbers do not correspond to what I'm getting with the depth command. I can only see a maximum of 5 duplicates when the depth analysis shows up to 6. Please help me understand what I'm missing.

Thank you!!

Originally posted by @cdanmaigona in https://github.com/tanghaibao/jcvi/issues/235#issuecomment-1060138221
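
One way to check the stats figures independently is to tally the blocks file by hand. The following is a minimal sketch, not part of jcvi: it assumes the .blocks file is tab-separated, with the F1 gene in the first column and a "." in any column where no F4 gene was placed. The function name count_matches_per_row is hypothetical. It only reproduces a per-row count of filled columns and does not by itself explain the difference from the depth output.

```python
from collections import Counter


def count_matches_per_row(lines, placeholder="."):
    """Tally how many non-empty match columns each row of a blocks file has.

    Assumed layout (not confirmed by the thread): tab-separated rows,
    reference gene in column 1, placeholder "." for an empty slot.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        # Every column after the first is one candidate syntenic region.
        matches = [f for f in fields[1:] if f != placeholder]
        counts[len(matches)] += 1
    return counts
```

To use it on the file from this thread, something like `with open("F1.F4.i6.blocks") as fh: print(sorted(count_matches_per_row(fh).items()))` should print one (count, rows) pair per observed count.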

cdanmaigona commented 2 years ago

Hello Haibao

If I'm interested in extracting all possible duplicate genes in a comparison, would this command be the most appropriate?

python -m jcvi.compara.synteny mcscan F1.bed F1.F4.lifted.anchors --iter=6 -o F1.F4.i6.blocks

tanghaibao commented 2 years ago

@cdanmaigona

This is possibly the easiest thing you can do. However, the caveat is that you'll miss some duplicate genes that are present in multiple copies only in F4 and not in F1.

Haibao
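
Because the mcscan command above is anchored on F1.bed, it only reports rows for F1 genes. A symmetric view can be sketched directly from the anchors file. The code below is a rough illustration, not a jcvi function: it assumes the anchors file has one gene pair per line (first-genome gene, second-genome gene, score) with lines starting with "#" separating blocks, and the names partners_per_gene and multi_copy are hypothetical. Genes that have no anchor at all in the other genome never appear in this file, so copies that exist only within one genome would still need a separate within-genome comparison.

```python
from collections import defaultdict


def partners_per_gene(lines):
    """Collect, for each gene on either side, its set of anchored partners.

    Assumed layout: "geneA<ws>geneB<ws>score" per line, with "#"-prefixed
    lines acting as block separators.
    """
    a_partners = defaultdict(set)  # first-genome gene -> second-genome genes
    b_partners = defaultdict(set)  # second-genome gene -> first-genome genes
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # block separator or blank line
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line, ignore
        a, b = fields[0], fields[1]
        a_partners[a].add(b)
        b_partners[b].add(a)
    return a_partners, b_partners


def multi_copy(partners, min_copies=2):
    """Keep only genes anchored to at least min_copies distinct partners."""
    return {g: sorted(p) for g, p in partners.items() if len(p) >= min_copies}
```

Reading F1.F4.lifted.anchors with `open(...)` and passing the file handle to partners_per_gene, then calling multi_copy on each returned mapping, would list F1 genes with several F4 partners and F4 genes with several F1 partners.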

cdanmaigona commented 2 years ago

Thanks Haibao,

That makes sense, but I am also interested in accounting for the duplicates that are present only within F4 or only within F1. How can I extract those?

Catherine