tanghaibao / jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics
BSD 2-Clause "Simplified" License
719 stars 185 forks source link

low number of ortholog pairs reported #60

Closed MichelMoser closed 2 weeks ago

MichelMoser commented 6 years ago

Dear Haibao, I try to compute dotplots with new assemblies anchored by allmaps of two closely related species. Both assembly annotation show about 35'000 gene models. But when calculating orthology and dotplots, only 20'000 gene pairs are reported. Is this because of options like --dist or because no orthologs are found between the other 15'000 genes (which would be a bad sign for my annotations). I expect to have at last 30'000 orthologs if not 34'000 between the species with very high cscore. How can i force all the orthologs to be plotted?

Thank you, Michel

tanghaibao commented 6 years ago

@MichelMoser this is natural. --dist removes tandem dups. --cscore selects only those that are reciprocally good matching (as a result, gene families with recent expansion are less likely to get selected). So both would have an impact on the number of ortholog pairs. Another factor may be that genes are in small contigs and not in big synteny blocks.

Look for .lifted.anchors which will have more orthologs but this is less stringent set.

Reporting all ortholog pairs is not trivial. People usually go with reciprocal best (which is essentially --cscore=1) but end up with few pairs.

lvqiang0120 commented 9 months ago

Dear Haibao, "I have assembled genomes for a single arthropod species from both male and female individuals using HiFi sequencing data. The male genome was scaffolded against the female reference using RagTag. Both genomes had similar assembly quality and length, and were annotated using the same pipeline resulting in similar gene numbers (33,532 for female and 34,540 for male). However, when performing collinearity analysis using jcvi with default parameters, only 14,075 genes showed collinearity. I tried adjusting the --cscore and --dist parameters but the number of collinear genes did not change significantly. I don't understand why the number of collinear genes is so low. Ideally, the number of collinear genes between two genomes of the same species and different sexes should be more than 30,000. Where should I start looking for the possible reasons?" Thank you, Qiang Lyu

tanghaibao commented 9 months ago

@lvqiang0120

Can you share the dot plot? Also, did you also take a look at the pairs in the .lifted.anchors?

Haibao

lvqiang0120 commented 9 months ago

I would send you dot plot through Email. And the pairs in the .lifted.anchors as shown:

BHaeL_F.chr1G000225 BHaeL_M.chr1G000040 1450 BHaeL_F.chr1G000226 BHaeL_M.chr1G000041 180 BHaeL_F.chr1G000227 BHaeL_M.chr1G000043 1460 BHaeL_F.chr1G000228 BHaeL_M.chr1G000046 517 BHaeL_F.chr1G000229 BHaeL_M.chr1G000047 397 BHaeL_F.chr1G000231 BHaeL_M.chr1G000053 141 BHaeL_F.chr1G000232 BHaeL_M.chr1G000054 612 BHaeL_F.chr1G000236 BHaeL_M.chr1G000056 2510 BHaeL_F.chr1G000238 BHaeL_M.chr1G000058 256 BHaeL_F.chr1G000243 BHaeL_M.chr1G000060 372 BHaeL_F.chr1G000255 BHaeL_M.chr1G000065 574 BHaeL_F.chr1G000256 BHaeL_M.chr1G000066 1240 BHaeL_F.chr1G000257 BHaeL_M.chr1G000067 2090 BHaeL_F.chr1G000258 BHaeL_M.chr1G000069 247 ....... In this result, what should I pay attention to? And, by using the command "grep -v "#" BHaeL_F.BHaeL_M.lifted.anchors | wc -l" for counting, the result is 37,113.

lvqiang0120 commented 9 months ago

Dear Professor Tang, The dot plot shown as followed: Thanks, Qiang lyu.

At 2023-10-10 13:51:32, "Haibao Tang" @.***> wrote:

@lvqiang0120

Can you share the dot plot? Also, did you also take a look at the pairs in the .lifted.anchors?

Haibao

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>