Differences between ortholog mapping strategies

bschilder commented 2 years ago

Hello,

Thanks for developing this great package! I'm the maintainer of orthogene, which aims to convert orthologs across a variety of data types (lists, expression matrices, data.frames, etc.) babelgene is one of the core methods that orthogene relies on, the others being homologene and gprofiler2.

There were a couple things I was hoping you could help answer.

Question 1

I noticed some discrepancies between the number of orthologs that are produced when using babelgene in slightly different ways. This of course is expected when tweaking different parameters in babelgene::orthologs but I find that if I translate genes using the entire babelgene database (babelgene:::orthologs_df) to run my own ortholog conversion, I retrieve far more orthologs (>30,000) than even the most lenient settings with babelgene::orthologs (>15,900). Do you know why this might be?

Note: I default to my whole-database strategy bc one of orthogene's features is to let users to specify how exactly they want to handle non-1:1 orthologs. You can see the point in the code where I do this here.

Question 2

I noticed that there's less orthologs mapped when returning genes back to mouse from human (>29k vs. >30k). But without any filtering steps, the number of genes should remain the same when translating back to mouse, because non-orthologs were already dropped in the previous mouse --> human mapping. Have you observed this as well?

Reprex

install.packages("testthat")
BiocManager::install("orthogene")

data("exp_mouse")

gene_map_b1 <- babelgene::orthologs( 
        genes = rownames(exp_mouse),
        species = "mouse",
        human = FALSE
    )  
    testthat::expect_gte(nrow(gene_map_b1), 13100)

    gene_map_b2 <- babelgene::orthologs( 
        genes = rownames(exp_mouse),
        species = "mouse",
        human = FALSE, 
        min_support = 1
    )
    testthat::expect_gte(nrow(gene_map_b2), 13300)

    gene_map_b3 <- babelgene::orthologs( 
        genes = rownames(exp_mouse),
        species = "mouse",
        human = FALSE, 
        min_support = 1, 
        top = FALSE
    )  
    testthat::expect_gte(nrow(gene_map_b3), 15900)

    #### mouse ==> human ####
    gene_map1 <- orthogene:::map_orthologs_babelgene(
        genes = rownames(exp_mouse),
        input_species = "mouse",
        output_species = "human"
    )
    testthat::expect_gte(nrow(gene_map1), 30000)

    #### human ==> mouse ####
    gene_map2 <- orthogene:::map_orthologs_babelgene(
        genes = gene_map1$ortholog_gene,
        input_species = "human",
        output_species = "mouse"
    )
    testthat::expect_gte(nrow(gene_map2), 29000)

Thanks in advance for your help. I hope to work together so we can improve both tools!

Best, Brian

Session info

``` R version 4.1.3 (2022-03-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.3.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] orthogene_1.1.4 GenomicRanges_1.46.1 GenomeInfoDb_1.30.1 IRanges_2.28.0 [5] S4Vectors_0.32.4 BiocGenerics_0.40.0 Matrix_1.4-0 pkgbuild_1.3.1 [9] testthat_3.1.3 loaded via a namespace (and not attached): [1] colorspace_2.0-3 ggtree_3.2.1 grr_0.9.5 [4] ggsignif_0.6.3 gitcreds_0.1.1 ellipsis_0.3.2 [7] rprojroot_2.0.3 XVector_0.34.0 fs_1.5.2 [10] aplot_0.1.3 rstudioapi_0.13 httpcode_0.3.0 [13] roxygen2_7.1.2 ggpubr_0.4.0 waldo_0.4.0 [16] remotes_2.4.2 gh_1.3.0 lubridate_1.8.0 [19] fansi_1.0.3 xml2_1.3.3 R.methodsS3_1.8.1 [22] sparseMatrixStats_1.6.0 cachem_1.0.6 knitr_1.38 [25] pkgload_1.2.4 jsonlite_1.8.0 broom_0.7.12 [28] gridBase_0.4-7 png_0.1-7 R.oo_1.24.0 [31] compiler_4.1.3 httr_1.4.2 backports_1.4.1 [34] assertthat_0.2.1 fastmap_1.1.0 lazyeval_0.2.2 [37] cli_3.2.0 htmltools_0.5.2 prettyunits_1.1.1 [40] tools_4.1.3 gtable_0.3.0 glue_1.6.2 [43] GenomeInfoDbData_1.2.7 dplyr_1.0.8 Rcpp_1.0.8.3 [46] carData_3.0-5 xopen_1.0.0 vctrs_0.4.0 [49] crul_1.2.0 babelgene_22.3 ape_5.6-2 [52] nlme_3.1-155 DelayedMatrixStats_1.16.0 repmis_0.5 [55] xfun_0.30 stringr_1.4.0 ps_1.6.0 [58] brio_1.1.3 rphylopic_0.3.3 lifecycle_1.0.1 [61] devtools_2.4.3 rstatix_0.7.0 zlibbioc_1.40.0 [64] scales_1.1.1 clisymbols_1.2.0 MatrixGenerics_1.6.0 [67] parallel_4.1.3 gprofiler2_0.2.1 yaml_2.3.5 [70] curl_4.3.2 memoise_2.0.1 ggplot2_3.3.5 [73] ggfun_0.0.6 rcmdcheck_1.4.0 Matrix.utils_0.9.8 [76] yulab.utils_0.0.4 stringi_1.7.6 desc_1.4.1 [79] tidytree_0.3.9 rlang_1.0.2 pkgconfig_2.0.3 [82] bitops_1.0-7 matrixStats_0.61.0 evaluate_0.15 [85] lattice_0.20-45 purrr_0.3.4 treeio_1.18.1 [88] patchwork_1.1.1 htmlwidgets_1.5.4 processx_3.5.3 [91] tidyselect_1.1.2 plyr_1.8.7 magrittr_2.0.3 [94] R6_2.5.1 piggyback_0.1.1 generics_0.1.2 [97] DelayedArray_0.20.0 DBI_1.1.2 pillar_1.7.0 [100] withr_2.5.0 abind_1.4-5 RCurl_1.98-1.6 [103] tibble_3.1.6 homologene_1.4.68.19.3.27 crayon_1.5.1 [106] car_3.0-12 utf8_1.2.2 plotly_4.10.0 [109] rmarkdown_2.13 usethis_2.1.5 grid_4.1.3 [112] data.table_1.14.2 callr_3.7.0 digest_0.6.29 [115] R.cache_0.15.0 tidyr_1.2.0 gridGraphics_0.5-1 [118] R.utils_2.11.0 munsell_0.5.0 viridisLite_0.4.0 [121] ggplotify_0.1.0 sessioninfo_1.2.2 ```

bschilder commented 2 years ago

I should also mention, orthogene:::map_orthologs_babelgene uses the babelgene:::orthologs_df from an older version of babelgene which I had stored in GitHub and would download each time with piggyback: https://github.com/neurogenomics/orthogene/blob/main/R/all_genes_babelgene.R

When I try running the last example above using the updated babelgene:::orthologs_df (from v22.3) gene_map2 only contains 252 orthologs mappings. Have there been some major changes to babelgene:::orthologs_df?

igordot commented 2 years ago

Thanks for including babelgene in your meta-database. As you are well aware of, ortholog mapping is a complex issue. I am not sure there exists a great answer to your questions, but I will offer my perspective.

Regarding the total numbers, I believe the discrepancy is due to internal filtering. I filter out any mappings found in only one database. That reduces the total numbers by about half across all species. I haven't checked mouse specifically, but I assume it would be similar.

Regarding converting back and forth between human and mouse, the discrepancy might be due to genes that aren't 1:1. For example, without any filtering, A2m converts to A2M and PZP, but then using A2M and PZP returns A2m, Gm7298, Mug1, Mug2.

Regarding the full orthologs data frame, I am not sure I understand exactly what the problem is. There should not be many discrepancies between versions. I have some unit tests for total numbers. They are not very precise since it's not clear what the correct numbers should be, but I did not see big differences.

bschilder commented 2 years ago

Thanks so much for the insight @igordot

Regarding the total numbers, I believe the discrepancy is due to internal filtering. I filter out any mappings found in only one database. That reduces the total numbers by about half across all species. I haven't checked mouse specifically, but I assume it would be similar.

So just to confirm, even when babelgene::orthologs(min_support = 1) the minimum support threshold is actually 2?

Regarding converting back and forth between human and mouse, the discrepancy might be due to genes that aren't 1:1. For example, without any filtering, A2m converts to A2M and PZP, but then using A2M and PZP returns A2m, Gm7298, Mug1, Mug2.

I'd definitely expect there to be some loss when converting back and forth across species using babelgene::orthologs, but not with orthogene:::map_orthologs_babelgene because that doesn't do any filtering (it just returns the full database of gene mappings that overlap with any of the input genes). That said, I see now that in my code that when the target species (the one we're converting to) is not human, I switch from the non-filtered database method to using babelgene::orthologs which, given that min_support is actually always 2 according to above, explains the discrepancy.

Regarding the full orthologs data frame, I am not sure I understand exactly what the problem is. There should not be many discrepancies between versions. I have some unit tests for total numbers. They are not very precise since it's not clear what the correct numbers should be, but I did not see big differences.

Thanks for confirming this. I had wondered if perhaps babelgene was using other additional data.frames for these mappings (e.g. babelgene:::mgi_orthologs_df) that I wasn't using, that might account for these differences. But it seems like that's not the case. I'll take another look and see if I can spot the source of the discrepancies.

bschilder commented 2 years ago

I should also mention, orthogene:::map_orthologs_babelgene uses the babelgene:::orthologs_df from an older version of babelgene which I had stored in GitHub and would download each time with piggyback: https://github.com/neurogenomics/orthogene/blob/main/R/all_genes_babelgene.R

When I try running the last example above using the updated babelgene:::orthologs_df (from v22.3) gene_map2 only contains 252 orthologs mappings. Have there been some major changes to babelgene:::orthologs_df?

I believe I found the source of this particular discrepancy. This was a bug within orthogene when renaming certain columns. I've fixed and simplified this part of the code and it seems to be working as expected now, with both the old and the new babelgene:::mgi_orthologs_df. You can see my unit tests here:

https://github.com/neurogenomics/orthogene/blob/main/tests/testthat/test-map_orthologs_babelgene.R

igordot commented 2 years ago

even when babelgene::orthologs(min_support = 1) the minimum support threshold is actually 2?

Yes. It's a purely mathematical definition, so 1, 2, 0, -99 would all return everything. I should probably clarify the documentation.

perhaps babelgene was using other additional data.frames for these mappings (e.g. babelgene:::mgi_orthologs_df)

No, mgi_orthologs_df is just for testing. It's there as an independent (or at least independently curated) dataset of human-mouse mappings and human/mouse gene symbols in general.

It's great to see you have such extensive testing with orthogene. It's hard to estimate what the margin of error should be for a constantly-evolving dataset, so any independent testing is helpful. I noticed some errors when doing reverse dependency checks with the Bioconductor version, but it looked like those were already fixed on GitHub.

bschilder commented 2 years ago

even when babelgene::orthologs(min_support = 1) the minimum support threshold is actually 2?

Yes. It's a purely mathematical definition, so 1, 2, 0, -99 would all return everything. I should probably clarify the documentation.

That would be very helpful, thank you!

perhaps babelgene was using other additional data.frames for these mappings (e.g. babelgene:::mgi_orthologs_df)

No, mgi_orthologs_df is just for testing. It's there as an independent (or at least independently curated) dataset of human-mouse mappings and human/mouse gene symbols in general.

Cool, thanks for confirming.

It's great to see you have such extensive testing with orthogene. It's hard to estimate what the margin of error should be for a constantly-evolving dataset, so any independent testing is helpful. I noticed some errors when doing reverse dependency checks with the Bioconductor version, but it looked like those were already fixed on GitHub.

Thanks! Unit tests have been a real game changer for my packages since i started using them last year.

Yeah, there were a couple of lingering issues that I pushed fixes to yesterday and today. There's a several-day delay for Bioc to rebuild the package, so I'm crossing my fingers that get is done (and passes) in time for the next Bioc release (3.15) on April 11th, at which time Bioc 3.14 will be frozen (forever!).

I believe I found the source of this particular discrepancy. This was a bug within orthogene when renaming certain columns. I've fixed and simplified this part of the code and it seems to be working as expected now, with both the old and the new babelgene:::mgi_orthologs_df

Now that this has been resolved, I've also moved away from retrieving the stored old version of orthologs_df via piggyback (which had its own set of issues described here) and now simply import the up-to-date database via get():

https://github.com/neurogenomics/orthogene/blob/8354ef81f4fe68acac346fb0ab349e41204154a7/R/all_genes_babelgene.R#L40

Thanks again for your help, Brian

igordot commented 2 years ago

I am closing the issue for now. Hopefully any future updates of both packages will go more smoothly. Let me know if you notice anything problematic.

bschilder commented 2 years ago

Sounds good! All unit tests in orthogene seem to be passing all Bioconductor checks now using the latest version of babelgene (both on Bioc 3.14 and 3.15). But I'll be sure to keep you posted if anything comes up!

Thanks again, Brian

igordot / babelgene