Closed bschilder closed 2 years ago
I should also mention, orthogene:::map_orthologs_babelgene uses the babelgene:::orthologs_df from an older version of babelgene, which I had stored in GitHub and would download each time with piggyback: https://github.com/neurogenomics/orthogene/blob/main/R/all_genes_babelgene.R
When I try running the last example above using the updated babelgene:::orthologs_df (from v22.3), gene_map2 only contains 252 ortholog mappings. Have there been some major changes to babelgene:::orthologs_df?
Thanks for including babelgene in your meta-database. As you are well aware, ortholog mapping is a complex issue. I am not sure there exists a great answer to your questions, but I will offer my perspective.
Regarding the total numbers, I believe the discrepancy is due to internal filtering. I filter out any mappings found in only one database. That reduces the total numbers by about half across all species. I haven't checked mouse specifically, but I assume it would be similar.
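As a rough illustration of the filtering described above, the sketch below counts how many source databases report each human-mouse pair and drops pairs reported by only one. It is a toy in base R, not babelgene's internal code: the database names (db_a, db_b, db_c) and which database reports which pair are invented for the example.

```r
# Toy long-format table: one row per (human, mouse, source database) record.
# Database names and memberships are hypothetical, chosen only for illustration.
records <- data.frame(
  human  = c("A2M", "A2M", "A2M", "PZP", "PZP", "A2M"),
  mouse  = c("A2m", "A2m", "A2m", "A2m", "A2m", "Mug1"),
  source = c("db_a", "db_b", "db_c", "db_a", "db_b", "db_a"),
  stringsAsFactors = FALSE
)

# Build one key per human-mouse pair and count the distinct databases behind it.
pair_id   <- paste(records$human, records$mouse, sep = "|")
support_n <- tapply(records$source, pair_id, function(x) length(unique(x)))

# Keep only pairs supported by at least two databases.
kept <- support_n[support_n >= 2]
print(kept)
```

In this toy table the A2M|Mug1 pair is dropped because only one database reports it, which is the kind of reduction that would make a filtered result much smaller than the raw table.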
Regarding converting back and forth between human and mouse, the discrepancy might be due to genes that aren't 1:1. For example, without any filtering, A2m converts to A2M and PZP, but then using A2M and PZP returns A2m, Gm7298, Mug1, Mug2.
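The A2m example can be reproduced with a small toy table to show why a round trip need not return the same number of genes. This is only a sketch: the gene symbols come from the example above, but which of Gm7298, Mug1 and Mug2 pairs with A2M versus PZP is an assumption made for illustration, and the helper names to_human and to_mouse are hypothetical.

```r
# Unfiltered toy mapping built from the symbols quoted above.
# The assignment of Gm7298, Mug1 and Mug2 to A2M or PZP is illustrative only.
map <- data.frame(
  mouse = c("A2m", "A2m", "Gm7298", "Mug1", "Mug2"),
  human = c("A2M", "PZP", "A2M",    "PZP",  "PZP"),
  stringsAsFactors = FALSE
)

# Hypothetical helpers: return every partner of the input genes, no filtering.
to_human <- function(genes) unique(map$human[map$mouse %in% genes])
to_mouse <- function(genes) unique(map$mouse[map$human %in% genes])

human_hits <- to_human("A2m")        # mouse -> human
mouse_back <- to_mouse(human_hits)   # human -> mouse

print(human_hits)
print(mouse_back)
```

One mouse gene goes in and four come back, so gene counts before and after a round trip are only comparable once a rule for handling non-1:1 mappings has been fixed.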
Regarding the full orthologs data frame, I am not sure I understand exactly what the problem is. There should not be many discrepancies between versions. I have some unit tests for total numbers. They are not very precise since it's not clear what the correct numbers should be, but I did not see big differences.
Thanks so much for the insight @igordot
> Regarding the total numbers, I believe the discrepancy is due to internal filtering. I filter out any mappings found in only one database. That reduces the total numbers by about half across all species. I haven't checked mouse specifically, but I assume it would be similar.
So just to confirm, even when babelgene::orthologs(min_support = 1) the minimum support threshold is actually 2?
> Regarding converting back and forth between human and mouse, the discrepancy might be due to genes that aren't 1:1. For example, without any filtering, A2m converts to A2M and PZP, but then using A2M and PZP returns A2m, Gm7298, Mug1, Mug2.
I'd definitely expect there to be some loss when converting back and forth across species using babelgene::orthologs, but not with orthogene:::map_orthologs_babelgene, because that doesn't do any filtering (it just returns the full database of gene mappings that overlap with any of the input genes). That said, I see now that in my code, when the target species (the one we're converting to) is not human, I switch from the non-filtered database method to using babelgene::orthologs, which, given that min_support is actually always at least 2 according to the above, explains the discrepancy.
> Regarding the full orthologs data frame, I am not sure I understand exactly what the problem is. There should not be many discrepancies between versions. I have some unit tests for total numbers. They are not very precise since it's not clear what the correct numbers should be, but I did not see big differences.
Thanks for confirming this. I had wondered if perhaps babelgene was using other additional data.frames for these mappings (e.g. babelgene:::mgi_orthologs_df) that I wasn't using, that might account for these differences. But it seems like that's not the case. I'll take another look and see if I can spot the source of the discrepancies.
> I should also mention, orthogene:::map_orthologs_babelgene uses the babelgene:::orthologs_df from an older version of babelgene which I had stored in GitHub and would download each time with piggyback: https://github.com/neurogenomics/orthogene/blob/main/R/all_genes_babelgene.R
> When I try running the last example above using the updated babelgene:::orthologs_df (from v22.3) gene_map2 only contains 252 orthologs mappings. Have there been some major changes to babelgene:::orthologs_df?
I believe I found the source of this particular discrepancy. This was a bug within orthogene when renaming certain columns. I've fixed and simplified this part of the code and it seems to be working as expected now, with both the old and the new babelgene:::mgi_orthologs_df. You can see my unit tests here:
https://github.com/neurogenomics/orthogene/blob/main/tests/testthat/test-map_orthologs_babelgene.R
> even when babelgene::orthologs(min_support = 1) the minimum support threshold is actually 2?
Yes. It's a purely mathematical definition, so 1, 2, 0, -99 would all return everything. I should probably clarify the documentation.
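A minimal sketch of that point, assuming the stored table has already been restricted to mappings with support of at least 2 and that the threshold is applied as a plain greater-than-or-equal comparison. The column name support_n, the helper filter_support, and the support counts are assumptions for illustration, not taken from babelgene's source.

```r
# Toy stand-in for a table that was already restricted to support >= 2.
# Column names and counts are assumptions made for this example.
prefiltered <- data.frame(
  human_symbol = c("A2M", "PZP", "TP53"),
  symbol       = c("A2m", "A2m", "Trp53"),
  support_n    = c(3L, 2L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical filter: keep rows whose support meets the threshold.
filter_support <- function(df, min_support) {
  df[df$support_n >= min_support, ]
}

# Any threshold at or below 2 keeps every row; only higher values remove rows.
thresholds <- c(-99, 0, 1, 2, 3)
rows_kept  <- vapply(thresholds,
                     function(m) nrow(filter_support(prefiltered, m)),
                     integer(1))
print(data.frame(min_support = thresholds, rows_kept = rows_kept))
```

Under these assumptions, min_support = 1 behaves exactly like min_support = 2 because no row with a support of 1 remains in the table to begin with.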
> perhaps babelgene was using other additional data.frames for these mappings (e.g. babelgene:::mgi_orthologs_df)
No, mgi_orthologs_df is just for testing. It's there as an independent (or at least independently curated) dataset of human-mouse mappings and human/mouse gene symbols in general.
It's great to see you have such extensive testing with orthogene. It's hard to estimate what the margin of error should be for a constantly-evolving dataset, so any independent testing is helpful. I noticed some errors when doing reverse dependency checks with the Bioconductor version, but it looked like those were already fixed on GitHub.
> > even when babelgene::orthologs(min_support = 1) the minimum support threshold is actually 2?
> Yes. It's a purely mathematical definition, so 1, 2, 0, -99 would all return everything. I should probably clarify the documentation.
That would be very helpful, thank you!
> > perhaps babelgene was using other additional data.frames for these mappings (e.g. babelgene:::mgi_orthologs_df)
> No, mgi_orthologs_df is just for testing. It's there as an independent (or at least independently curated) dataset of human-mouse mappings and human/mouse gene symbols in general.
Cool, thanks for confirming.
> It's great to see you have such extensive testing with orthogene. It's hard to estimate what the margin of error should be for a constantly-evolving dataset, so any independent testing is helpful. I noticed some errors when doing reverse dependency checks with the Bioconductor version, but it looked like those were already fixed on GitHub.
Thanks! Unit tests have been a real game changer for my packages since I started using them last year.
Yeah, there were a couple of lingering issues that I pushed fixes for yesterday and today. There's a several-day delay for Bioc to rebuild the package, so I'm crossing my fingers that it gets done (and passes) in time for the next Bioc release (3.15) on April 11th, at which time Bioc 3.14 will be frozen (forever!).
> I believe I found the source of this particular discrepancy. This was a bug within orthogene when renaming certain columns. I've fixed and simplified this part of the code and it seems to be working as expected now, with both the old and the new babelgene:::mgi_orthologs_df
Now that this has been resolved, I've also moved away from retrieving the stored old version of orthologs_df via piggyback (which had its own set of issues described here) and now simply import the up-to-date database via get():
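The snippet that followed is not preserved in this copy of the thread. As a rough sketch of the general technique only, base R's get() combined with asNamespace() can fetch an object by name from a package namespace, whether or not it is exported. The helper name get_from_namespace is hypothetical, and this is not necessarily how orthogene does it.

```r
# Hypothetical helper: fetch an object by name from a package namespace.
get_from_namespace <- function(name, pkg) {
  get(name, envir = asNamespace(pkg), inherits = FALSE)
}

# Demonstrated on a package that ships with R, so the snippet runs anywhere.
sd_fun <- get_from_namespace("sd", "stats")
print(sd_fun(c(1, 2, 3, 4)))

# The analogous call for the object discussed in this thread. It is guarded so
# nothing fails if babelgene is not installed or the object is not present.
if (requireNamespace("babelgene", quietly = TRUE) &&
    exists("orthologs_df", envir = asNamespace("babelgene"), inherits = FALSE)) {
  orthologs_df <- get_from_namespace("orthologs_df", "babelgene")
  print(dim(orthologs_df))
}
```

Reading the object from the installed package keeps the mapping table in step with whichever babelgene version the user has, at the cost of depending on an internal object that may change between releases.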
Thanks again for your help, Brian
I am closing the issue for now. Hopefully any future updates of both packages will go more smoothly. Let me know if you notice anything problematic.
Hello,
Thanks for developing this great package! I'm the maintainer of orthogene, which aims to convert orthologs across a variety of data types (lists, expression matrices, data.frames, etc.). babelgene is one of the core methods that orthogene relies on, the others being homologene and gprofiler2.
There were a couple things I was hoping you could help answer.
Question 1
I noticed some discrepancies between the number of orthologs that are produced when using babelgene in slightly different ways. This of course is expected when tweaking different parameters in babelgene::orthologs, but I find that if I translate genes using the entire babelgene database (babelgene:::orthologs_df) to run my own ortholog conversion, I retrieve far more orthologs (>30,000) than even the most lenient settings with babelgene::orthologs (>15,900). Do you know why this might be?
Note: I default to my whole-database strategy because one of orthogene's features is to let users specify how exactly they want to handle non-1:1 orthologs. You can see the point in the code where I do this here.
Question 2
I noticed that there are fewer orthologs mapped when returning genes back to mouse from human (>29k vs. >30k). But without any filtering steps, the number of genes should remain the same when translating back to mouse, because non-orthologs were already dropped in the previous mouse --> human mapping. Have you observed this as well?
Reprex
Thanks in advance for your help. I hope to work together so we can improve both tools!
Best, Brian
Session info