genes lost in orthology_translate_column

gabora commented 1 year ago

Hi Denes,

the orthology_translate_column is my new favourite function of Omnipath! thanks for pointing to it. I noticed that in some cases it drops some genes, even though I explicitly asks to keep untranslated genes.

Here is a minimum example reproducing the issue:

df = tibble(mouse_symbol = c("Gpr89",  "Lgals7","Zfp809","Syngr2"),
                   values= c(0,1,2,3))

OmnipathR::orthology_translate_column(data = df, 
                                       column = "mouse_symbol",
                                       id_type = "genesymbol",
                                       target_organism = "human",
                                       source_organism = "mouse",
                                       resource = "homologene",
                                       replace = "human_symbol",
                                       one_to_many = 1,
                                       keep_untranslated = TRUE,
                                       translate_complexes = FALSE)

The output is:

# A tibble: 2 × 3
  mouse_symbol human_symbol   tmp
  <chr>        <chr>        <dbl>
1 Zfp809       NA               2
2 Syngr2       SYNGR2           3

Zfp809 was not translated - this is OK, I checked, it is not in NCBI database, but rows for "Gpr89" and "Lgals7" are just lost.

Actually I find Lgals7 on NCBI's homologene database:
https://www.ncbi.nlm.nih.gov/homologene/?term=Mus+musculus+Lgals7

do you know why rows are lost and
why genes in NCBI homologene database are not found?

thanks!

deeenes commented 1 year ago

Hi Attila,

Thanks, I'm glad it's useful! :)

I think those are removed due to the one_to_many = 1: Lgals7 and Gpr89 both have two human orthologues:

df %>% orthology_translate_column(
    mouse_symbol,
    id_type = 'genesymbol',
    source_organism = 'mouse',
    target_organism = 'human',
    resource = 'homologene',
    replace = 'human_symbol',
    keep_untranslated = TRUE
)
# A tibble: 6 × 3
  mouse_symbol human_symbol      values
  <chr>        <chr>              <dbl>
1 Gpr89        GPR89A                 0
2 Gpr89        GPR89B                 0
3 Lgals7       LGALS7B                1
4 Lgals7       LGALS7                 1
5 Zfp809       NA                     2
6 Syngr2       SYNGR2                 3

LGALS7 is one of the few genes HGNC assigns two primary Gene Symbols to, see here the GN rows. It means in humans, two genes (LGALS7 and LGALS7B), about 18k base pairs apart on chromosome 19, encode the same protein. I think it's a rare case, though definitely there are a few similar ones, I found 72 in human SwissProt:

library(OmnipathR)
library(rlang)
library(stringr)
library(dplyr)

all_uniprots(c('accession', 'gene_primary')) %>%
set_names('uniprot', 'genesymbol') %>%
filter(str_detect(genesymbol, ';'))

# A tibble: 72 × 2
   uniprot genesymbol                            
   <chr>   <chr>                                 
 1 O15263  DEFB4A; DEFB4B                        
 2 O75920  SERF1A; SERF1B                        
 3 P01562  IFNA1; IFNA13                         
 4 P02810  PRH1; PRH2                            
 5 P04908  H2AC4; H2AC8                          
 6 P0C0L5  C4B; C4B_2                            
 7 P0C0S8  H2AC11; H2AC13; H2AC15; H2AC16; H2AC17
 8 P0C5Z0  H2AB2; H2AB3                          
 9 P0DN86  CGB3; CGB5; CGB8                      
10 P12532  CKMT1A; CKMT1B                        
# ℹ 62 more rows
# ℹ Use `print(n = ...)` to see more rows

The correct handling really depends on the purpose, if you have a suggestion I'm happy to hear.

The Gpr89 also has two orthologous genes in human, GPR89A and GPR89B, and those are not only encoded by different genes, but also considered different proteins, though their sequence is 100% identical. These were the same protein in UniProt, just like LGALS7, until 2010 when they split the record in two. According to UniProt, LGALS7 is not correctly represented, it should be split, because in SwissProt different genes should be represented in different records.

I think these are interesting examples, but how common they are?

library(OmnipathR)
library(rlang)
library(dplyr)

all_uniprots(c('accession', 'gene_primary', 'sequence')) %>%
set_names(c('ac', 'gs', 'seq')) %>%
group_by(seq) %>%
mutate(n_paralog = n()) %>%
ungroup %>%
filter(n_paralog > 1L) %>%
nrow
[1] 126

Overall, 126 human UniProt IDs have at least one identical paralogue, a few have even 4 such paralogues.

Best,

Denes

gabora commented 1 year ago

Hi Denes,

thanks a lot for detailed explanations, learnt a lot by reading your answer! I see that there is not a clear, biologically correct solution for the problem.

I think I was confused by the documentation a bit, because I thought keep_untranslated = TRUE will guarantee that I will not loose rows in my data frame.

Also I thought (based on the doc) if I set one_to_many to N (1), then I get a maximum N number of orthologues (knowing that the selected orthologues could be incomplete).

I would suggest, either to change a bit the doc string, e.g.:

one_to_many | Integer: maximum number of orthologous pairs for one gene of the source organism. _Genes with more than one_tomany orthologues will be removed.

or (which I would have expected) to change

https://github.com/saezlab/OmnipathR/blob/19fc47cbd053ebdbb5bcaa7dc603dbad54754ca5/R/orthology.R#L340

to

dplyr::slice(1:one_to_many) %>%

anyways, thanks again for the explanation! best

deeenes commented 1 year ago

Thanks Attila, Indeed the docs should be more clear, I'll update it soon. I'm not sure it's good to randomly drop some orthologues if there are too many. If users really wants to do it, they can do it, but I would be hesitant to make it a default behaviour.

saezlab / OmnipathR

genes lost in orthology_translate_column #74