Closed gabora closed 1 year ago
Hi Attila,
Thanks, I'm glad it's useful! :)
I think those are removed due to the one_to_many = 1
: Lgals7 and Gpr89 both have two human orthologues:
df %>% orthology_translate_column(
mouse_symbol,
id_type = 'genesymbol',
source_organism = 'mouse',
target_organism = 'human',
resource = 'homologene',
replace = 'human_symbol',
keep_untranslated = TRUE
)
# A tibble: 6 × 3
mouse_symbol human_symbol values
<chr> <chr> <dbl>
1 Gpr89 GPR89A 0
2 Gpr89 GPR89B 0
3 Lgals7 LGALS7B 1
4 Lgals7 LGALS7 1
5 Zfp809 NA 2
6 Syngr2 SYNGR2 3
LGALS7 is one of the few genes HGNC assigns two primary Gene Symbols to, see here the GN
rows. It means in humans, two genes (LGALS7 and LGALS7B), about 18k base pairs apart on chromosome 19, encode the same protein. I think it's a rare case, though definitely there are a few similar ones, I found 72 in human SwissProt:
library(OmnipathR)
library(rlang)
library(stringr)
library(dplyr)
all_uniprots(c('accession', 'gene_primary')) %>%
set_names('uniprot', 'genesymbol') %>%
filter(str_detect(genesymbol, ';'))
# A tibble: 72 × 2
uniprot genesymbol
<chr> <chr>
1 O15263 DEFB4A; DEFB4B
2 O75920 SERF1A; SERF1B
3 P01562 IFNA1; IFNA13
4 P02810 PRH1; PRH2
5 P04908 H2AC4; H2AC8
6 P0C0L5 C4B; C4B_2
7 P0C0S8 H2AC11; H2AC13; H2AC15; H2AC16; H2AC17
8 P0C5Z0 H2AB2; H2AB3
9 P0DN86 CGB3; CGB5; CGB8
10 P12532 CKMT1A; CKMT1B
# ℹ 62 more rows
# ℹ Use `print(n = ...)` to see more rows
The correct handling really depends on the purpose, if you have a suggestion I'm happy to hear.
The Gpr89 also has two orthologous genes in human, GPR89A and GPR89B, and those are not only encoded by different genes, but also considered different proteins, though their sequence is 100% identical. These were the same protein in UniProt, just like LGALS7, until 2010 when they split the record in two. According to UniProt, LGALS7 is not correctly represented, it should be split, because in SwissProt different genes should be represented in different records.
I think these are interesting examples, but how common they are?
library(OmnipathR)
library(rlang)
library(dplyr)
all_uniprots(c('accession', 'gene_primary', 'sequence')) %>%
set_names(c('ac', 'gs', 'seq')) %>%
group_by(seq) %>%
mutate(n_paralog = n()) %>%
ungroup %>%
filter(n_paralog > 1L) %>%
nrow
[1] 126
Overall, 126 human UniProt IDs have at least one identical paralogue, a few have even 4 such paralogues.
Best,
Denes
Hi Denes,
thanks a lot for detailed explanations, learnt a lot by reading your answer! I see that there is not a clear, biologically correct solution for the problem.
I think I was confused by the documentation a bit, because I thought keep_untranslated = TRUE
will guarantee that I will not loose rows in my data frame.
Also I thought (based on the doc) if I set one_to_many
to N
(1), then I get a maximum N
number of orthologues (knowing that the selected orthologues could be incomplete).
I would suggest, either to change a bit the doc string, e.g.:
one_to_many | Integer: maximum number of orthologous pairs for one gene of the source organism. _Genes with more than one_tomany orthologues will be removed.
or (which I would have expected) to change
to
dplyr::slice(1:one_to_many) %>%
anyways, thanks again for the explanation! best
Thanks Attila, Indeed the docs should be more clear, I'll update it soon. I'm not sure it's good to randomly drop some orthologues if there are too many. If users really wants to do it, they can do it, but I would be hesitant to make it a default behaviour.
Hi Denes,
the
orthology_translate_column
is my new favourite function of Omnipath! thanks for pointing to it. I noticed that in some cases it drops some genes, even though I explicitly asks to keep untranslated genes.Here is a minimum example reproducing the issue:
The output is:
Zfp809 was not translated - this is OK, I checked, it is not in NCBI database, but rows for "Gpr89" and "Lgals7" are just lost.
Actually I find Lgals7 on NCBI's homologene database:
https://www.ncbi.nlm.nih.gov/homologene/?term=Mus+musculus+Lgals7
thanks!