waldronlab / HGNChelper

Identify and correct invalid gene symbols
https://waldronlab.io/HGNChelper/
55 stars 9 forks source link

Alias not mapped if gene is correct #14

Open Aggarwal-Ayush opened 3 years ago

Aggarwal-Ayush commented 3 years ago

Example

> library(HGNChelper)

> hgnc.table[hgnc.table$Symbol == "AAVS1", ]
   Symbol Approved.Symbol chromosome
1:  AAVS1           AAVS1         19
2:  AAVS1        PPP1R12C         19

> checkGeneSymbols("AAVS1")
Maps last updated on: Mon Sep 28 18:31:21 2020
      x Approved Suggested.Symbol
1 AAVS1     TRUE            AAVS1

> mouse.table[mouse.table$Symbol %in% c("Cpamd8", "Mug2"), ]
      Symbol Approved.Symbol
6156    Mug2          Cpamd8
6157  Cpamd8            Mug2
78973   Mug2            Mug2

> checkGeneSymbols(c("Cpamd8", "Mug2"), species = "mouse")
Maps last updated on: Mon Sep 28 18:31:21 2020
       x Approved Suggested.Symbol
1 Cpamd8     TRUE           Cpamd8
2   Mug2     TRUE             Mug2

According to me the desired behaviour for a unit test should be this:

library(testthat)
library(HGNChelper)
res <- checkGeneSymbols("AAVS1")
expect_identical(res$Approved, TRUE)
expect_identical(res$Suggested.Symbol, "AAVS1 /// PPP1R12C")

res <- checkGeneSymbols(c("Cpamd8", "Mug2"), species = "mouse")
expect_identical(res$Approved, c(TRUE, TRUE))
expect_identical(res$Suggested.Symbol, c("Cpamd8 /// Mug2", "Mug2 /// Cpamd8"))

I can apply the fix for it if you think this issue needs fixing. We'll also need to change some unit test for this to apply.

lwaldron commented 3 years ago

If I understand correctly, AAVS1 is a virus integration site within the PPP1R12C gene, and Cpamd8/Mug2 are both synonyms for a protein-coding gene. In the AAVS1/PPP1R12C case they are different things, and don't need to be "corrected" to show both. In the second case, my reading of https://www.alliancegenome.org/gene/MGI:99836 is that both are correct, so expansion would be informational but not necessary.

If I'm right about these cases, I would be inclined to keep the current default behavior. But you could add a flag like expand.ambiguous (default FALSE) that would do the expansion you're talking about when set to TRUE (using the word "ambiguous" rather than "synonym" or "equivalent" since we don't have the information in the maps to know why the mapping is present in both directions. Does this make sense @Aggarwal-Ayush @shbrief ?

shbrief commented 3 years ago

@lwaldron Yes, it makes sense. I also prefer the current behavior - as far as the input is an approved symbol, providing additional aliases seems to be an unnecessary complication. It would be fine to add a expand.ambiguous flag with default FALSE.

Aggarwal-Ayush commented 3 years ago

Yes, this makes sense for the cases that I have presented. I was just worried about other such cases if present and also as the second reviewer mentioned, it could be that an older dataset is referring to an alias of the gene which is now also an approved symbol. In any case, expand.ambigous should solve this issue.

Rendan86 commented 3 years ago

Hi, I am quite the n00b around here. So first of all, please accept my apologies if I am writing in the wrong place. I just found this package and I am so very happy, as I have been struggling with Symbol standardization.

I don´t think I should open a new issue here, as my question is related to this current one. I have gone through all the documentation but I still find myself clueless... For Symbols in "x" that HGNC does not have a current "Approved" symbol, and more than one are suggested (maybe previous or alias according to HGNC database), how would you suggest to proceed so I end with a 1:1 symbol output that I can use to normalize nomenclature through several datasets?

Thanks in advance!

EDIT: After working a little more with the package, I may have found a real issue. When a list of symbols contains 2 entries for "2-Mar", they can stand for MARC2 or MARCH2, which are currently identified with an approved symbol in HGNC as MTARC2 and MARCHF2, respectively. However, $findExcelGeneSymbols() maps both only as MTARC2

EDIT2: After further inspection, I found that actually the output gene list of findExcelGeneSymbols() only changes the first "2-Mar" to MTARC2, the other one remains "2-Mar"....

lwaldron commented 3 years ago

Just a quick reply in generalities, can provide more specific code if helpful. There isn't a a 1:1 relationship between aliased and current symbols, but if you are not too concerned by losing some genes, you might choose to eliminate genes that mapped to multiple symbols (those with "///" in the approved column). Or you might split the "///" onto multiple rows, so that an ambiguous original symbol gets split into all its multitude possibilities. Finally, if somehow you know which chromosome the genes were on, providing the chromosome argument can help resolve those ambiguities.

lwaldron commented 3 years ago

I see what you mean @Rendan86 :

> checkGeneSymbols("2-Mar")
Maps last updated on: Thu Mar 25 08:36:49 2021
      x Approved   Suggested.Symbol
1 2-Mar    FALSE MARCHF2 /// MTARC2
Warning messages:
1: In checkGeneSymbols("2-Mar") :
  Human gene symbols should be all upper-case except for the 'orf' in open reading frames. The case of some letters was corrected.
2: In checkGeneSymbols("2-Mar") : x contains non-approved gene symbols
> findExcelGeneSymbols("2-Mar")
[1] "MTARC2"
Warning message:
In findExcelGeneSymbols("2-Mar") :
  Transmogrified gene symbols found.  Returning the following corrections: 2-Mar to MTARC2
> 

findExcelGeneSymbols does not do a check against all aliases, and I actually think that it no longer serves a useful purpose in the package. I would recommend using checkGeneSymbols instead, and I'll deprecate findExcelGeneSymbols unless I hear complaints.