igordot / msigdbr

MSigDB gene sets for multiple organisms in a tidy data format
https://igordot.github.io/msigdbr

Add shorter GO descriptions? #19

Open TylerSagendorf opened 2 years ago

TylerSagendorf commented 2 years ago

The entries in the gs_description column for GO terms are rather long and not ideal for use as human-readable identifiers when plotting ORA or GSEA results. Would it be possible to add a gs_brief_description column that uses the names from the appropriate GO database release? I have been getting the data using the code below and then left-joining it to ORA and GSEA results tables made with fgsea. For other databases, I just use the entries in gs_description.

# install.packages(c("ontologyIndex", "dplyr"))
library(ontologyIndex)
library(dplyr)

# Brief GO term descriptions (use same data from MSigDB release notes)
file <- "http://release.geneontology.org/2021-12-15/ontology/go-basic.obo"
go_basic_list <- get_OBO(file,
                         propagate_relationships = "is_a",
                         extract_tags = "minimal")

# Convert to data.frame with fewer columns
go_basic_df <- as.data.frame(go_basic_list) %>%
  filter(!obsolete) %>%
  select(pathway = id, name)
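
The left join mentioned above might look roughly like the following sketch. It does not download anything: `go_names` is a small hypothetical stand-in for `go_basic_df`, the GO IDs are placeholders rather than real accessions, and `fgsea_res` is a made-up table that only mimics the `pathway`, `NES`, and `padj` columns of an fgsea result. It also assumes the gene sets passed to fgsea were named by GO ID, so that `pathway` holds IDs rather than `gs_name` values.

```r
library(dplyr)

# Hypothetical stand-in for go_basic_df: GO ID -> brief name.
# The IDs are placeholders, not real GO accessions.
go_names <- data.frame(
  pathway = c("GO:placeholder1", "GO:placeholder2"),
  name = c("5-phosphoribose 1-diphosphate metabolic process",
           "mRNA processing"),
  stringsAsFactors = FALSE
)

# Made-up results table mimicking a few fgsea output columns.
fgsea_res <- data.frame(
  pathway = c("GO:placeholder1", "GO:placeholder2", "GO:placeholder3"),
  NES = c(1.8, -1.2, 0.4),
  padj = c(0.01, 0.20, 0.90),
  stringsAsFactors = FALSE
)

# Keep every result row; rows without a matching GO ID get NA in `name`.
res_labeled <- left_join(fgsea_res, go_names, by = "pathway")
res_labeled
```

Using a left join rather than an inner join keeps results for any set whose ID is missing from the ontology file, so those rows show up with a missing label instead of silently disappearing.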

igordot commented 2 years ago

Thank you for the suggestion. Currently, the package only reformats the original MSigDB data for easier access. This might be outside its scope, but it is certainly worth considering.

To clarify, this is really an aesthetic change to make the name easier to read, right? For example, GOBP_5_PHOSPHORIBOSE_1_DIPHOSPHATE_METABOLIC_PROCESS becomes 5-phosphoribose 1-diphosphate metabolic process and GOBP_ACTIVATION_OF_CYSTEINE_TYPE_ENDOPEPTIDASE_ACTIVITY_INVOLVED_IN_APOPTOTIC_PROCESS_BY_CYTOCHROME_C becomes activation of cysteine-type endopeptidase activity involved in apoptotic process by cytochrome c.

TylerSagendorf commented 2 years ago

> To clarify, this is really an aesthetic change to make the name easier to read, right? For example, GOBP_5_PHOSPHORIBOSE_1_DIPHOSPHATE_METABOLIC_PROCESS becomes 5-phosphoribose 1-diphosphate metabolic process and GOBP_ACTIVATION_OF_CYSTEINE_TYPE_ENDOPEPTIDASE_ACTIVITY_INVOLVED_IN_APOPTOTIC_PROCESS_BY_CYTOCHROME_C becomes activation of cysteine-type endopeptidase activity involved in apoptotic process by cytochrome c.

Yeah that's really all it is. Another solution would be to replace the underscores with spaces and change all text to lowercase, but that would remove intentional capitalization (such as with "mRNA") and characters that were replaced by underscores (like the dashes in your examples).
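
A minimal sketch of that string-only alternative, using base R, shows what gets lost. The helper name `naive_label` is hypothetical, the first two set names are the ones quoted in this thread, and `GOBP_MRNA_PROCESSING` is included only to illustrate the capitalization problem.

```r
# Set names quoted earlier in this thread, plus one illustrative mRNA example.
gs_names <- c(
  "GOBP_5_PHOSPHORIBOSE_1_DIPHOSPHATE_METABOLIC_PROCESS",
  "GOBP_ACTIVATION_OF_CYSTEINE_TYPE_ENDOPEPTIDASE_ACTIVITY_INVOLVED_IN_APOPTOTIC_PROCESS_BY_CYTOCHROME_C",
  "GOBP_MRNA_PROCESSING"
)

# Hypothetical helper: drop the GOBP/GOCC/GOMF prefix,
# replace underscores with spaces, and lowercase everything.
naive_label <- function(x) {
  x <- sub("^GO(BP|CC|MF)_", "", x)
  tolower(gsub("_", " ", x, fixed = TRUE))
}

labels <- naive_label(gs_names)
labels
# "5 phosphoribose 1 diphosphate metabolic process"
# "activation of cysteine type endopeptidase activity involved in apoptotic process by cytochrome c"
# "mrna processing"
```

The hyphens in "5-phosphoribose 1-diphosphate" and "cysteine-type" cannot be recovered because the underscores that replaced them are indistinguishable from the ones that replaced spaces, and "mRNA" comes back as "mrna".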

igordot commented 2 years ago

Yes, the original non-alphanumeric characters and capitalization are probably the most valuable aspect, and that can't be automatically fixed.