ramiromagno / gwasrapidd

gwasrapidd: an R package to query, download and wrangle GWAS Catalog data
https://rmagno.eu/gwasrapidd/
Other
89 stars 15 forks source link

Is it possible to add additional gene information? #24

Closed peranti closed 2 years ago

peranti commented 2 years ago

Hi Ramiro, Thanks for this amazing package.

Would it be possible to add Biotype in the association object?

For example, GATA3 is of protein_coding type. Source: https://www.ebi.ac.uk/gwas/genes/GATA3

Thanks.

ramiromagno commented 2 years ago

Hi Pradeep,

Thank you for your question! Let me look into that to see if that piece of information is provided by the GWAS REST API.

ramiromagno commented 2 years ago

Hi again Pradeep,

It seems that that information about the genes is only provided at the GWAS Catalog website, not via REST API at the moment.

I looked a bit though how the front-end website retrieves that information, and I found the URL being requested. So I wrote a bit of a hack to get that info into R: the get_gene() function. Then get_genes() is just a vectorised version of get_gene().

Finally I exemplify how you could use these functions to augment the genes table in the associations object. I hope this solution is helpful for the time being.

library(dplyr, warn.conflicts = FALSE)
library(glue, warn.conflicts = FALSE)
library(httr, warn.conflicts = FALSE)
library(tidyjson, warn.conflicts = FALSE)
library(purrr)
library(gwasrapidd, warn.conflicts = FALSE)

# For retrieving info about one gene only.
get_gene <- function(gene_name) {

  gene_name <- URLencode(gene_name)
  url <- glue::glue("https://www.ebi.ac.uk/gwas/api/search?q=title%3A+%22{gene_name}%22+AND+resourcename%3Agene")
  response <- httr::GET(url)
  json <- httr::content(response, 'text', encoding = 'UTF-8')

  tbl <-
    json %>%
    tidyjson::enter_object('response', 'docs') %>%
    tidyjson::gather_array() %>%
    tidyjson::spread_values(
      gene_name = tidyjson::jstring('title'),
      description = tidyjson::jstring('description'),
      chromosome_name = tidyjson::jstring('chromosomeName'),
      chromosome_start = tidyjson::jnumber('chromosomeStart'),
      chromosome_end = tidyjson::jnumber('chromosomeEnd'),
      cytogenetic_band = tidyjson::jstring('cytobands'),
      biotype = tidyjson::jstring('biotype')
    ) %>%
    tidyjson::as_tibble() %>%
    dplyr::select(-c('document.id', 'array.index'))

  return(tbl)
}

# Maps over genes
get_genes <- function(gene_names) {
  purrr::map_dfr(gene_names, get_gene)
}

# As an example, let us get some studies about rheumatoid arthritis.
my_studies <- get_studies(efo_trait = 'rheumatoid arthritis')

# Going after those associations related to the first study found only (for the
# sake of simplicity)
one_study <- my_studies[1]
my_associations <- get_associations(study_id = one_study@studies$study_id)
gene_info <- get_genes(my_associations@genes$gene_name)

my_associations@genes <- dplyr::left_join(my_associations@genes, gene_info, by = 'gene_name')

# The slot `genes` of the `associations` object is now augmented to include
# those extra details about each gene.
my_associations@genes
#> # A tibble: 5 × 9
#>   association_id locus_id gene_name description chromosome_name chromosome_start
#>   <chr>             <int> <chr>     <chr>       <chr>                      <dbl>
#> 1 17729                 1 BLK       BLK proto-… 8                       11486894
#> 2 17731                 1 ARHGEF3   Rho guanin… 3                       56727418
#> 3 17883                 1 HLA-DRB1  major hist… 6                       32578769
#> 4 17730                 1 TRHDE     thyrotropi… 12                      72087266
#> 5 40454                 1 PADI4     peptidyl a… 1                       17308195
#> # … with 3 more variables: chromosome_end <dbl>, cytogenetic_band <chr>,
#> #   biotype <chr>
peranti commented 2 years ago

Thanks a lot, @ramiromagno, for this quick hack.

I have tested these functions briefly and made the following observation. If the input contains HLA to the get_genes() function, it fetches the information for even HLA-E, HLA-F etc. Similarly, if the input contains C10orf95, then it will fetch the C10orf95-AS1.

ramiromagno commented 2 years ago

Indeed! It seems that the gene being queried is used as a keyword that will match any gene name containing that keyword. But do not worry, when you do the dplyr::left_join() by gene_name, it has to match exactly. So the final table obtained in my_associations@genes should be as expected.