ramiromagno / hgnc

Download and import HGNC gene data into R
https://rmagno.eu/hgnc

Contributions #1

Closed jemunro closed 11 months ago

jemunro commented 1 year ago

Hello,

Thanks for putting together this package, it's very nicely done.

I have an unreleased hgnc package that I was thinking of releasing to CRAN. However, it might be a good idea to pool our efforts rather than having multiple hgnc packages with different functionalities.

Are you open to contributions?

Some of the features implemented in my package that could be ported over are:

  1. gene identifier conversion convenience functions, e.g.:
    • hgnc2ensembl & ensembl2hgnc (converts hgnc_id to ensembl_gene_id and vice versa)
    • sym2hgnc & hgnc2sym (converts symbol to hgnc_id and vice versa)
    • etc.
  2. optional persistent disk-based caching
    • use_cache(dir = '~/.hgnc_cache')
    • works best with quarterly/monthly releases:
      • e.g. use_hgnc_version(ver = latest_monthly())
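To make the caching idea concrete, here is a minimal base R sketch of how a disk cache keyed by release could behave. It is only an illustration under assumptions: `cache_fetch()`, `fake_download()` and the version label "2024-01" are hypothetical names and are not taken from either package.

```r
# Hypothetical helper (not part of {hgnc}): run `fetch()` once per `key`
# and reuse the result saved on disk afterwards.
cache_fetch <- function(key, fetch, dir = file.path(tempdir(), "hgnc_cache")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(dir, paste0(key, ".rds"))  # `key` must be a valid file name
  if (file.exists(path)) {
    return(readRDS(path))  # cache hit: skip the download
  }
  value <- fetch()         # cache miss: do the expensive work once
  saveRDS(value, path)
  value
}

# Stand-in for a real download; counts how often it actually runs.
n_fetches <- 0
fake_download <- function() {
  n_fetches <<- n_fetches + 1
  data.frame(hgnc_id = "HGNC:2470", symbol = "CSRP2")
}

demo_dir <- tempfile("hgnc_cache_")
first  <- cache_fetch("2024-01", fake_download, dir = demo_dir)  # runs fake_download()
second <- cache_fetch("2024-01", fake_download, dir = demo_dir)  # read back from disk
```

The sketch writes to a temporary directory so it cleans up after itself; a persistent cache would point `dir` at a user-level location such as the '~/.hgnc_cache' mentioned above. Keying the cache by release is why fixed monthly or quarterly versions fit well: the file for a given release never goes stale.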

ramiromagno commented 1 year ago

Hello Jacob,

Thank you for reaching out, and I appreciate your kind words about the package.

Your suggestion to collaborate on the {hgnc} package sounds like a great idea!

Feel free to make a pull request, and add your name to the authors in the DESCRIPTION file.

ramiromagno commented 1 year ago

What do you say to an approach based on a left join?

hgnc_id_to_ensembl_id <- function(hgnc_id, file = latest_archive_url()) {

  # Use the `file` argument; `url` is not defined inside this function
  hgnc_dataset <- import_hgnc_dataset(file)
  hgnc_tbl <- tibble::tibble(hgnc_id)

  dplyr::left_join(x = hgnc_tbl, y = hgnc_dataset[c("hgnc_id", "ensembl_gene_id")], by = "hgnc_id")

}

hgnc_id_to_ensembl_id(c("HGNC:37133", "HGNC:23336"))
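
For reference, this is the behaviour a left join gives: one row per input ID, in input order, with NA where no match exists. The sketch below uses a small toy table instead of the real dataset so it runs offline; `toy_hgnc` and the unmatched ID are made up for illustration, and the matched rows reuse values from the example output further down this thread.

```r
library(dplyr)

# Toy stand-in for the HGNC dataset (not the real import).
toy_hgnc <- tibble::tibble(
  hgnc_id = c("HGNC:2470", "HGNC:6505", "HGNC:36858"),
  ensembl_gene_id = c("ENSG00000175183", "ENSG00000233924", NA)
)

# The second ID is deliberately invalid to show the unmatched case.
ids <- c("HGNC:6505", "HGNC:not_a_real_id", "HGNC:2470")

result <- left_join(
  x = tibble::tibble(hgnc_id = ids),
  y = toy_hgnc,
  by = "hgnc_id"
)
result
```

The output keeps all three input rows in their original order, so it can be bound back to whatever the IDs came from.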

jemunro commented 1 year ago

Hi Ramiro,

Thanks for your response.

I added a PR to add disk caching - see https://github.com/ramiromagno/hgnc/pull/3. Very happy to make changes if you have suggestions.

I like your left_join approach, but sometimes it's useful to have a vector-based approach rather than a data.frame-based one. How about something like this, which implements both:

get_hgnc_key_table <- function(key, columns, unique=TRUE) {

  key_table <-
    import_hgnc_dataset() %>%
    dplyr::select(key = dplyr::all_of(key),
                  dplyr::all_of(columns)) %>%
    dplyr::filter(!is.na(key)) %>%
    dplyr::distinct()

  if (unique) {
    key_table <-
      key_table %>%
      dplyr::add_count(key) %>%
      dplyr::filter(n == 1) %>%
      dplyr::select(-n)
  } else {
    key_table <-
      key_table %>%
      tidyr::chop(-key)
  }

  # Restore the original name of the key column and return the table visibly
  key_table %>%
    dplyr::rename_with(.fn = ~ key, .cols = key)
}

hgnc_join <- function(.data,
                      by = 'hgnc_id',
                      columns = 'symbol',
                      unique = TRUE) {

  key_table <- get_hgnc_key_table(
    key = unname(by),
    columns = columns,
    unique = unique
  )

  dplyr::left_join(
    .data,
    key_table,
    by = by
  )

}

hgnc_convert <- function(x,
                         from = 'hgnc_id',
                         to = 'symbol',
                         unique = TRUE) {

  key_table <-
    get_hgnc_key_table(
      key = from,
      columns = to,
      unique = unique
    ) %>%
    dplyr::rename(
      from = dplyr::all_of(from),
      to = dplyr::all_of(to)
    )

  with(key_table, to[match(x, from)])
}

For example:

my_ids <- c('HGNC:15284', 'HGNC:2470', 'HGNC:6505', 'HGNC:36858', 'HGNC:45641')
my_symbols <- hgnc_convert(my_ids, from = 'hgnc_id', to = 'symbol')
my_symbols
[1] "OR5D17P"   "CSRP2"     "RPSAP13"   "RPLP1P12"  "RNU7-107P"
dplyr::tibble(symbol = my_symbols) %>%
  hgnc_join(by = 'symbol',
            columns = c('hgnc_id', 'ensembl_gene_id', 'entrez_id'))
# A tibble: 5 × 4
  symbol    hgnc_id    ensembl_gene_id entrez_id
  <chr>     <chr>      <chr>               <int>
1 OR5D17P   HGNC:15284 ENSG00000181837     81196
2 CSRP2     HGNC:2470  ENSG00000175183      1466
3 RPSAP13   HGNC:6505  ENSG00000233924      3923
4 RPLP1P12  HGNC:36858 NA                 646566
5 RNU7-107P HGNC:45641 ENSG00000238523 106479067
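
One detail worth spelling out is what `unique = TRUE` means for the vector-based conversion: keys that map to more than one value are dropped, so they convert to NA instead of picking one value arbitrarily. Here is a base R sketch of that rule on a toy table; the "DUP" key and the "V1"/"V2" values are invented for illustration and are not real HGNC records.

```r
# Toy key table: "DUP" maps to two different values.
toy <- data.frame(
  key   = c("CSRP2", "DUP", "DUP"),
  value = c("HGNC:2470", "V1", "V2"),
  stringsAsFactors = FALSE
)

# Analogue of unique = TRUE: keep only keys that occur exactly once.
counts <- table(toy$key)
unique_tbl <- toy[toy$key %in% names(counts)[counts == 1], ]

# Vector in, vector out: same length and order as the input.
convert <- function(x) unique_tbl$value[match(x, unique_tbl$key)]

out <- convert(c("DUP", "CSRP2", "MISSING"))
out
```

Both the ambiguous key and the unknown key come back as NA, and the result always has the same length as the input, which is what makes it safe to use inside `dplyr::mutate()`.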

Let me know your thoughts, happy to make a pull request for this.

ramiromagno commented 1 year ago

I have commented on your pull request.