Add support for different frequency databases to ncbi_snp_query_api()

jooolia commented 4 years ago

As brought up in #92, it would be nice to add support and tests for databases other than gnoMaD when using ncbi_snp_query_api().

sinarueeger commented 4 years ago

Test

For example something like this (or maf_resource and MAF concatenated):

  aa <- ncbi_snp_query("rs1610720", maf_resource = c("Estonian", "TOPMED"))
  expect_equal(aa$maf_resource,  c("Estonian", "TOPMED"))
  expect_equal(aa$MAF, c(0.307143, 0.421397))

sinarueeger commented 3 years ago

@jooolia dbSNP has changed how they report allele frequencies. It seems it is more standardized now: https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/

Regardless - the problem is in a way that previously we reported one line per SNP, but now we would get several lines per SNP due to the many populations. For example for rs45 it would be:

           study ref_seq Minor       MAF
1     1000Genomes       A     C 0.7907348
2          ALSPAC       A     C 0.7088739
17  dbGaP_PopFreq       A     C 0.7120658
3        Estonian       A     C 0.6812500
4       GENOME_DK       A     C 0.7750000
5          GnomAD       A     C 0.7583270
6            GoNL       A     C 0.6492986
7          HapMap       A     C 0.8384146
9         Korea1K       A     C 0.7581878
8          KOREAN       A     C 0.7590444
18         KOREAN       A     G 0.0000000
19         KOREAN       A     T 0.0000000
10 NorthernSweden       A     C 0.6450000
11         Qatari       A     C 0.8148148
12       SGDP_PRJ       A     C 0.8250000
13       Siberian       A     C 0.6875000
14         TOPMED       A     C 0.7783990
15        TWINSUK       A     C 0.7192557
16     Vietnamese       A     C 0.7500000

One workaround I see, is to add the mafs as a list to the dataframe. I implemented this tentatively (and far from perfect) in this branch using tibble: https://github.com/ropensci/rsnps/tree/T97/sr.

out <- ncbi_snp_query(c("rs45", "rs44"))
out

## query chromosome     bp class rsid  gene  alleles ancestral_allele variation_allele seqname
##  <chr> <chr>       <dbl> <chr> <chr> <chr> <chr>   <chr>            <chr>            <chr>  
## 1 rs45  7          1.15e7 snv   rs45  THSD… A,C,G,T A                C,G,T            NC_000…
## 2 rs44  7          1.15e7 snv   rs44  THSD… A,C,T   A                C,T              NC_000…
# … with 6 more variables: hgvs <chr>, assembly <chr>, ref_seq <chr>, minor <chr>, maf <dbl>,
#   maf_population <list>

mat$maf_population[[1]][1,]
##        study ref_seq Minor       MAF
## 1 1000Genomes       A     C 0.7907348

The downside is, that we would need to import tibble.

We could have an additional support function that would pull out the frequencies of a particular study, e.g. ncbi_frequency(out, study = "goNL").

What do you think?

jooolia commented 3 years ago

Hi @sinarueeger, I think it looks nice and the code quite reasonable. One concern I might have is if users are used to getting one value back and then they get a list col would that be problematic? (just brainstorming here, not sure if it makes sense) but would it make sense or be possible to have one study returned from ncbi_snp_query() and then ncbi_frequency() could return a list of frequencies?

sinarueeger commented 3 years ago

Yes, that could be an option, so:

Option 1: we'd keep ncbi_snp_query() as is, and add ncbi_frequency() to pull out the extra frequencies.

Option 2: we keep ncbi_snp_query() as is, including the maf column, but add the list of maf_population.

Pro option 1: nothing breaks, Con: users will need to use an extra function, extract the population, then join it to chr and pos. Pro option2: All compact Con: users will need to deal with lists in dataframes (we could provide a neat function though).

Not sure what is best practices here.

What do you think @jooolia?

jooolia commented 3 years ago

I think that option 2 would be the most straightforward to me. :) What do you think? Do you want to open a draft PR (or not draft!) and we can discuss further?

sinarueeger commented 3 years ago

Hi @jooolia, let's go for option 2 then. I can open a draft PR some time this week (or whoever is first).

jooolia commented 2 years ago

done in #142

ropensci / rsnps

Add support for different frequency databases to ncbi_snp_query_api() #97

Test