Closed jooolia closed 2 years ago
For example something like this (or maf_resource
and MAF
concatenated):
aa <- ncbi_snp_query("rs1610720", maf_resource = c("Estonian", "TOPMED"))
expect_equal(aa$maf_resource, c("Estonian", "TOPMED"))
expect_equal(aa$MAF, c(0.307143, 0.421397))
@jooolia dbSNP has changed how they report allele frequencies. It seems it is more standardized now: https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/
Regardless - the problem is in a way that previously we reported one line per SNP, but now we would get several lines per SNP due to the many populations. For example for rs45
it would be:
study ref_seq Minor MAF
1 1000Genomes A C 0.7907348
2 ALSPAC A C 0.7088739
17 dbGaP_PopFreq A C 0.7120658
3 Estonian A C 0.6812500
4 GENOME_DK A C 0.7750000
5 GnomAD A C 0.7583270
6 GoNL A C 0.6492986
7 HapMap A C 0.8384146
9 Korea1K A C 0.7581878
8 KOREAN A C 0.7590444
18 KOREAN A G 0.0000000
19 KOREAN A T 0.0000000
10 NorthernSweden A C 0.6450000
11 Qatari A C 0.8148148
12 SGDP_PRJ A C 0.8250000
13 Siberian A C 0.6875000
14 TOPMED A C 0.7783990
15 TWINSUK A C 0.7192557
16 Vietnamese A C 0.7500000
One workaround I see, is to add the mafs as a list to the dataframe. I implemented this tentatively (and far from perfect) in this branch using tibble
: https://github.com/ropensci/rsnps/tree/T97/sr.
out <- ncbi_snp_query(c("rs45", "rs44"))
out
## query chromosome bp class rsid gene alleles ancestral_allele variation_allele seqname
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 rs45 7 1.15e7 snv rs45 THSD… A,C,G,T A C,G,T NC_000…
## 2 rs44 7 1.15e7 snv rs44 THSD… A,C,T A C,T NC_000…
# … with 6 more variables: hgvs <chr>, assembly <chr>, ref_seq <chr>, minor <chr>, maf <dbl>,
# maf_population <list>
mat$maf_population[[1]][1,]
## study ref_seq Minor MAF
## 1 1000Genomes A C 0.7907348
The downside is, that we would need to import tibble
.
We could have an additional support function that would pull out the frequencies of a particular study, e.g.
ncbi_frequency(out, study = "goNL")
.
What do you think?
Hi @sinarueeger, I think it looks nice and the code quite reasonable. One concern I might have is if users are used to getting one value back and then they get a list col would that be problematic? (just brainstorming here, not sure if it makes sense) but would it make sense or be possible to have one study returned from ncbi_snp_query() and then ncbi_frequency() could return a list of frequencies?
Yes, that could be an option, so:
Option 1: we'd keep ncbi_snp_query()
as is, and add ncbi_frequency()
to pull out the extra frequencies.
Option 2: we keep ncbi_snp_query()
as is, including the maf
column, but add the list of maf_population
.
Pro option 1: nothing breaks, Con: users will need to use an extra function, extract the population, then join it to chr and pos. Pro option2: All compact Con: users will need to deal with lists in dataframes (we could provide a neat function though).
Not sure what is best practices here.
What do you think @jooolia?
I think that option 2 would be the most straightforward to me. :) What do you think? Do you want to open a draft PR (or not draft!) and we can discuss further?
Hi @jooolia, let's go for option 2 then. I can open a draft PR some time this week (or whoever is first).
done in #142
As brought up in #92, it would be nice to add support and tests for databases other than gnoMaD when using ncbi_snp_query_api().