ropensci / europepmc

R Interface to Europe PMC RESTful Web Service
https://docs.ropensci.org/europepmc

help re: understanding number of hits? #17

Closed sje30 closed 7 years ago

sje30 commented 7 years ago

Hi, I'm trying to compare a search run directly on the Europe PMC website with the same search run through the package. For example:

https://europepmc.org/search?query=malaria

says it returns 142856 hits. Yet, the same search in R returns:

> epmc_profile('malaria')
$source
# A tibble: 10 × 2
    name  count
*  <chr>  <int>
1    AGR    121
2    CBA    113
3    CTX      7
4    ETH    179
5    HIR      4
6    MED 115866
7    PAT   2252
8    CIT      0
9    PMC  10363
10   PPR      2

$pubType
# A tibble: 5 × 2
                 name  count
*               <chr>  <int>
1                 ALL 128907
2           FULL TEXT  79224
3         OPEN ACCESS  34723
4              REVIEW  15722
5 BOOKS AND DOCUMENTS     97

$subset
# A tibble: 1 × 2
   name count
* <chr> <int>
1    BL     3

How do I reconcile the two different counts?

njahn82 commented 7 years ago

Europe PMC search on the web uses synonym search by default, whereas the API does not. epmc_search() has a parameter for this, but apparently it does not work.

So, while I try to fix the problem, you could simply add the synonym parameter to your query:

europepmc::epmc_profile("malaria&synonym=TRUE")
#> $source
#> # A tibble: 10 × 2
#>     name  count
#> *  <chr>  <int>
#> 1    AGR    121
#> 2    CBA    118
#> 3    CTX      8
#> 4    ETH    239
#> 5    HIR      4
#> 6    MED 129247
#> 7    PAT   2295
#> 8    CIT      0
#> 9    PMC  10831
#> 10   PPR      2
#> 
#> $pubType
#> # A tibble: 5 × 2
#>                  name  count
#> *               <chr>  <int>
#> 1                 ALL 142865
#> 2           FULL TEXT  81313
#> 3         OPEN ACCESS  35226
#> 4              REVIEW  16962
#> 5 BOOKS AND DOCUMENTS     97
#> 
#> $subset
#> # A tibble: 1 × 2
#>    name count
#> * <chr> <int>
#> 1    BL     3
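
Why the workaround above can work, as a rough sketch rather than the package's actual code: if the query string is pasted into the request URL without escaping, anything after `&` is read by the server as a separate parameter. The helper `build_url()` and the base URL below are assumptions made for illustration only.

```r
# Hypothetical helper (not part of europepmc): pastes the query into the URL
# without escaping it. The base URL is an assumption for illustration.
build_url <- function(query,
                      base = "https://www.ebi.ac.uk/europepmc/webservices/rest/profile") {
  paste0(base, "?query=", query, "&format=json")
}

# Unescaped: "&synonym=TRUE" becomes its own URL parameter.
build_url("malaria&synonym=TRUE")

# Escaped: the same text would be treated as part of the search term instead.
utils::URLencode("malaria&synonym=TRUE", reserved = TRUE)
```

If the query were escaped first, the `&` and `=` would be percent-encoded and the trick would stop working, which is one reason a proper `synonym` argument is the more robust option.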

njahn82 commented 7 years ago

Alright, please re-install using devtools. Synonym search for epmc_profile() and epmc_search() is now activated by default to avoid confusion.

devtools::install_github("ropensci/europepmc")
#> Skipping install of 'europepmc' from a github remote, the SHA1 (4b3c1935) has not changed since last install.
#>   Use `force = TRUE` to force installation
# synonym search is on
europepmc::epmc_profile("malaria")
#> $source
#> # A tibble: 10 × 2
#>     name  count
#> *  <chr>  <int>
#> 1    AGR    121
#> 2    CBA    118
#> 3    CTX      8
#> 4    ETH    239
#> 5    HIR      4
#> 6    MED 129247
#> 7    PAT   2295
#> 8    CIT      0
#> 9    PMC  10831
#> 10   PPR      2
#> 
#> $pubType
#> # A tibble: 5 × 2
#>                  name  count
#> *               <chr>  <int>
#> 1                 ALL 142865
#> 2           FULL TEXT  81313
#> 3         OPEN ACCESS  35226
#> 4              REVIEW  16962
#> 5 BOOKS AND DOCUMENTS     97
#> 
#> $subset
#> # A tibble: 1 × 2
#>    name count
#> * <chr> <int>
#> 1    BL     3
# synonym search is deactivated
europepmc::epmc_profile("malaria", synonym = FALSE)
#> $source
#> # A tibble: 10 × 2
#>     name  count
#> *  <chr>  <int>
#> 1    AGR    121
#> 2    CBA    113
#> 3    CTX      7
#> 4    ETH    179
#> 5    HIR      4
#> 6    MED 115875
#> 7    PAT   2252
#> 8    CIT      0
#> 9    PMC  10363
#> 10   PPR      2
#> 
#> $pubType
#> # A tibble: 5 × 2
#>                  name  count
#> *               <chr>  <int>
#> 1                 ALL 128916
#> 2           FULL TEXT  79224
#> 3         OPEN ACCESS  34723
#> 4              REVIEW  15722
#> 5 BOOKS AND DOCUMENTS     97
#> 
#> $subset
#> # A tibble: 1 × 2
#>    name count
#> * <chr> <int>
#> 1    BL     3
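
To see where the extra hits come from, the two `$source` tables above can be put side by side. This is only a sketch in base R: the counts are copied by hand from the output in this thread (they will drift as the database grows), and the column names are made up for the example.

```r
# Counts copied from the epmc_profile("malaria") output shown in this thread.
source_names    <- c("AGR", "CBA", "CTX", "ETH", "HIR", "MED", "PAT", "CIT", "PMC", "PPR")
with_synonym    <- c(121, 118, 8, 239, 4, 129247, 2295, 0, 10831, 2)
without_synonym <- c(121, 113, 7, 179, 4, 115875, 2252, 0, 10363, 2)

comparison <- data.frame(
  name            = source_names,
  with_synonym    = with_synonym,
  without_synonym = without_synonym,
  extra_hits      = with_synonym - without_synonym  # hits added by synonym expansion
)

# Sources that gain hits when synonym search is switched on.
comparison[comparison$extra_hits > 0, ]

# Total extra hits across all sources.
sum(comparison$extra_hits)
#> [1] 13949
```

The total of 13949 equals the gap between the two ALL counts above (142865 vs 128916), and most of it comes from MED.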

sje30 commented 7 years ago

Thank you!