ropensci / europepmc

R Interface to Europe PMC RESTful Web Service
https://docs.ropensci.org/europepmc

Different number of records returned #39

Open Dobrokhotov1989 opened 3 years ago

Dobrokhotov1989 commented 3 years ago

Hi, I noticed a small discrepancy between epmc_hits() and epmc_search(): for an identical query, they report different numbers of records. Do these functions handle the query arguments differently? epmc_hits() consistently reports fewer records than epmc_search(), although its count matches the one shown on the Europe PMC website. In my script, I use epmc_hits() to set the limit for epmc_search() so that I download all records. Because of this difference in behavior, I will systematically lose some records.

Here is a reprex:

single <- "crispr"
europepmc::epmc_hits(query = single)
#> [1] 74467
europepmc::epmc_search(query = single)
#> 75450 records found, returning 100
#> # A tibble: 100 x 28
#>    id    source pmid  pmcid doi   title authorString journalTitle journalVolume
#>    <chr> <chr>  <chr> <chr> <chr> <chr> <chr>        <chr>        <chr>        
#>  1 3413~ MED    3413~ PMC8~ 10.1~ A si~ Chen J, Sch~ MicroPubl B~ 2021         

one_letter <- "calyculin a" # could also be "mitomycin c"
europepmc::epmc_hits(query = one_letter)
#> [1] 3951
europepmc::epmc_search(query = one_letter)
#> 3971 records found, returning 100
#> # A tibble: 100 x 28
#>    id    source pmid  pmcid doi   title authorString journalTitle issue
#>    <chr> <chr>  <chr> <chr> <chr> <chr> <chr>        <chr>        <chr>
#>  1 3406~ MED    3406~ PMC8~ 10.3~ Eval~ Zastko L, R~ Int J Mol S~ 11   

two_words <- "physical activity" # could also be "cancer cells"
europepmc::epmc_hits(query = two_words)
#> [1] 1075740
europepmc::epmc_search(query = two_words)
#> 3407174 records found, returning 100
#> # A tibble: 100 x 29
#>    id    source pmid  doi   title authorString journalTitle issue journalVolume
#>    <chr> <chr>  <chr> <chr> <chr> <chr>        <chr>        <chr> <chr>        
#>  1 3338~ MED    3338~ 10.1~ Effe~ Willinger N~ J Phys Act ~ 1     18           

njahn82 commented 3 years ago

Thanks for reporting this. I think the discrepancy is due to the synonym query expansion that Europe PMC uses. When synonym search is disabled in epmc_search() with synonym = FALSE, both functions report the same number of records.

However, there seems to be a bug in epmc_hits(): synonym query expansion cannot be activated, although it should be possible. Until I fix the issue with epmc_hits(), the example at the bottom of the reprex shows how to access the number of results reported by epmc_search().

library(europepmc)
single <- "crispr"
europepmc::epmc_hits(query = single)
#> [1] 74474
europepmc::epmc_search(query = single, synonym = FALSE)
#> 74474 records found, returning 100
#> # A tibble: 100 x 28
#>    id     source pmid   pmcid doi    title    authorString    journalTitle issue
#>    <chr>  <chr>  <chr>  <chr> <chr>  <chr>    <chr>           <chr>        <chr>
#>  1 33336… MED    33336… PMC7… 10.10… Hepatic… Luo N, Li J, C… Drug Deliv   1    
#>  2 33904… MED    33904… PMC8… 10.10… Genomic… Yang F, Zhang … Virulence    1    
#>  3 PMC81… PMC    <NA>   PMC8… <NA>   A one-s… Li S, Huang J,… Talanta      <NA> 
#>  4 33277… MED    33277… <NA>  10.10… Next-Ge… Zeballos C MA,… Trends Biot… 7    
#>  5 34140… MED    34140… <NA>  10.10… CRISPR-… Zabrady K, Zab… Nat Commun   1    
#>  6 IND60… AGR    <NA>   <NA>  10.10… Analysi… Pujato S, Gall… Int Dairy J  <NA> 
#>  7 34152… MED    34152… <NA>  10.10… The CRI… Bire S, Buhan … CRISPR J     3    
#>  8 PMC81… PMC    <NA>   PMC8… <NA>   Point-o… Chen F, Lee P,… Biosens Bio… <NA> 
#>  9 34152… MED    34152… <NA>  10.10… Diversi… Balderston S, … CRISPR J     3    
#> 10 34102… MED    34102… <NA>  10.10… Complex… Khakimzhan A, … Phys Biol    <NA> 
#> # … with 90 more rows, and 19 more variables: journalVolume <chr>,
#> #   pubYear <chr>, journalIssn <chr>, pageInfo <chr>, pubType <chr>,
#> #   isOpenAccess <chr>, inEPMC <chr>, inPMC <chr>, hasPDF <chr>, hasBook <chr>,
#> #   hasSuppl <chr>, citedByCount <int>, hasReferences <chr>,
#> #   hasTextMinedTerms <chr>, hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> #   hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> #   firstPublicationDate <chr>

one_letter <- "calyculin a" # could also be "mitomycin c"
europepmc::epmc_hits(query = one_letter)
#> [1] 3951
europepmc::epmc_search(query = one_letter, synonym = FALSE)
#> 3951 records found, returning 100
#> # A tibble: 100 x 28
#>    id     source pmid   doi    title   authorString   journalTitle journalVolume
#>    <chr>  <chr>  <chr>  <chr>  <chr>   <chr>          <chr>        <chr>        
#>  1 33906… MED    33906… 10.10… Constr… Meenakshi C, … Appl Radiat… 173          
#>  2 34067… MED    34067… 10.33… Evalua… Zastko L, Rač… Int J Mol S… 22           
#>  3 32590… MED    32590… 10.10… A fast… Sun M, Moquet… J Radiol Pr… 40           
#>  4 33719… MED    33719… 10.11… Roles … Mukherjee A, … Am J Physio… 320          
#>  5 PPR22… PPR    <NA>   10.11… Tau pr… Shults NV, Se… <NA>         <NA>         
#>  6 33026… MED    33026… 10.10… Nonmus… Chinowsky CR,… Mol Biol Ce… 31           
#>  7 33125… MED    33125… 10.10… Kineto… Cordeiro MH, … J Cell Biol  219          
#>  8 32216… MED    32216… 10.16… A Simp… Sun M, Moquet… Radiat Res   193          
#>  9 33901… MED    33901… 10.10… PP1/PP… Maltsev AV, B… Biochem Bio… 558          
#> 10 34025… MED    34025… 10.71… Revers… Coarfa C, Gri… J Biomol Te… 32           
#> # … with 90 more rows, and 20 more variables: pubYear <chr>, journalIssn <chr>,
#> #   pageInfo <chr>, pubType <chr>, isOpenAccess <chr>, inEPMC <chr>,
#> #   inPMC <chr>, hasPDF <chr>, hasBook <chr>, hasSuppl <chr>,
#> #   citedByCount <int>, hasReferences <chr>, hasTextMinedTerms <chr>,
#> #   hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> #   hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> #   firstPublicationDate <chr>, pmcid <chr>, issue <chr>

two_words <- "physical activity" # could also be "cancer cells"
europepmc::epmc_hits(query = two_words)
#> [1] 1075765

europepmc::epmc_search(query = two_words, synonym = FALSE)
#> 1075765 records found, returning 100
#> # A tibble: 100 x 29
#>    id     source pmid  doi   title authorString journalTitle issue journalVolume
#>    <chr>  <chr>  <chr> <chr> <chr> <chr>        <chr>        <chr> <chr>        
#>  1 33383… MED    3338… 10.1… Effe… Willinger N… J Phys Act … 1     18           
#>  2 33491… MED    3349… 10.1… Pict… Spencer RA,… Int J Qual … 1     16           
#>  3 34075… MED    3407… 10.1… Phys… Alghamdi S,… Int J Qual … 1     16           
#>  4 33878… MED    3387… 10.1… Asso… O'Loughlin … J Health Ps… <NA>  <NA>         
#>  5 34112… MED    3411… 10.1… The … Cheval B, B… Exerc Sport… 3     49           
#>  6 PMC82… PMC    <NA>  <NA>  Perc… Aartolahti … Int J Envir… 11    18           
#>  7 33962… MED    3396… 10.1… Phys… Masquelier … Joint Bone … 5     88           
#>  8 33962… MED    3396… 10.1… Mult… Lecat CSY, … BMC Res Not… 1     14           
#>  9 33780… MED    3378… 10.1… Clas… Mavilidi MF… Acta Paedia… <NA>  <NA>         
#> 10 33373… MED    3337… 10.1… Can … Jakobsen DD… J Phys Act … 1     18           
#> # … with 90 more rows, and 20 more variables: pubYear <chr>, journalIssn <chr>,
#> #   pageInfo <chr>, pubType <chr>, isOpenAccess <chr>, inEPMC <chr>,
#> #   inPMC <chr>, hasPDF <chr>, hasBook <chr>, hasSuppl <chr>,
#> #   citedByCount <int>, hasReferences <chr>, hasTextMinedTerms <chr>,
#> #   hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> #   hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> #   firstPublicationDate <chr>, pmcid <chr>, versionNumber <int>

# if you need to store the hit count, please try the following:
my_hits <- europepmc::epmc_search(query = two_words, synonym = FALSE)
#> 1075765 records found, returning 100
attr(my_hits, "hit_count")
#> [1] 1075765
my_hits_synonym <- europepmc::epmc_search(query = two_words, synonym = TRUE)
#> 3407234 records found, returning 100
attr(my_hits_synonym, "hit_count")
#> [1] 3407234

Created on 2021-06-22 by the reprex package (v2.0.0)
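
For the original use case (downloading every record for a query), the "hit_count" workaround above could be wrapped in a small helper. This is only a sketch: fetch_all_records() and its search_fn argument are hypothetical names, not part of europepmc, and it assumes epmc_search() keeps attaching the "hit_count" attribute as shown in the reprex. The count can also change between the two calls, and for broad queries the full download may be very large.

```r
# Hypothetical helper (not part of europepmc): fetch every record for a query
# by reading the "hit_count" attribute from a first call to epmc_search().
# search_fn is an assumed extension point so the logic can be tried offline.
fetch_all_records <- function(query, synonym = TRUE,
                              search_fn = europepmc::epmc_search) {
  # The first call uses the default limit and carries the total as an attribute.
  probe <- search_fn(query = query, synonym = synonym)
  total <- attr(probe, "hit_count")

  # Nothing more to fetch if the count is missing or already covered.
  if (is.null(total) || total <= nrow(probe)) {
    return(probe)
  }

  # Second call with the same synonym setting, so the limit matches the count.
  search_fn(query = query, synonym = synonym, limit = total)
}

# Usage (needs network access and the europepmc package):
# all_records <- fetch_all_records("calyculin a", synonym = TRUE)
```

Because the count and the download come from the same function with the same synonym setting, the limit cannot fall short in the way it does when epmc_hits() supplies it.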