Open Dobrokhotov1989 opened 3 years ago
Thanks for reporting this. I think the discrepancy is due to the synonym query expansion that Europe PMC uses. When the synonym search is disabled in epmc_search() using synonym = FALSE, both functions retrieve the same number of records.
However, there seems to be a bug in epmc_hits(): synonym query expansion cannot be activated, although it should be possible. At the bottom of the reprex, there is an example of how you can access the number of results returned by epmc_search() until I fix the issue with epmc_hits().
library(europepmc)
single <- "crispr"
europepmc::epmc_hits(query = single)
#> [1] 74474
europepmc::epmc_search(query = single, synonym = FALSE)
#> 74474 records found, returning 100
#> # A tibble: 100 x 28
#> id source pmid pmcid doi title authorString journalTitle issue
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 33336… MED 33336… PMC7… 10.10… Hepatic… Luo N, Li J, C… Drug Deliv 1
#> 2 33904… MED 33904… PMC8… 10.10… Genomic… Yang F, Zhang … Virulence 1
#> 3 PMC81… PMC <NA> PMC8… <NA> A one-s… Li S, Huang J,… Talanta <NA>
#> 4 33277… MED 33277… <NA> 10.10… Next-Ge… Zeballos C MA,… Trends Biot… 7
#> 5 34140… MED 34140… <NA> 10.10… CRISPR-… Zabrady K, Zab… Nat Commun 1
#> 6 IND60… AGR <NA> <NA> 10.10… Analysi… Pujato S, Gall… Int Dairy J <NA>
#> 7 34152… MED 34152… <NA> 10.10… The CRI… Bire S, Buhan … CRISPR J 3
#> 8 PMC81… PMC <NA> PMC8… <NA> Point-o… Chen F, Lee P,… Biosens Bio… <NA>
#> 9 34152… MED 34152… <NA> 10.10… Diversi… Balderston S, … CRISPR J 3
#> 10 34102… MED 34102… <NA> 10.10… Complex… Khakimzhan A, … Phys Biol <NA>
#> # … with 90 more rows, and 19 more variables: journalVolume <chr>,
#> # pubYear <chr>, journalIssn <chr>, pageInfo <chr>, pubType <chr>,
#> # isOpenAccess <chr>, inEPMC <chr>, inPMC <chr>, hasPDF <chr>, hasBook <chr>,
#> # hasSuppl <chr>, citedByCount <int>, hasReferences <chr>,
#> # hasTextMinedTerms <chr>, hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> # hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> # firstPublicationDate <chr>
one_letter <- "calyculin a" # also might be "mitomycin c"
europepmc::epmc_hits(query = one_letter)
#> [1] 3951
europepmc::epmc_search(query = one_letter, synonym = FALSE)
#> 3951 records found, returning 100
#> # A tibble: 100 x 28
#> id source pmid doi title authorString journalTitle journalVolume
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 33906… MED 33906… 10.10… Constr… Meenakshi C, … Appl Radiat… 173
#> 2 34067… MED 34067… 10.33… Evalua… Zastko L, Rač… Int J Mol S… 22
#> 3 32590… MED 32590… 10.10… A fast… Sun M, Moquet… J Radiol Pr… 40
#> 4 33719… MED 33719… 10.11… Roles … Mukherjee A, … Am J Physio… 320
#> 5 PPR22… PPR <NA> 10.11… Tau pr… Shults NV, Se… <NA> <NA>
#> 6 33026… MED 33026… 10.10… Nonmus… Chinowsky CR,… Mol Biol Ce… 31
#> 7 33125… MED 33125… 10.10… Kineto… Cordeiro MH, … J Cell Biol 219
#> 8 32216… MED 32216… 10.16… A Simp… Sun M, Moquet… Radiat Res 193
#> 9 33901… MED 33901… 10.10… PP1/PP… Maltsev AV, B… Biochem Bio… 558
#> 10 34025… MED 34025… 10.71… Revers… Coarfa C, Gri… J Biomol Te… 32
#> # … with 90 more rows, and 20 more variables: pubYear <chr>, journalIssn <chr>,
#> # pageInfo <chr>, pubType <chr>, isOpenAccess <chr>, inEPMC <chr>,
#> # inPMC <chr>, hasPDF <chr>, hasBook <chr>, hasSuppl <chr>,
#> # citedByCount <int>, hasReferences <chr>, hasTextMinedTerms <chr>,
#> # hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> # hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> # firstPublicationDate <chr>, pmcid <chr>, issue <chr>
two_words <- "physical activity" #also might be "cancer cells"
europepmc::epmc_hits(query = two_words)
#> [1] 1075765
europepmc::epmc_search(query = two_words, synonym = FALSE)
#> 1075765 records found, returning 100
#> # A tibble: 100 x 29
#> id source pmid doi title authorString journalTitle issue journalVolume
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 33383… MED 3338… 10.1… Effe… Willinger N… J Phys Act … 1 18
#> 2 33491… MED 3349… 10.1… Pict… Spencer RA,… Int J Qual … 1 16
#> 3 34075… MED 3407… 10.1… Phys… Alghamdi S,… Int J Qual … 1 16
#> 4 33878… MED 3387… 10.1… Asso… O'Loughlin … J Health Ps… <NA> <NA>
#> 5 34112… MED 3411… 10.1… The … Cheval B, B… Exerc Sport… 3 49
#> 6 PMC82… PMC <NA> <NA> Perc… Aartolahti … Int J Envir… 11 18
#> 7 33962… MED 3396… 10.1… Phys… Masquelier … Joint Bone … 5 88
#> 8 33962… MED 3396… 10.1… Mult… Lecat CSY, … BMC Res Not… 1 14
#> 9 33780… MED 3378… 10.1… Clas… Mavilidi MF… Acta Paedia… <NA> <NA>
#> 10 33373… MED 3337… 10.1… Can … Jakobsen DD… J Phys Act … 1 18
#> # … with 90 more rows, and 20 more variables: pubYear <chr>, journalIssn <chr>,
#> # pageInfo <chr>, pubType <chr>, isOpenAccess <chr>, inEPMC <chr>,
#> # inPMC <chr>, hasPDF <chr>, hasBook <chr>, hasSuppl <chr>,
#> # citedByCount <int>, hasReferences <chr>, hasTextMinedTerms <chr>,
#> # hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> # hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> # firstPublicationDate <chr>, pmcid <chr>, versionNumber <int>
# if you need to store the hit count, please try the following:
my_hits <- europepmc::epmc_search(query = two_words, synonym = FALSE)
#> 1075765 records found, returning 100
attr(my_hits, "hit_count")
#> [1] 1075765
my_hits_synonym <- europepmc::epmc_search(query = two_words, synonym = TRUE)
#> 3407234 records found, returning 100
attr(my_hits_synonym, "hit_count")
#> [1] 3407234
Created on 2021-06-22 by the reprex package (v2.0.0)
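For the use case described in this issue (using the hit count as the limit so that all records are downloaded), the hit_count workaround above could be wrapped as in the following sketch. The helper names get_hit_count() and fetch_all() and the search_fun argument are hypothetical and not part of europepmc. The sketch assumes that epmc_search() attaches the hit_count attribute even when limit = 1, and that the query has at least one hit.

```r
# Hypothetical helper (not part of europepmc): read the total hit count that
# epmc_search() attaches to its result as the "hit_count" attribute.
get_hit_count <- function(res) {
  n <- attr(res, "hit_count")
  if (is.null(n)) {
    stop("result carries no 'hit_count' attribute")
  }
  as.integer(n)
}

# Hypothetical wrapper: probe with a one-record search to learn the hit count,
# then repeat the search with that count as the limit. The same `synonym`
# setting is used for both calls, so the count and the download agree.
# `search_fun` defaults to europepmc::epmc_search; the default is only
# evaluated when no other function is supplied.
fetch_all <- function(query, synonym = TRUE,
                      search_fun = europepmc::epmc_search) {
  probe <- search_fun(query = query, synonym = synonym, limit = 1)
  n_hits <- get_hit_count(probe)
  search_fun(query = query, synonym = synonym, limit = n_hits)
}
```

With europepmc installed, a call such as fetch_all("calyculin a") would then request every record for that query. For very large result sets (for example "physical activity"), downloading everything in one go may take a long time.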
Hi, I noticed a small discrepancy between epmc_hits() and epmc_search(): with an identical query, they return a different number of found records. Do these functions manipulate query arguments differently? epmc_hits() systematically underestimates the number of records, but it matches the number of records returned on the EuropePMC website. In my script, I use epmc_hits() to set a limit for epmc_search(), i.e. to download all records. However, with that difference in function behavior, I will systematically lose some records.
Here is a reprex: