Closed Bubblbu closed 6 years ago
thx will have a look
@Bubblbu your code above is missing the `fl` and `fq` values, what are they?
oh, you're right! here it is:
pub_dates = paste0('publication_date:[2014-01-01T00:00:00Z TO 2014-12-31T23:59:59Z]')
journal = 'journal_key:PLoSONE'
doc_type = 'doc_type:full'
fl = 'id,publication_date,title,author'
fq = list(journal, pub_dates, doc_type)
there's also no `id` in your script, see it referenced within the for loop
aaah sry; updated the original code snippet...
@Bubblbu i'd forgotten that we do internal paging. so you can just do
pub_dates = paste0('publication_date:[2014-01-01T00:00:00Z TO 2014-12-31T23:59:59Z]')
journal = 'journal_key:PLoSONE'
doc_type = 'doc_type:full'
fl = 'id,publication_date,title,author'
fq = list(journal, pub_dates, doc_type)
numFound = searchplos(q = "*:*", fl = "id", fq = fq, limit = 0)$meta$numFound
searchplos(q = "*:*", fl = fl, fq = fq, limit = numFound)
#> $meta
#> # A tibble: 1 x 2
#> numFound start
#> <int> <int>
#> 1 1746003 19501
#>
#> $data
#> # A tibble: 20,000 x 4
#> id publication_date author title
#> <chr> <chr> <chr> <chr>
#> 1 10.1371/journal.pone.0030394/intro… 2012-01-23T00:00:… Wei-Yao Wang,Tzong-Shi Chiueh,Jun-Ren Sun,Shin-Ming Tsao,Jang-Jih Lu NA
#> 2 10.1371/journal.pone.0030394/resul… 2012-01-23T00:00:… Wei-Yao Wang,Tzong-Shi Chiueh,Jun-Ren Sun,Shin-Ming Tsao,Jang-Jih Lu NA
#> 3 10.1371/journal.pone.0002157/mater… 2008-05-14T00:00:… Markus Pfenninger,Carsten Nowak NA
#> 4 10.1371/journal.pone.0030394/suppo… 2012-01-23T00:00:… Wei-Yao Wang,Tzong-Shi Chiueh,Jun-Ren Sun,Shin-Ming Tsao,Jang-Jih Lu NA
#> 5 10.1371/journal.pone.0044137/mater… 2012-09-19T00:00:… Esmeralda Morillo,María Antonia Sánchez-Trujillo,José Ramón Moyano,Jaime Villaverde,María Eulalia Gómez-Pantoja,Jos… NA
#> 6 10.1371/journal.pone.0113465/mater… 2014-12-17T00:00:… Patrick Durez,Pierre Vandepapeliere,Pedro Miranda,Antoaneta Toncheva,Alberto Berman,Tatjana Kehler,Eugenia Mociran,… NA
#> 7 10.1371/journal.pone.0099112/intro… 2014-06-10T00:00:… Li Qi,Felix G Meinel,Chang Sheng Zhou,Yan E Zhao,U Joseph Schoepf,Long Jiang Zhang,Guang Ming Lu NA
#> 8 10.1371/journal.pone.0099112/resul… 2014-06-10T00:00:… Li Qi,Felix G Meinel,Chang Sheng Zhou,Yan E Zhao,U Joseph Schoepf,Long Jiang Zhang,Guang Ming Lu NA
#> 9 10.1371/journal.pone.0099112/mater… 2014-06-10T00:00:… Li Qi,Felix G Meinel,Chang Sheng Zhou,Yan E Zhao,U Joseph Schoepf,Long Jiang Zhang,Guang Ming Lu NA
#> 10 10.1371/journal.pone.0155488/title 2016-05-20T00:00:… Jirayu Tanprasertsuk,Binxing Li,Paul S Bernstein,Rohini Vishwanathan,Mary Ann Johnson,Leonard Poon,Elizabeth J John… NA
#> # ... with 19,990 more rows
Thanks! That resolves duplicate entries and is nicer too :)
Maybe it would be helpful to indicate in the documentation for `searchplos` that internal looping is implemented. I think that this threw me off and made me assume that manual looping is required:
start -- Record to start at (used in combination with limit when you need to cycle through more results than the max allowed=1000)
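For reference, the manual cycling that this parameter description implies can be sketched roughly as follows. This is only an illustration: `page_starts` is a hypothetical helper, not part of rplos, and the page size of 1000 is taken from the documentation text quoted above. Since `searchplos` pages internally, none of this should be needed in practice.

```r
# Hypothetical helper (not part of rplos): compute the zero-based `start`
# offsets needed to walk through `num_found` records `page_size` at a time.
page_starts <- function(num_found, page_size = 1000) {
  if (num_found <= 0) return(integer(0))
  seq(0, num_found - 1, by = page_size)
}

starts <- page_starts(31883)
length(starts)  # number of requests needed
range(starts)   # first and last offset

# Each offset would then be passed on as `start`, for example:
# pages <- lapply(starts, function(s) {
#   searchplos(q = "*:*", fl = fl, fq = fq, start = s, limit = 1000)$data
# })
```

If the offsets in a hand-written loop overlap, for instance because `start` advances by less than `limit`, the same records are fetched more than once, which is one possible source of duplicate DOIs.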
@Bubblbu can you close this issue and open a new one for the progress issue
I'm collecting all PLOS One articles for 2014.
"Found 31883 articles"
I then continue to extract and save DOI and some basic metadata.
Results
The resulting DF contains 31,883 entries, but 4,736 of them are duplicate DOIs (a single DOI is repeated up to 4 times).
Question
Is this a bug?
If not, can I simply deduplicate the results, or should I assume that I am missing certain entries? If so, what are the extra steps to re-query the missing data?
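To make the question concrete, here is a minimal sketch of the deduplication and the check for missing entries, in base R. The data frame and its DOIs are made up for illustration, and `num_found` stands in for whatever total the API reports; it is not output from a real query.

```r
# Toy data with made-up DOIs, standing in for the real results.
df <- data.frame(
  id = c("10.1371/example.0001", "10.1371/example.0002",
         "10.1371/example.0002", "10.1371/example.0003"),
  stringsAsFactors = FALSE
)

# Rows whose DOI already appeared earlier in the data frame.
n_dup <- sum(duplicated(df$id))

# Keep only the first occurrence of each DOI.
deduped <- df[!duplicated(df$id), , drop = FALSE]

# If the deduplicated row count is below the reported total, some records
# were never retrieved, so deduplication alone would not be enough.
num_found <- 4  # hypothetical total for this toy example
n_missing <- num_found - nrow(deduped)
```

A positive `n_missing` would suggest that the duplicates took the place of records that were never returned, rather than being extra rows on top of a complete result set.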
Session Info
```r
Session info ------------------------------------------------------------------
 setting  value
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, linux-gnu
 ui       RStudio (1.1.453)
 language en
 collate  en_US.UTF-8
 tz       America/Vancouver
 date     2018-07-16

Packages ----------------------------------------------------------------------
 package    * version date       source
 assertthat   0.1     2013-12-06 CRAN (R 3.3.1)
 colorspace   1.3-2   2016-12-14 CRAN (R 3.3.2)
 crul         0.5.2   2018-02-24 CRAN (R 3.4.4)
 curl         3.2     2018-03-28 CRAN (R 3.4.4)
 DBI          0.5-1   2016-09-10 CRAN (R 3.3.1)
 devtools     1.12.0  2016-12-05 CRAN (R 3.3.2)
 digest       0.6.12  2017-01-27 CRAN (R 3.3.2)
 dplyr        0.5.0   2016-06-24 CRAN (R 3.3.1)
 ggplot2      2.2.1   2016-12-30 CRAN (R 3.3.2)
 gtable       0.2.0   2016-02-26 CRAN (R 3.3.1)
 jsonlite     1.5     2017-06-01 cran (@1.5)
 lazyeval     0.2.0   2016-06-12 CRAN (R 3.3.1)
 lubridate    1.6.0   2016-09-13 CRAN (R 3.3.1)
 magrittr     1.5     2014-11-22 CRAN (R 3.3.1)
 memoise      1.0.0   2016-01-29 CRAN (R 3.3.2)
 munsell      0.4.3   2016-02-13 CRAN (R 3.3.1)
 pillar       1.2.2   2018-04-26 CRAN (R 3.4.4)
 plyr         1.8.4   2016-06-08 CRAN (R 3.3.1)
 R6           2.2.2   2017-06-17 cran (@2.2.2)
 Rcpp         0.12.16 2018-03-13 CRAN (R 3.4.4)
 reshape2     1.4.2   2016-10-22 CRAN (R 3.3.2)
 rlang        0.2.0   2018-02-20 CRAN (R 3.4.4)
 rplos      * 0.8.0   2017-11-03 CRAN (R 3.4.4)
 scales       0.4.1   2016-11-09 CRAN (R 3.3.2)
 solrium      1.0.0   2017-11-02 CRAN (R 3.4.4)
 stringi      1.1.2   2016-10-01 CRAN (R 3.3.2)
 stringr      1.2.0   2017-02-18 CRAN (R 3.3.2)
 tibble       1.4.2   2018-01-22 CRAN (R 3.4.4)
 whisker      0.3-2   2013-04-28 CRAN (R 3.3.1)
 withr        2.1.2   2018-03-15 CRAN (R 3.4.4)
 xml2         1.1.1   2017-01-24 CRAN (R 3.3.2)
 yaml         2.1.19  2018-05-01 CRAN (R 3.4.4)
```