ropensci-archive / rplos

:warning: ARCHIVED :warning: R client for the PLoS Journals API

Duplicate articles when collecting PLOS One articles for a year #119

Closed Bubblbu closed 6 years ago

Bubblbu commented 6 years ago

I'm collecting all PLOS One articles for 2014.

pub_dates = paste0('publication_date:[2014-01-01T00:00:00Z TO 2014-12-31T23:59:59Z]')
journal = 'journal_key:PLoSONE'
doc_type = 'doc_type:full'

fl = 'id,publication_date,title,author'
fq = list(journal, pub_dates, doc_type)

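# Request zero rows just to learn how many articles match, and keep the count for the loop below
numFound = searchplos(q="*:*", fl=fl, fq=fq, limit=0)$meta$numFound
print(paste("Found", numFound, "articles"))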

"Found 31883 articles"

I then continue to extract and save DOI and some basic metadata.

doi = c()
publication_date = c()
author = c()
title = c()

batch_size = 500  # assumed value; the original post does not say which batch size was used

for (i in seq(0, numFound, batch_size)) {
  r = searchplos(q="*:*", fl=fl, fq=fq, start=i, limit=batch_size, sleep=6)

  doi = c(doi, r$data$id)
  publication_date = c(publication_date, r$data$publication_date)
  author = c(author, r$data$author)
  title = c(title, r$data$title)

  print(paste("Processed", length(r$data$id), "entries | batch", i))
}

plos = data.frame(doi=doi, publication_date=publication_date, title=title, author=author)
write.csv(plos, "plos.csv", row.names = FALSE)

Results

The resulting data frame contains 31,883 rows, but 4,736 of the DOIs are duplicates (a single DOI is duplicated up to 4 times).

Question

Is this a bug?

If not, can I simply deduplicate the results, or should I assume that some entries are missing? If entries are missing, what extra steps are needed to re-query the missing data?
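Whether deduplication alone is enough can be checked by comparing the number of distinct DOIs with the count the API reported. The following is a minimal base R sketch on made-up data; `plos`, `num_found` and the DOIs are illustrative stand-ins, not values from this issue.

```r
# Toy stand-in for the collected results; these DOIs are made up.
plos <- data.frame(
  doi   = c("10.1371/a", "10.1371/b", "10.1371/a", "10.1371/c"),
  title = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)
num_found <- 5  # pretend the API reported 5 matching articles

# Number of rows whose DOI already appeared in an earlier row
n_dupes <- sum(duplicated(plos$doi))

# Keep only the first occurrence of each DOI
plos_unique <- plos[!duplicated(plos$doi), ]

# If fewer distinct DOIs came back than the API reported, records are missing
n_missing <- num_found - nrow(plos_unique)
```

Applied to the numbers above, and assuming the 4,736 figure counts repeated rows, only 27,147 distinct DOIs were collected out of 31,883, so deduplicating alone would leave a gap of the same size.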

Session Info

```r
Session info ---------------------------------------------------------
 setting  value
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, linux-gnu
 ui       RStudio (1.1.453)
 language en
 collate  en_US.UTF-8
 tz       America/Vancouver
 date     2018-07-16

Packages -------------------------------------------------------------
 package    * version date       source
 assertthat   0.1     2013-12-06 CRAN (R 3.3.1)
 colorspace   1.3-2   2016-12-14 CRAN (R 3.3.2)
 crul         0.5.2   2018-02-24 CRAN (R 3.4.4)
 curl         3.2     2018-03-28 CRAN (R 3.4.4)
 DBI          0.5-1   2016-09-10 CRAN (R 3.3.1)
 devtools     1.12.0  2016-12-05 CRAN (R 3.3.2)
 digest       0.6.12  2017-01-27 CRAN (R 3.3.2)
 dplyr        0.5.0   2016-06-24 CRAN (R 3.3.1)
 ggplot2      2.2.1   2016-12-30 CRAN (R 3.3.2)
 gtable       0.2.0   2016-02-26 CRAN (R 3.3.1)
 jsonlite     1.5     2017-06-01 cran (@1.5)
 lazyeval     0.2.0   2016-06-12 CRAN (R 3.3.1)
 lubridate    1.6.0   2016-09-13 CRAN (R 3.3.1)
 magrittr     1.5     2014-11-22 CRAN (R 3.3.1)
 memoise      1.0.0   2016-01-29 CRAN (R 3.3.2)
 munsell      0.4.3   2016-02-13 CRAN (R 3.3.1)
 pillar       1.2.2   2018-04-26 CRAN (R 3.4.4)
 plyr         1.8.4   2016-06-08 CRAN (R 3.3.1)
 R6           2.2.2   2017-06-17 cran (@2.2.2)
 Rcpp         0.12.16 2018-03-13 CRAN (R 3.4.4)
 reshape2     1.4.2   2016-10-22 CRAN (R 3.3.2)
 rlang        0.2.0   2018-02-20 CRAN (R 3.4.4)
 rplos      * 0.8.0   2017-11-03 CRAN (R 3.4.4)
 scales       0.4.1   2016-11-09 CRAN (R 3.3.2)
 solrium      1.0.0   2017-11-02 CRAN (R 3.4.4)
 stringi      1.1.2   2016-10-01 CRAN (R 3.3.2)
 stringr      1.2.0   2017-02-18 CRAN (R 3.3.2)
 tibble       1.4.2   2018-01-22 CRAN (R 3.4.4)
 whisker      0.3-2   2013-04-28 CRAN (R 3.3.1)
 withr        2.1.2   2018-03-15 CRAN (R 3.4.4)
 xml2         1.1.1   2017-01-24 CRAN (R 3.3.2)
 yaml         2.1.19  2018-05-01 CRAN (R 3.4.4)
```

sckott commented 6 years ago

thx will have a look

sckott commented 6 years ago

@Bubblbu your code above is missing the fl and fq values, what are they?

Bubblbu commented 6 years ago

oh, you're right! here it is:

pub_dates = paste0('publication_date:[2014-01-01T00:00:00Z TO 2014-12-31T23:59:59Z]')
journal = 'journal_key:PLoSONE'
doc_type = 'doc_type:full'

fl = 'id,publication_date,title,author'
fq = list(journal, pub_dates, doc_type)

sckott commented 6 years ago

there's also no id in your script, see it referenced within the for loop

Bubblbu commented 6 years ago

aaah sry; updated the original code snippet...

sckott commented 6 years ago

@Bubblbu i'd forgotten that we do internal paging. so you can just do

pub_dates = paste0('publication_date:[2014-01-01T00:00:00Z TO 2014-12-31T23:59:59Z]')
journal = 'journal_key:PLoSONE'
doc_type = 'doc_type:full'
fl = 'id,publication_date,title,author'
fq = list(journal, pub_dates, doc_type)
numFound = searchplos(q = "*:*", fl = fl, fq = fq, limit = 0)$meta$numFound
searchplos(q = "*:*", fl = fl, fq = fq, limit = numFound)

#> $meta
#> # A tibble: 1 x 2
#>   numFound start
#>      <int> <int>
#> 1  1746003 19501
#> 
#> $data
#> # A tibble: 20,000 x 4
#>    id                                  publication_date   author                                                                                                               title
#>    <chr>                               <chr>              <chr>                                                                                                                <chr>
#>  1 10.1371/journal.pone.0030394/intro… 2012-01-23T00:00:… Wei-Yao Wang,Tzong-Shi Chiueh,Jun-Ren Sun,Shin-Ming Tsao,Jang-Jih Lu                                                 NA
#>  2 10.1371/journal.pone.0030394/resul… 2012-01-23T00:00:… Wei-Yao Wang,Tzong-Shi Chiueh,Jun-Ren Sun,Shin-Ming Tsao,Jang-Jih Lu                                                 NA
#>  3 10.1371/journal.pone.0002157/mater… 2008-05-14T00:00:… Markus Pfenninger,Carsten Nowak                                                                                      NA
#>  4 10.1371/journal.pone.0030394/suppo… 2012-01-23T00:00:… Wei-Yao Wang,Tzong-Shi Chiueh,Jun-Ren Sun,Shin-Ming Tsao,Jang-Jih Lu                                                 NA
#>  5 10.1371/journal.pone.0044137/mater… 2012-09-19T00:00:… Esmeralda Morillo,María Antonia Sánchez-Trujillo,José Ramón Moyano,Jaime Villaverde,María Eulalia Gómez-Pantoja,Jos… NA
#>  6 10.1371/journal.pone.0113465/mater… 2014-12-17T00:00:… Patrick Durez,Pierre Vandepapeliere,Pedro Miranda,Antoaneta Toncheva,Alberto Berman,Tatjana Kehler,Eugenia Mociran,… NA
#>  7 10.1371/journal.pone.0099112/intro… 2014-06-10T00:00:… Li Qi,Felix G Meinel,Chang Sheng Zhou,Yan E Zhao,U Joseph Schoepf,Long Jiang Zhang,Guang Ming Lu                     NA
#>  8 10.1371/journal.pone.0099112/resul… 2014-06-10T00:00:… Li Qi,Felix G Meinel,Chang Sheng Zhou,Yan E Zhao,U Joseph Schoepf,Long Jiang Zhang,Guang Ming Lu                     NA
#>  9 10.1371/journal.pone.0099112/mater… 2014-06-10T00:00:… Li Qi,Felix G Meinel,Chang Sheng Zhou,Yan E Zhao,U Joseph Schoepf,Long Jiang Zhang,Guang Ming Lu                     NA
#> 10 10.1371/journal.pone.0155488/title  2016-05-20T00:00:… Jirayu Tanprasertsuk,Binxing Li,Paul S Bernstein,Rohini Vishwanathan,Mary Ann Johnson,Leonard Poon,Elizabeth J John… NA
#> # ... with 19,990 more rows

Bubblbu commented 6 years ago

Thanks! That resolves duplicate entries and is nicer too :)

Bubblbu commented 6 years ago

Maybe it would be good to indicate in the documentation for searchplos that internal looping is implemented. I think the following description of start threw me off and made me assume that manual looping is required:

start -- Record to start at (used in combination with limit when you need to cycle through more results than the max allowed=1000)
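If manual paging with start and limit is still wanted, it could look roughly like the sketch below. It rests on an assumption this thread does not confirm: that the duplicates came from the result order changing between requests, which a fixed sort on a unique field such as id would avoid. `fetch_all` and its `search_fun` argument are hypothetical names, not part of rplos; `search_fun` is expected to accept the same q, fl, fq, sort, start and limit arguments as searchplos and to return a list with `meta` and `data` elements.

```r
# Hypothetical helper (not part of rplos): page through results manually.
# `search_fun` is assumed to behave like rplos::searchplos().
fetch_all <- function(search_fun, q, fl, fq, batch_size = 500, sort = "id asc") {
  # Ask for zero rows first, only to learn the total number of matches
  total <- search_fun(q = q, fl = fl, fq = fq, limit = 0)$meta$numFound
  if (total == 0) return(NULL)

  # Zero-based offsets: 0, batch_size, 2 * batch_size, ...
  starts <- seq(0, total - 1, by = batch_size)

  # Fetch each page with the same sort so the pages do not overlap
  # (assumption: an unstable order was the source of the duplicates)
  pages <- lapply(starts, function(s) {
    search_fun(q = q, fl = fl, fq = fq, sort = sort,
               start = s, limit = batch_size)$data
  })

  # Stack the pages into a single data frame
  do.call(rbind, pages)
}
```

With rplos loaded and the fl and fq values from above, a call would be `plos <- fetch_all(searchplos, q = "*:*", fl = fl, fq = fq)`. Letting searchplos page internally, as suggested earlier in the thread, remains the simpler route.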

sckott commented 6 years ago

@Bubblbu can you close this issue and open a new one for the progress issue