ropensci / software-review

rOpenSci Software Peer Review.

Submission: jstor #189

Closed tklebel closed 6 years ago

tklebel commented 6 years ago

Summary

The tool Data for Research (DfR) by JSTOR is a valuable source for citation analysis and text mining. jstor provides functions and suggests workflows for importing datasets from DfR.

Package: jstor
Title: Read Data from JSTOR/DfR
Version: 0.2.6
Authors@R: person("Thomas", "Klebel", email = "thomas.klebel@uni-graz.at", 
                  role = c("aut", "cre"))
Description: Functions and helpers to import metadata and full-texts delivered
    by Data for Research (DfR) by JSTOR. 
Depends: R (>= 3.1)
License: GPL-3 | file LICENSE
Encoding: UTF-8
LazyData: true
Imports: 
    dplyr,
    purrr,
    xml2,
    magrittr,
    stringr,
    readr,
    tibble,
    rlang,
    foreach,
    doSNOW,
    snow
Suggests: testthat,
    covr,
    knitr,
    rmarkdown,
    tidyr
BugReports: https://github.com/tklebel/jstor/issues
URL: https://github.com/tklebel/jstor, https://tklebel.github.io/jstor/
RoxygenNote: 6.0.1
Roxygen: list(markdown = TRUE)
VignetteBuilder: knitr

The package should fit very well into the data-extraction category, because it extracts data from *.xml files for further use.

The target audience would be scientists leveraging JSTOR/DfR for textual research. Currently there is no implementation in R to parse the metadata they deliver. Although the data quality on references is not very consistent, one could try to conduct citation analysis with a lot of control over the sample and the data compared to Google Scholar or Web of Science.

No.

Pre-submission was approved by @karthik: https://github.com/ropensci/onboarding/issues/186

Requirements

Confirm each of the following by checking the box. This package:

Publication options


I would very much like to submit a paper to JOSS, however at the moment I don't have a proper case-study yet. If possible, I would like to submit a paper at a later stage.


Detail

The only "problems" from goodpractice::gp() are a test coverage of 98% and lengthy lines in the test-files.

If he could find the time, a technical review by @jimhester could probably help in resolving some performance issues, although he might urge me to re-write the whole thing (again).

karthik commented 6 years ago

πŸ‘‹ @tklebel

Thank you for this submission. During your pre-submission review I completely missed that we had a similar submission a year ago (https://github.com/ropensci/onboarding/issues/86) by @benmarwick.

Since that submission was stalled, we are good to proceed with yours. I checked in with Ben on this, and his suggestion was for you (and reviewers) to look through his work to see if there are ideas that might help your effort. So please take a look at https://github.com/benmarwick/JSTORr

Editor checks:


Editor comments

You are already aware of the goodpractice results, but here they are anyway.

── GP jstor ────────────────────────────────────────────────────────────────────

It is good practice to

  βœ– write unit tests for all functions, and all package code
    in general. 97% of code lines are covered by test cases.

    R/batch_import_fun.R:21:NA
    R/batch_import_fun.R:22:NA
    R/batch_import_fun.R:23:NA
    R/example.R:15:NA
    R/example.R:16:NA
    ... and 3 more lines

  βœ– avoid long code lines, it is bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try make your lines shorter than 80 characters

    tests/testthat/test-article.R:75:1
    tests/testthat/test-author-import.R:35:1
    tests/testthat/test-author-import.R:36:1
    tests/testthat/test-author-import.R:40:1
    tests/testthat/test-author-import.R:41:1
    ... and 19 more lines

  βœ– fix this R CMD check WARNING: LaTeX errors when creating
    PDF version. This typically indicates Rd problems.
  βœ– fix this R CMD check ERROR: Re-running with no
    redirection of stdout/stderr. Hmm ... looks like a package You may
    want to clean up by 'rm -rf /tmp/Rtmpq9Jt37/Rd2pdf1b207a3eca4b'
────────────────────────────────────────────────────────────────────────────────

I would very much like to submit a paper to JOSS, however at the moment I don't have a proper case-study yet. If possible, I would like to submit a paper at a later stage.

Not a problem. There is no requirement that both be timed together. JOSS submission can be made independently at anytime.


Reviewer 1: @jsonbecker Due date: Feb 23, 2018

Reviewer 2: @rileymsmith19 Due date: March 1

tklebel commented 6 years ago

Thanks @karthik!

I cannot reproduce your errors and warnings from R CMD check, neither on Travis, nor on my machine, nor on AppVeyor or win-builder. Could you help me debug them, or should we ignore them?

You put my submission into topic:data-publication, whereas I initially thought that topic:data-extraction would be more suitable. Did you change your opinion?

Before my pre-submission, I looked into @benmarwick's package, and read the review by @kbenoit thoroughly. I am reluctant to add features similar to those currently implemented in JSTORr, because the same critique would apply. In general, I think it would be good to keep jstor rather simple and, in a way, agnostic to what the user does with data from JSTOR (analysing n-grams, conducting topic modeling, etc.). Nevertheless, it would probably be beneficial to show a bit more clearly what can be done with jstor. I am therefore working on a simple case study to illustrate a typical use case, which includes filtering metadata in combination with the analysis of n-grams.

karthik commented 6 years ago

Thanks for catching my mistake with the wrong tag. I have fixed it.

Re. the issues from R CMD CHECK, I'll look again when I have a bit of time, but let's proceed since the reviewers will also be able to see if it's a replicable issue.

For long lines, you can add # nolint at the end of those lines (https://github.com/jimhester/lintr#project-configuration); this way lintr will not flag them.
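
A minimal illustration (the variable name and string below are made up, not from your actual test files):

# the trailing marker tells lintr to skip the long-line check for this line only
long_title <- "An intentionally long string used only to illustrate how a trailing marker suppresses lintr's long-line warning" # nolint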

And lastly, I appreciate that you have looked into Ben's work and the relevant feedback on the package. I agree that it would be very helpful to include some use cases in your package.

tklebel commented 6 years ago

I added # nolint to the relevant lines, so lintr now ignores them properly.

In the meantime I wrote an initial version of a case study, which is live here: https://tklebel.github.io/jstor/articles/analysing-n-grams.html. In order to host this case study, which cannot be included as a vignette, I went ahead and added pkgdown.

Finally, I somehow mistook the JOSS for the JSS. Since JOSS only requires a very short introductory paper, I included one in jstor and would like to submit to JOSS via the fast-track option, in case the review goes well.

karthik commented 6 years ago

Both reviewers are now assigned. Thanks @rileymsmith19 and @jsonbecker πŸ™!

tklebel commented 6 years ago

Since I made a few improvements over the last few days, I created a new version (0.2.6) and made a snapshot at Zenodo. I updated my initial post accordingly and will try to avoid any further changes until the reviews are in.

maelle commented 6 years ago

I think the correct Github username for @rileymsmith19 is @EccRiley 😺

EccRiley commented 6 years ago

You are correct, @maelle, Thanks! Sorry for any confusion, @karthik !

jsonbecker commented 6 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

When reading the descriptions of find_article, find_authors, etc., I don't have confidence about which should return more than one result. Based on the pluralization, I think the expectation is that the XML always contains one article, which may have multiple authors, multiple references, and multiple footnotes. This becomes clearer when reading all of the examples, but if so, I think it would be worth specifying up front whether the base XML will always have one article or more than one. Additionally, I find the kable output in the vignette appealing, and the swap from find_article and find_authors, which use that output, to just printing tables is a little jarring for me. I'd probably just use kable in all the examples.
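
For instance, a sketch of what a consistent presentation could look like in a vignette chunk (the file path is hypothetical, and this assumes find_article() takes the path to a single DfR metadata .xml file):

library(jstor)
library(knitr)

# hypothetical metadata file from a DfR download
meta <- find_article("dfr-data/metadata/journal-article-10.2307_12345678.xml")
kable(meta)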

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing:

2–2.5 hrs


Review Comments

tklebel commented 6 years ago

Thanks @jsonbecker for the thorough review! I'm eager to respond to the more substantial questions (csv vs sqlite and file_name vs other_identifier), but I think I will wait for the review by @EccRiley, so I can address all comments in one go.

jsonbecker commented 6 years ago

Absolutely. I plan to add a few more overall thoughts, but with the deadline looming I really wanted to get it in! I realize my review focused heavily on a suggestion that may have limited value and reflects a bit of my own preferences, but that's largely because everything else is in great shape!


tklebel commented 6 years ago

I'm still eager to work on/discuss some changes, but wanted to wait for the second review. Do you happen to know until when you will be able to finish your review, @EccRiley?

sckott commented 6 years ago

@eccriley are you still able to do the review?

karthik commented 6 years ago

@tklebel Sorry for the delays. Looks like @EccRiley hasn't responded in a while. I am seeking another reviewer to replace her.

tklebel commented 6 years ago

@karthik no worries. @elinw seems to have been trying out the package, maybe she would be willing to do a review.

elinw commented 6 years ago

I have been, yes. I could do a review. I also have some data if @jsonbecker would like to try it.

karthik commented 6 years ago

@elinw Fantastic! Thank you for agreeing to review. Assigning @elinw as second reviewer

karthik commented 6 years ago

@elinw Gentle ping checking in on your review. πŸ™

elinw commented 6 years ago

@karthik On it, should finish it up this weekend.

elinw commented 6 years ago

This is still a draft

Package Review

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

JSTOR is a really important data source for the study of scholarly publication on issues. JSTOR as a data source has wide coverage over many disciplines and also includes sources going back in time. As a data source JSTOR has some challenges and features that are a bit different than others. First, there is not an API in the form of a REST interface. Instead, researchers fill out a web form and request the results of particular queries. Then they are emailed a zip file containing potentially thousands (or more) of individual files including the xml data for the item and the n-gram data for that item in separate files for unigrams, bigrams and trigrams (assuming all of these were requested). These are organized into a deeply nested file structure that ends with a folder for each item (article or book). Obtaining full text data is a separate process involving a user agreement. While not fundamentally different than working with a web API that returns xml, it definitely feels different to the end user.

All of this is to say that it is extremely helpful to have a package to help with managing this. Specifically, the jstor package reads the data in from all of the individual files and separates it by type of content (articles, authors, references, footnotes). It then writes the results to CSV files. I think this package will encourage people to use this data source more.

The package itself is solid, test coverage is good, documentation is complete, and good practices are followed. It does exactly what it says and works as described. The code seemed fine, and in particular the way it made batch size and number of cores easy to specify was helpful to me. I appreciated the single interface to all the data types.

The README is helpful in defining the structure and consistency of the data.

Comments

These are some additional comments for consideration.

I have a JSTOR basic file (not full text) from my research that I used in my testing. It contains 1678 items, 42051 references, and 12376335 bigrams. I did my testing mainly on RStudio Server (open source) installed on CentOS with 4 cores, although I also did some on my MacBook Air. I will say that I was happy to have a package that installed easily.

At first I was surprised that the data were written to csv files. However, after some thought I feel it is very useful to have this as the main outcome, as it is then up to the user to decide what kind of storage to use for further processing. I found it useful to have the stored files to return to. For me, simply having a way to approach the mass of individual item files and get them into rectangles was a huge win, especially having the processing issues handled (meaning otherwise I would probably have just created a much less smart script for reading the files recursively).

One thing I will mention is that for my data I could not use read_csv() without generating warnings, but read.csv() worked fine. It might be helpful to include a discussion of storage options for the rectangular data. I also don't think this package requires documentation of all aspects of a text analysis. What is more important is to focus on preparing the data for clean importing into other text-analysis packages and the storage options they support. I say that, but I followed the instructions in the bigrams/tidytext vignette pretty exactly and found them incredibly helpful. The JSTOR data (basic files) are not really like the examples in the tidytext book or most other documentation for text analysis, since the full texts are not there (and the n-grams are already created). So I won't discourage doing more of this as you go forward with maintaining the package, especially for known quirks of JSTOR (see below). I also appreciated the focus on getting a simple set of descriptive graphs.

The vignettes are helpful and the new one in particular is a nice overview of the process of cleaning and analyzing the ngram data. However, I think it would be stronger if it discussed some of the issues specific to the JSTOR data. For example, in working with the package I ended up creating a list of what I think of as my JSTOR stop words because academic writing is a specialized language.

For example for the references data I have:

stop_ref1 <- c("", "References", "Notes", "Note", "Ibid.", "Bibliography", 
               "Loc. cit.", "Id.", "Foonotes", "Ibid.", "Endnotes", 
               "[Bibliography]", "Notes and References", "Works Cited",
               "Footnotes","Ibid", "[References]", "Statutes Cited",
               "Literatur", "Case Cited", "Footnote", "Ibid", 
               "Cases Cited", "[Endnotes]", "References", "Reference"
               )

And for the bigrams I had this.

custom_stop_words <- c("ibid", "supra", "pp", "eg", "alia", "hoc", 
                       "e.g.")

(It is very easy to add custom stop words; I could figure out how just from following the vignette and realizing I could add my vectors to those in tidytext.) It could be useful to provide a more complete version of those in some way.
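
A rough sketch of that approach (the tibble of tokens, here unnested_ngrams with a word column, is hypothetical; stop_words comes from tidytext):

library(dplyr)
library(tidytext)

# combine the custom JSTOR stop words with tidytext's built-in list
jstor_stop_words <- tibble(word = custom_stop_words, lexicon = "jstor")
all_stop_words <- bind_rows(stop_words, jstor_stop_words)

# drop them from a (hypothetical) tibble of tokens with a `word` column
cleaned <- anti_join(unnested_ngrams, all_stop_words, by = "word")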

I also did not necessarily want to get rid of all numbers, but one kind of troublesome number was e.g. 1998b. It would probably be easy to write a function for that, though I ended up just listing out the ones I found.
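
Such a function could be as small as a single regular expression; a sketch, continuing with the hypothetical cleaned tibble from above:

library(dplyr)
library(stringr)

# drop tokens that look like a four-digit year, optionally followed by a letter
# (e.g. "1998" or "1998b")
cleaned <- filter(cleaned, !str_detect(word, "^[0-9]{4}[a-z]?$"))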

Another example: with JSTOR, unless you specify otherwise, you get articles in multiple languages. The language names are not consistently coded, so I used dplyr::filter(language == "en" | language == "eng"). There are various things like that which you might want to mention.

I do not have a problem with using the baseid as the main ID. This potentially lets me work with other interfaces to JSTOR, is consistently produced over time and is unique. If I have data from separate JSTOR queries it lets me find overlap easily.

Final approval (post-review)

Estimated hours spent reviewing: 5

Review Comments

elinw commented 6 years ago

Oh, I meant to mention that the vignette might want to note that installing igraph can be more complex than installing a typical package.

elinw commented 6 years ago

Okay one more thing about jstor idiosyncrasies. This is something I realized today (and relates to our prior discussion about end notes turning up in references). Reference lists/bibliographies have each source once, while endnotes may repeat the same source many times. Therefore simply running a frequency on the sources will give you distorted results (I had a surprising article turn up as most cited, but it turned out that it was one article with end notes that cited it multiple times). So I think that if you were willing, you could start a vignette that addresses some of these things and let users contribute strategies and code snippets for handling. Or you could write it yourself since you have a lot of experience, but I could see that getting a bit overwhelming unless you decide to write an article about this.

tklebel commented 6 years ago

Response to reviews

Thank you @jsonbecker and @elinw for your reviews and the helpful feedback! Although both of you mention that your reviews are not finished, I will start working on your comments, so we can hopefully finish the onboarding before useR! (where I'll give a presentation about the package).

There are a couple of requests for changes, and many thoughts for further consideration from you. I will outline my plan to address all issues in this post, and then work through them over the next two weeks.

From your feedback, I can distill three main areas of concern:

  1. How data are exported after they have been brought into rectangular form, and how these data can later be re-imported.
  2. The issue of using "basename_id" as the unique identifier.
  3. Known quirks which are inherent to data from JSTOR/DfR and how to deal with them.

I will discuss each of the topics, and then proceed to the remaining comments.

Data export and re-import

Depending on the size of the dataset under consideration, it might be worthwhile to put the converted data into some sort of SQL database. The package was actually designed to help with this. I separated the extraction functions in a way that saves space on disk, because metadata does not have to be repeated for each reference an article has, etc. So the initial design already planned for this structure of separate tables, which can easily be joined with dplyr. As you both agree, exporting to .csv might be a good lowest common denominator. However, I would probably prefer not to implement the capability to write directly to SQLite in jstor_import, since I find it difficult to get the balance right on how much control users should have: if there is the option to write to SQLite, should there be options for creating new tables, appending, overwriting, etc.?

That being said, I spent the last week implementing a new function (jst_import_zip), which lets you import files directly from the zip archive. Users can specify which parts they want to be converted via which functions (like article = c(find_article, find_references), ngram1 = jst_read_ngram). I hope this function will make it a lot easier to deal with the multiple different sources which are available from JSTOR, and it removes the need to extract the zip (which saves quite some disk space). The current output of this function is multiple csv files, with different file names referring to the functions used. I could very well imagine adding the option to write the output to a single SQLite db. This would have the elegance of going from one zip file directly to one .sqlite file, without the hassle of any intermediate files (I don't know how I would solve the case where you have multiple zip archives because the data were too large for one archive (they split up special requests into files of max 1.5 GB)). I would be happy about feedback on this idea! Since the reviews were initially for a version without jst_import_zip, I would prefer to keep the scope as is and consider jst_import_zip and related issues as an area for future improvements.
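
For what it's worth, a user could already get from the current .csv output to a single database with a few lines of DBI/RSQLite; a sketch (file and table names are hypothetical):

library(DBI)
library(RSQLite)
library(readr)

con <- dbConnect(RSQLite::SQLite(), "dfr.sqlite")

# hypothetical batch files, as written by jstor_import()
articles   <- read_csv("out/articles-1.csv", guess_max = 3000)
references <- read_csv("out/references-1.csv", guess_max = 3000)

dbWriteTable(con, "articles", articles, overwrite = TRUE)
dbWriteTable(con, "references", references, overwrite = TRUE)

dbDisconnect(con)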


Regarding the further use of the exported .csv files, I could imagine two helper functions:

  1. One to combine multiple output files into one file. When importing metadata or references for large datasets, it makes sense to use batches, since you might not know how big the resulting file will be. After conversion via jstor_import, you will most likely want to combine all batches into a single file, and this could be done via a helper function. As a name, I could imagine jst_combine_outputs. I would probably opt for a "dumb" function which accepts paths as input, and simply imports, combines, and exports the content.
  2. A simple helper function to import files which were exported from jstor_import. I could try to determine the type of content from the number or, when available, the names of the columns: find_article has 17 cols, find_book has 13, find_chapters has 9, find_authors has 6, and so forth. Other than the names of the columns (if they are present in the .csv) or the number of columns, I don't know how I would determine what content the file has and how to read it. All of this is only to avoid using read_csv(., guess_max = 3000) or similar, since the warnings mainly come from read_csv not guessing the column types correctly (see the sketch below).
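
Until such a helper exists, the workaround is simply to be explicit when re-importing; a sketch (file name and column selection are hypothetical):

library(readr)

# either guess column types from more rows ...
articles <- read_csv("out/articles-1.csv", guess_max = 3000)

# ... or spell out the types for columns that are otherwise guessed wrongly
articles <- read_csv(
  "out/articles-1.csv",
  col_types = cols(
    pub_year   = col_integer(),
    first_page = col_character(),
    .default   = col_guess()
  )
)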

basename_id as unique identifier

I would agree with @elinw that using the name of the files as unique identifier is ok ;) From my understanding, it is the only unique attribute of those files which is the same for all articles. Articles may have up to three different types of ids besides the filename, and simply combining them might lead to unexpected issues. The filename is most likely consistent and allows matching metadata with ngrams/full-text data. However, I like the suggestion to rename it to file_name. This is more straightforward.

Known quirks about the data

I agree with @elinw that there are many quirks in data from DfR, some of which we know about and some of which we do not. I like the idea of having a vignette that collects known quirks and how to deal with them. There are currently a few helpers in the source which are not exported, like find_total_pages, which simply calculates the total number of pages from the first and last page. This is helpful, but may lead to errors when the page range for an article is something like 50-70; 310, which simply means the body of the article runs from 50 to 70, and there is a correction/erratum on page 310. Currently, calculating the total number of pages would result in 261 pages, since 50 is the first and 310 the last page. There are many more quirks, and I cannot claim to have a good grasp of the majority of them. Especially since I only worked with one very specific dataset, I might have found a few quirks in my data, but there might be other problems in other journals or for other applications. Therefore, starting a vignette and letting users contribute their knowledge (if they are willing to) sounds like a great idea to me.
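
To illustrate the quirk numerically (this is just the naive calculation described above, not the package's actual code):

# naive page count from first and last page
naive_total_pages <- function(first_page, last_page) {
  last_page - first_page + 1
}

naive_total_pages(50, 70)   # 21 pages, as expected
# for a page range like "50-70; 310" (body plus an erratum on page 310),
# taking 50 as first page and 310 as last page gives a misleading count:
naive_total_pages(50, 310)  # 261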

Issues with a simple answer

Tasks to undertake

Other issues

Multiple endnotes

@elinw could you point me to the article with multiple endnotes so I can take a look?

Function naming

I thought about changing function names across the package. After I posted the submission, I took a break from the package for about a month. Coming back, I could not remember the names of the functions for a quick analysis :/ I'd like to at least add a prefix to all functions or change the current prefix. The tidyverse style guide encourages the use of verbs, which is why I had find_*. Unfortunately the verb "find" might not be obvious in this context. A simple suggestion could be like the following:

I opted for "jst", since "js" might be similar to javascript, and the default for RStudio autocomplete is to display after three characters anyways.

What do you think about the proposed naming change?


Please let me know if you agree with my roadmap or in case I overlooked an issue you raised.

jsonbecker commented 6 years ago

I would be happy with the renaming to jst_*, but I might still prefer "verbing" to suggest the functionality, e.g. jst_get_article. I wouldn't hold up a πŸ‘ for that. Given the added context, I am satisfied with the move toward file_name and using that as an identifier. I guess I can't wish JSTOR toward having a better data model ;).

I'm also good with database being out of scope.

tklebel commented 6 years ago

It took some time, but I think I am finished with the requested changes. Due to the long duration of the review process, there are many changes altogether, and only some of them are responses to the reviewers' comments.

In my post above I outlined five areas where I would make changes:

All of them are done as of now. The news file, containing all changes, is as follows (or in a prettier version online here: https://tklebel.github.io/jstor/news/index.html):

jstor 0.3.0

Breaking changes

jst_import and jst_import_zip now use the future package as a backend for parallel processing. This makes the internals more compact and reduces dependencies. Furthermore, it reduces the number of arguments, since the argument cores has been removed. By default, the functions run sequentially. If you want them to execute in parallel, use futures:

library(future)
plan(multiprocess)

jst_import_zip("zip-archive.zip",
               import_spec = jst_define_import(article = jst_get_article),
               out_file = "outfile")

If you want to terminate the processes, at least on *nix systems you need to kill them manually (once again).

Importing data directly from zip-files

There is a new set of functions which lets you directly import files from .zip-archives: jst_import_zip() and jst_define_import().

In the following example, we have a zip-archive from DfR and want to import metadata on books and articles. For all articles we want to apply jst_get_article() and jst_get_authors(), for books only jst_get_book(), and we want to read unigrams (ngram1).

First we specify what we want, and then we apply it to our zip-archive:

# specify definition
import_spec <- jst_define_import(article = c(jst_get_article, jst_get_authors),
                                 book = jst_get_book,
                                 ngram1 = jst_get_ngram)

# apply definition to archive
jst_import_zip("zip_archive.zip",
               import_spec = import_spec,
               out_file = "out_path")

If the archive also contains research reports, pamphlets or other ngrams, they will not be imported. We could, however, change our specification if we wanted to import all kinds of ngrams (given that we originally requested them from DfR):

# import multiple forms of ngrams
import_spec <- jst_define_import(article = c(jst_get_article, jst_get_authors),
                                 book = jst_get_book,
                                 ngram1 = jst_get_ngram,
                                 ngram2 = jst_get_ngram,
                                 ngram3 = jst_get_ngram)

Note, however, that for larger archives, importing all ngrams takes a very long time. It is thus advisable to only import ngrams for articles which you want to analyse, i.e. most likely a subset of the initial request. The new function jst_subset_ngrams() helps you with this (see also the section on importing bigrams in the case study).

Before importing all files from a zip-archive, you can get a quick overview with jst_preview_zip().

New vignette

The new vignette("known-quirks") lists common problems with data from JSTOR/DfR. Contributions with further cases are welcome!

New functions

Minor changes


I don't expect you to dive into all the changes, though. I hope that all your comments are addressed with the changes and that we can finish the onboarding soonish πŸ˜„

elinw commented 6 years ago

I'm sorry I'm slow on replying but this seems good. I love the import from zip. I have some things to focus on right now but by the end of the week I'm planning to have updated to the current development branch and try things out. I will also get you an example article.

jsonbecker commented 6 years ago

I am very happy with these changes. I especially like the set and structure of the vignettes and the separation of the "case study". This now serves not only as a well documented package but as solid documentation of the data it works with. This has my full πŸ‘

elinw commented 6 years ago

The article in question (basename_id 25074331):

basename_id:   25074331
journal_doi:   NA
journal_jcode: jbusiethi
article_jcode: 25074331
article_type:  research-article
article_title: Ethics Education in the Workplace: An Effective Tool to Combat Employee Theft
volume:        26
issue:         2
language:      eng
pub_day:       1
pub_month:     7
pub_year:      2000
first_page:    89
last_page:     100

And the matching references (rows 21646–21650), all identical:

Greengard, Samuel: April 1993, 'Theft Control\nStarts with HR Strategies', Personnel Journal, p. 85.

I actually pulled the article and it's true, they cited the same page multiple times. It was striking because it made Greengard the most cited author but I had never heard of him.

tklebel commented 6 years ago

Thanks @elinw for the example. I added a few sentences about issues with references in the vignette and linked the article as an example.

elinw commented 6 years ago

I actually don't think that the article is "wrong"; it's just following the citation style of the journal it is published in. Other journals probably would have used Ibid and op. cit. or similar.

tklebel commented 6 years ago

@elinw I changed my language a bit to remove the certainty of the judgement, though I would still consider it an artifact, since they use Ibid too, but only for some references. However, there might be another reason for this which I am not aware of. I therefore rephrased it like this: "[the article cites ...] which is possibly an artifact."

tklebel commented 6 years ago

@karthik just pinging to ask if you could take a quick look at this issue. The current status seems to be:

How will this onboarding proceed?

karthik commented 6 years ago

Hi @tklebel Apologies for the delay. I'm just waiting on @elinw to sign off, and with that we can finish up. Elin: do you have any further unresolved issues with the master release of this effort? If not, could you indicate sign-off so we can proceed with acceptance? πŸ™

elinw commented 6 years ago

Yes, I am good with these changes, the last of which are mainly just polishing documentation etc. This is going to be really useful and will encourage the use of DfR. I noticed the change from stop() to abort() and read about the differences; my suggestion is that you call it explicitly as rlang::abort(), which you'll need to do for CRAN anyway.

Great job!

tklebel commented 6 years ago

@elinw I'm actually not sure if I really prefer abort over stop in general. I used it to remove the calls to call. = FALSE. But since I import it via importFrom(rlang, abort) in NAMESPACE, this should be fine for CRAN, right? If not, I will of course change it before submitting.
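
For reference, a minimal sketch of the two styles in question (the check and the message are made up):

# base R, suppressing the call in the error message
check_file <- function(file_path) {
  if (!file.exists(file_path)) {
    stop("`file_path` does not exist.", call. = FALSE)
  }
  invisible(file_path)
}

# rlang equivalent, which omits the call from the message by default
check_file_rlang <- function(file_path) {
  if (!file.exists(file_path)) {
    rlang::abort("`file_path` does not exist.")
  }
  invisible(file_path)
}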

Apart from that, I am very happy for the thumbs up! This comes just in time before useR! πŸ˜ƒ

karthik commented 6 years ago

Congrats @tklebel, your submission has been approved! πŸŽ‰ Thank you for submitting, and thanks to @elinw and @jsonbecker for thorough and timely reviews. To-dos:

[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/technotes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

tklebel commented 6 years ago

Great! Thank you @karthik for managing the application, and @jsonbecker and @elinw for your helpful comments!

I transferred the repo, added the rOpenSci footer and changed all relevant links in the README.

I'll see to getting the package ready for CRAN, and I will address the suggestion by @elinw while doing so.

I'd very much like to write a blog post with some narrative about the development of the package. @stefaniebutland, I should have time to write this post some time after July.

tklebel commented 6 years ago

@karthik I changed more or less all the links I found, and the build systems (Travis and AppVeyor) are working. The only missing part, as far as I can see, is admin rights for the repo, so I can update a few remaining things (mainly the link to the pkgdown site). Then we could close the issue, I think.

stefaniebutland commented 6 years ago

I'd very much like to write a blog post with some narrative about the development of the package. @stefaniebutland, I should have time to write this post some time after July.

Very glad to hear that @tklebel. This link will give you many examples of blog posts by authors of onboarded packages so you can get an idea of the style and length you prefer: https://ropensci.org/tags/review/.

Here are some technical and editorial guidelines for contributing a blog post: https://github.com/ropensci/roweb2#contributing-a-blog-post. I'll mark my calendar to follow up with you in mid-August so we can agree on a publication date. We ask that you submit your draft post via pull request a week before the planned publication date so we can provide feedback.

Happy to answer any questions as they come up.

stefaniebutland commented 6 years ago

@tklebel I have blog post publication dates available Sept 18, Oct 2, 9, 16. If you're still interested, let me know your preferred date and mark your calendar to submit a draft a week in advance.

Please see my comment above for more details.

tklebel commented 6 years ago

@stefaniebutland I'm still interested in writing a blog post. Regarding publication dates, I would prefer October 9th, if that is ok for you.

stefaniebutland commented 6 years ago

Glad to hear that @tklebel. Tuesday 2018-10-09 is yours. Please submit a draft via pull request by 2018-10-02 and don't hesitate to ask any questions here.

tklebel commented 6 years ago

@karthik is it possible, that I am still missing appropriate rights for the repo? I wanted to fix the link in the title of the repo, but I still couldn't do it (someone else fixed it for me in the meantime).

karthik commented 6 years ago

@tklebel Can you please check now?

tklebel commented 6 years ago

@karthik perfect, thanks!

stefaniebutland commented 5 years ago

Hi @tklebel. Checking to see if you're still on track to submit a draft blog post by 2018-10-02 for publication the following week.

tklebel commented 5 years ago

Hi @stefaniebutland. Is it ok if the post is more about why and how I wrote the package, including what I learned along the way, and less about how to use it? I have a lot of documentation, including a case study on the pkgdown site, so I don't think it would be helpful or interesting to write something similar again...

If you are fine with that, then I would answer that I am on track 😺

stefaniebutland commented 5 years ago

Sounds great @tklebel. More people will likely relate to this kind of post. Please include a short section noting & linking to the docs you have.