ropensci / software-review

rOpenSci Software Peer Review.

refimpact submission #78

Closed perrystephenson closed 7 years ago

perrystephenson commented 8 years ago

Summary

Provides wrapper functions around the UK Research Excellence Framework 2014 Impact Case Studies Database API http://impact.ref.ac.uk/. The database contains relevant publication and research metadata about each case study as well as several paragraphs of text from the case study submissions. Case studies in the database are licenced under a CC-BY 4.0 licence http://creativecommons.org/licenses/by/4.0/legalcode.

Package: refimpact
Title: API Wrapper for the UK REF 2014 Impact Case Studies Database
Version: 0.1.0
Authors@R: person("Perry", "Stephenson",
    email = "perry.stephenson+cran@gmail.com",
    role = c("aut", "cre"))
Description: Provides wrapper functions around the UK Research
    Excellence Framework 2014 Impact Case Studies Database API
    <http://impact.ref.ac.uk/>. The database contains relevant publication and
    research metadata about each case study as well as several paragraphs of
    text from the case study submissions. Case studies in the database are
    licenced under a CC-BY 4.0 licence
    <http://creativecommons.org/licenses/by/4.0/legalcode>.
License: MIT + file LICENSE
URL: https://github.com/perrystephenson/refimpact
BugReports: https://github.com/perrystephenson/refimpact/issues
Encoding: UTF-8
LazyData: true
Depends:
    R (>= 3.2.5)
Imports:
    curl (>= 2.1),
    jsonlite (>= 1.1),
    tibble (>= 1.2)
RoxygenNote: 5.0.1
Suggests: testthat

https://github.com/perrystephenson/refimpact

People interested in the following:

  1. Research in the UK and its global impact
  2. Relationships between academic institutions
  3. Relationships between impactful publications
  4. A high-quality and homogeneous text resource for text mining

Are there other R packages that accomplish the same thing? If so, what is different about yours?

There are no other packages for accessing this data, as far as I am aware.

Requirements

Confirm each of the following by checking the box. This package:

sckott commented 8 years ago

Editor checks:


Editor comments

Thanks for your submission @perrystephenson

In assessing fit, I have a few questions:

Update 2016-10-07:

Will be seeking reviewers, but in the meantime I ran goodpractice::gp() on your package. Here are the results, which may be helpful to work through before review begins (you can run it yourself and dig into the results further):

It is good practice to

  ✖ write unit tests for all functions, and all package code
    in general. At least some lines are not covered in this package.

    R/get_tag_values.R:25:NA

  ✖ use '<-' for assignment instead of '='. '<-' is the
    standard, and R users and developers are used to it and it is easier
    to read your code for them if you use '<-'.

    R/search_case_studies.R:46:11
    R/search_case_studies.R:46:23
    R/search_case_studies.R:46:36
    R/search_case_studies.R:46:51

  ✖ avoid long code lines, it is bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try make your lines shorter than 80 characters

    R/get_institutions.R:18:1
    R/get_tag_values.R:21:1
    R/get_tag_values.R:25:1
    R/get_tag_values.R:29:1
    R/get_units_of_assessment.R:17:1
    ... and 15 more lines

  ✖ fix this R CMD check error/warning/note: Namespace in
    Imports field not imported from: ‘curl’ All declared Imports should
    be used.

Reviewers: @njahn82 @kbenoit
Due date: 2016-11-01

perrystephenson commented 8 years ago

Hi @sckott

Thanks for getting back to me so quickly! I'll reply in order:

  1. It has been used predominantly in text mining research projects, an example is here. We're also working on it at the University of Technology, Sydney (UTS) with a view to building tools which can help researchers improve the quality of their academic writing by making it "more like" an existing body of known high quality text. This is related to our work on the Academic Writing Analysis project.
  2. I have looked around previously for an answer to this question, and everything I have read seems to imply that the data will be around for a while. I suspect that they will build up the dataset over time, which will mean adding the 2019 submission data when it becomes available.
  3. I am not aware of similar datasets in other countries (unfortunately). If my project proves successful then perhaps UTS will lobby the Australian Research Council to release a similar dataset.
  4. The text available through the service is the full text of over 6,000 research impact case study submissions, as well as a few structured and semi-structured pieces of metadata. This is a large collection of similar documents, with consistent grammar and minimal spelling errors. I am planning to write a vignette for the next CRAN release, which will incorporate some of my work on the project mentioned in point 1.

Thanks!

sckott commented 8 years ago

Thanks, I'll consult with my colleagues and get back ASAP

sckott commented 8 years ago

@perrystephenson checked the fit box ✔️

Seeking reviewers now, see updated comment above about goodpractice

sckott commented 8 years ago

reviewers now assigned

sckott commented 7 years ago

@njahn82 @kbenoit sorry about the ping from the bot (message deleted, but you probably got an email). still working out the rough spots

kbenoit commented 7 years ago

No problem!

njahn82 commented 7 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Paper (for packages co-submitting to JOSS)

The package contains a paper.md with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 3 hours


Review Comments

This package gives access to UK's 2014 Research Excellence Framework (REF) Impact Case Studies. To my knowledge, this is one of the first comprehensive efforts to share reports under an open content license, making it an interesting source for studying science, science communication and funding activities.

The package interfaces the REF2014 API. Although the underlying database appears to be a complete body of work, obtaining the data from the API rather than shipping it as package data is very reasonable, because CRAN limits the size of data files to reduce installation time. Besides, the API provides various query methods, all of which are accessible through the package. The main function of refimpact is get_case_studies().

The author did a nice job interfacing the API with jsonlite and documenting the functions. Using tibble is a great choice to encourage re-use of the fetched data within R. The package is also available via CRAN, so it has passed essential tests when sharing an R package. However, I feel that the package could benefit from some revisions in order to increase code quality and rOpenSci compliance.

Use httr to work with and manipulate http requests

I would strongly recommend using httr for building and accessing data from web APIs. httr provides useful functions to create API queries, and increases the reliability of packages when something breaks. This httr vignette is a great introduction about how to write an API package.
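For illustration, a minimal httr-based request might look something like this (the endpoint path below is purely illustrative, not the package's actual URL structure):

```r
library(httr)

# Illustrative only: the path is an assumption on my part; the real API
# lives somewhere under http://impact.ref.ac.uk/
resp <- GET("http://impact.ref.ac.uk/api/SearchCaseStudies",
            query = list(ID = 27))
stop_for_status(resp)  # turn HTTP errors into informative R errors
txt <- content(resp, as = "text", encoding = "UTF-8")
```

stop_for_status() in particular gives users a meaningful error when the service misbehaves, rather than a downstream JSON-parsing failure.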

More specifically:

Provide a vignette

The package would benefit from a vignette, a long-form piece of documentation on how to use the package. The README nicely summarises who might be interested in using refimpact, how to install it, and how to access the data. This could be a great starting point, continued with a demonstration of how to solve particular problems researchers face when they study science and its funding. In particular, it would be nice to see how to work with refimpact output using other R packages written for text-mining or network analysis. I also see a lot of potential in interfacing refimpact data with literature packages to get more detailed bibliographic records.

README

Functions

Contributing

R CMD check

There was one NOTE when running R CMD check locally:

* checking dependencies in R code ... NOTE
Namespace in Imports field not imported from: ‘curl’
  All declared Imports should be used.

In summary, the package is a very good start, but could be improved by better aligning the code base and documentation with the rOpenSci guidelines.

Please let me know if something is unclear or doesn't make sense. I am happy to help!

sckott commented 7 years ago

@kbenoit - hey there, it's been 29 days, please get your review in soon, thanks :smiley_cat: (ropensci-bot)

perrystephenson commented 7 years ago

Thanks @njahn82, I appreciate the detailed feedback! I probably won't get a chance to make changes right away, but I'll make sure that all of your feedback is addressed in the next version of the package.

kbenoit commented 7 years ago

Sorry folks, I have been in transit and under the gun for a deadline that passed Oct 31. I will do my best to complete my review by 4 November (Friday).

sckott commented 7 years ago

thanks for the update @kbenoit

kbenoit commented 7 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Paper (for packages co-submitting to JOSS)

The package contains a paper.md with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Review Comments

This is a very specific package aimed at one thing: making it easy for researchers to pull data from the 2014 UK Research Excellence Framework API, for subsequent analysis. The package wraps API calls detailed on the REF website and returns conveniently classed R objects, for direct analysis in R.

Installation

Installed easily, passed full check with --as-cran toggled on my system. (I did not experience the NOTE about curl described by the other reviewer.)

Tests

The test coverage looks quite impressive (98%), although many or all of these tests could be made much more robust. For instance, the single test in test_tag_types.R is:

test_that("get_tag_types() returns a tibble", {
  # skip_on_cran()
  expect_equal(dim(get_tag_types()),c(13,2))
})

which does not actually test whether the return value is a tibble; it only matches dimensions known in advance. A more robust test might compare the returned values to the known tags from the REF website, or check that the return value really is a tibble-class object.

This is more than just pickiness, since all of the checks that the package actually works live in the testthat tests: no function's Examples section is run. I realise it is difficult to run API calls in tests, but it might be possible to run some of the examples by tagging them as \donttest{} instead of \dontrun{}.
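For instance, a more robust version might look something like this (a sketch; I have not run it against the live API):

```r
test_that("get_tag_types() returns a sensible tibble", {
  skip_on_cran()
  tag_types <- get_tag_types()
  expect_is(tag_types, "tbl_df")  # the return value really is a tibble
  expect_gt(nrow(tag_types), 0)   # non-empty, without hard-coding dim()
})
```

This checks the class and non-emptiness directly instead of relying on a dimension that could legitimately change if the database grows.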

Adherence to community guidelines

Here I will suggest a few easy-to-make changes:

Adherence to ROpenSci packaging guidelines

Documentation

This is where the package is weakest, since the documentation is limited to the barest of documentation for the functions. Even those do not describe much about the context of the options or inputs: for instance get_case_studies() did not really explain what the ID is or why I might use it instead of a UKPRN. I suggest that this could be vastly expanded in a Details section and added as an overview to refimpact-package.Rd. I had to do considerable testing and reading from the UK REF website before I understood what were the inputs to the functions.

Of course, a vignette is what this package is really crying out for, to show not only how to use it but why, including some analysis of the data to show the practical motivation for querying it through an R package. This would also get around the problem of not being able to run all of the examples live, since they involve remote API calls.

Motivation

As someone who was actually submitted as an individual in this exercise (although not for a case study) from one of the listed institutions, and who was personally involved in managing the staff submitted from my institution, my first impression was that this would be a really cool package to have for accessing this data. However, the more I experimented with the package, the more I wondered why the API approach was needed. The data is static, so the only reasons not to package the data are copyright and size. Most of the information (except some of the case studies; see http://impact.ref.ac.uk/CaseStudies/Terms.aspx) is governed by a CC licence, and so could easily be packaged as data. The size objection applies only to the case studies themselves; but again, if the documentation or README.md said more about the motivation and the scale of the data, I would have a better idea of just how large it is (and whether that size makes it unsuitable to provide simply as one large flattened data.frame or tibble).

The following static tables from the API are CC licensed and could easily be packaged as built-in objects:

This seems to gut the package's functions, since it leaves only get_case_studies(), which might appropriately be handled through an API call. But here I suggest the package could really add value with data-handling functions that link the static data objects to the structure of what get_case_studies() returns, such as ways to flatten the lists that are elements of the return objects. For instance, the return value of get_case_studies(ID = c(27, 29)) is a 2 x 19 tibble, but several of its columns (e.g. Continent) are variable-length lists. Many users who are not experts in dissecting R objects are going to have trouble with lists nested within data.frames.
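A sketch of what such a flattening helper could look like (the function name is hypothetical, and I assume the list elements are atomic vectors):

```r
# Collapse a list-column (e.g. Continent) into a plain character column,
# joining multiple values with "; "
flatten_list_col <- function(df, col) {
  df[[col]] <- vapply(df[[col]],
                      function(x) paste(unlist(x), collapse = "; "),
                      character(1))
  df
}
```

A non-expert user could then filter or export the result like any ordinary data.frame column.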

In addition, by having the smaller objects as built-in data, the inputs to get_case_studies() can be checked for valid values, rather than relying on the API to reject a non-existent ID, for instance.
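A sketch of that kind of check (names hypothetical; known would come from one of the bundled lookup tables):

```r
# Validate requested IDs against a known set before hitting the API, so the
# user gets an immediate, readable error instead of a remote rejection.
validate_ids <- function(ID, known) {
  unknown <- setdiff(as.character(ID), as.character(known))
  if (length(unknown) > 0) {
    stop("Unknown case study ID(s): ", paste(unknown, collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}
```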

sckott commented 7 years ago

thanks very much for the review @kbenoit !

@perrystephenson both reviews in now. let us know if you have any questions. continue discussion here with me and reviewers.

perrystephenson commented 7 years ago

Thanks everyone for the considerable effort you have put in to reviewing my work - I can see why feedback is the first benefit listed in the onboarding README! I've never had such a thorough review of my code before, so I really appreciate all the advice.

I've created a whole bunch of issues in the refimpact repository, and I'll start working through them when I get a chance. I'll probably have a few questions (mainly around the benefits of a split data/API package) but I'll wait until I've had a chance to think about the issues more fully before I do.

Thanks again, and I'll let you know how I go!

kbenoit commented 7 years ago

@perrystephenson Happy to help further with the process.

sckott commented 7 years ago

@perrystephenson let me know if I can help the process move along

perrystephenson commented 7 years ago

Hi Scott, sorry I've been a little derailed looking after a new puppy :) I'm planning to get back to this next week, and I'll hopefully have it ready for another review by the end of the month.

Sent from my iPhone

On 17 Jan 2017, at 7:17 am, Scott Chamberlain notifications@github.com wrote:

@perrystephenson let me know if I can help the process move along


sckott commented 7 years ago

ok

perrystephenson commented 7 years ago

I've missed the self-imposed end-of-January deadline, but I've made fairly significant progress working through the issues raised above. You can see how I'm tracking against the issues here: https://github.com/perrystephenson/refimpact/issues.

sckott commented 7 years ago

thanks @perrystephenson for the update

perrystephenson commented 7 years ago

I've finished re-writing the package based on your excellent feedback - I don't think there is a single line of code remaining from the original package you reviewed! This has necessarily involved a breaking change to the function syntax (there is now only one user-facing function, named ref_get), so the version number has been incremented to v1.0.0.

I have a couple of notes - excuses if you will!

I am happy to go through the changes in more detail, but given that one of the critiques was that the package was hard to use for new users, I think it might be best if I leave you to your own devices with the documentation in the package. Hopefully you will find it far more helpful than before!

Happy to answer any questions, and make any more changes that you suggest. When you're happy with the package, I will tag the v1.0.0 release to align with NEWS.md.

Thanks again for all of your time!

maelle commented 7 years ago

@perrystephenson Regarding the encoding NOTE, see Julia Silge's post and her CRAN comments (her package was accepted with the NOTE)

sckott commented 7 years ago

@njahn82 @kbenoit if you have time soon, please let us know if you are happy with @perrystephenson changes. if you don't have time soon, also let me know.

kbenoit commented 7 years ago

I would not worry about the NOTE for the non-ASCII data. quanteda has had these since the beginning. Brian Ripley gave the beginnings of a grumble about it, once, but as long as you have

Encoding: UTF-8

in DESCRIPTION then don't worry about it.

kbenoit commented 7 years ago

Sorry it's taken a while to comment on the revision... I think the update is greatly improved. The vignette is a welcome addition and the tests significantly improved. As I mention above, don't worry about the UTF-8 note.

I still think you could include a lot of the smaller bits of data that are currently accessed through API calls, namely institutions, units_of_assessment, and tag_types. But it works well enough to query these too.

I also still think you could (easily!) add more description to the ?refimpact page.

Finally, there are some cool new features in roxygen2 6.0 for documentation that you could use, esp for documenting data and packages. Not at all important for the approval process but definitely worth checking out!

sckott commented 7 years ago

thanks @kbenoit !

@perrystephenson any thoughts on his comments above?

njahn82 commented 7 years ago

Hi @perrystephenson

Apologies for my late review of your thorough package revision. I am very happy to see that httr is now implemented for interacting with the web API and that both vignette and README guide users through the package.

Here are some some follow-up points after having time to look into your revision:

Ensure backward compatibility

As introduced, there is now only one function for calling the API. However, I am afraid that the removed functions are not marked as .Deprecated. This is important to ensure backward compatibility with the existing CRAN release. Hadley Wickham's R Packages book contains a chapter describing how to deprecate functions:

http://r-pkgs.had.co.nz/release.html
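A deprecation shim along those lines might look like this (assuming get_case_studies() was one of the removed function names):

```r
#' @export
get_case_studies <- function(...) {
  # warn users that the old interface is going away, pointing at the new one
  .Deprecated("ref_get", package = "refimpact")
  ref_get("SearchCaseStudies", query = list(...))
}
```

In a later release the body can be switched to .Defunct() before the function is removed entirely.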

Function input

I wonder if it would be possible to split up the query parameters, so that I can provide something like

ref_get("SearchCaseStudies", id = "2", tags = "dkkdk")

instead of a list object.

I would still not allow numeric input for identifying the studies. In my opinion, it is best practice that IDs should be of class character.

Simplify nested lists?

I second @kbenoit's comment that learners and infrequent users of R are often unfamiliar with nested lists. The vignette gives good guidance on how to select list elements. However, it would be great to have at least one parser that returns a subset of the data as a tidy data.frame. Such a parser could be a great time saver for users who simply wish to get an overview of the data and export that selection.

Vignette and README

The vignette and README contribute greatly to a better understanding of the aim and usage of this package. In particular, the benefits and limits of the API implementation are very well described. One API limitation (there is no call for retrieving all datasets at once) is nicely worked around by showing how to extract the entire dataset in pieces. There is some potential to shorten the example by using lapply instead of a for-loop:

# load dev version
devtools::install_github("perrystephenson/refimpact")
#> Skipping install of 'refimpact' from a github remote, the SHA1 (a1ccaafb) has not changed since last install.
#>   Use `force = TRUE` to force installation
# get IDs
uoa_table <- refimpact::ref_get("ListUnitsOfAssessment")
# apply 'ref_get' over all IDs
tt <- lapply(uoa_table$ID, function(x) refimpact::ref_get("SearchCaseStudies", 
  query = list(UoA = x)))

Please provide at least one usage example in the README.

Finally, I wonder if you want to include your vignette as part of the README. If so, simply add

to your README.Rmd.

sckott commented 7 years ago

thanks @njahn82 for your comments!

perrystephenson commented 7 years ago

Thanks everyone,

I'll try and respond to these in order.

@kbenoit

I've made comments about the bundling of the smaller tables here. If the case studies were small enough to bundle with the package then I would have just gone with a 100% data package, but given that users must use the API to make use of the package, there isn't any harm in getting the users to pull down the small tables. The possible benefit is that if there are new records added to the database, then users will get these records.

I have raised an issue for the help ?refimpact documentation and will sort it out soon.

And I'll take a look at Roxygen2 6.0!

@njahn82

I thought about backwards compatibility, and my thinking was along the lines of:

  1. It has barely any downloads (672 from the RStudio CRAN mirror as of today), and a Google search for library(refimpact) didn't return any meaningful hits. I doubt anyone is going to notice the lack of backward compatibility.
  2. It has no reverse dependencies on CRAN
  3. The old function names were not compliant with rOpenSci guidelines
  4. I incremented the major version number, which indicates breaking changes

I'm happy to take guidance on this if you think backwards compatibility is worth maintaining!

I can definitely split up the list - I left it as a list partly due to convenience (httr takes a list) and partly due to having seen that pattern in many other packages (and in fact in lots of base R functions). Is there a reason you think it's better as individual arguments?

Regarding the simplification of nested lists into useful data structures, I had a bit of a look at what this would take, and there is a fair bit of work in it. I've left the issue open and tagged it as something to look at for v1.1, unless you think it's important enough that it needs to be part of v1.0. I suspect it's going to take a while to think through all of the use cases though.

I'll have another look at the vignette and README based on your comments. I might use purrr to shorten the code example above, as I don't think the apply functions are particularly good at conveying meaning to users looking at vignettes, and purrr tends to be a little easier to read.
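For example, something like this (untested) purrr version of the loop:

```r
library(purrr)
library(refimpact)

# Get the list of Units of Assessment, then fetch the case studies for each
uoa_table <- ref_get("ListUnitsOfAssessment")
case_studies <- map(uoa_table$ID,
                    ~ ref_get("SearchCaseStudies", query = list(UoA = .x)))
```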

njahn82 commented 7 years ago

Thanks for your reply, @perrystephenson

I am unsure what policies and good practices concerning deprecated or defunct functions are in place at rOpenSci; perhaps @sckott can help? There may be users of your package who would appreciate finding guidance about the changes.

Regarding the ref_get() input, I have found it good practice among web API packages to call functions with simple atomic vectors, because they make it easier to grasp what input is required. In the case of ref_get() I had to check the API documentation to get an idea of what search parameters exist and what they mean. So I suggest at least expanding the ref_get() documentation with examples that cover the five existing query parameters.

It is great that you are planning to provide some parsers, and if you would like to add them in another release I am ok with it.

Looking forward to finding purrr examples in the vignette!

sckott commented 7 years ago

In terms of deprecating and making defunct: since the package is already on CRAN, I advise making a function deprecated in the next package version, then making it defunct in the version after that. That way users get one version with a warning that the function is going away, and in the next version it's gone. And do maintain man files for deprecated and defunct functions.

sckott commented 7 years ago

any thoughts @perrystephenson ?

perrystephenson commented 7 years ago

Hi @sckott and sorry for the delay. Following from the discussion above:

I've decided to leave the "get all tags" example in the vignette as a loop. I think that loops are likely to be more readily understood by users who are new to programming (and new to R), which makes it a bit more friendly than using an apply function or purrr.

These changes have all been pushed to GitHub, so feel free to take a look.

Thanks again for the feedback!

sckott commented 7 years ago

thanks @perrystephenson - vacation for a week now, will have a look when I get back, sorry for the delay!

sckott commented 7 years ago

@perrystephenson looking good, checked it over, and nearly all is good. Just one thing

here https://github.com/perrystephenson/refimpact/blob/master/R/ref_get.R#L120 you are parsing content and then parsing JSON, whereas the HTTP status code check happens after that - why not do the status code check directly after https://github.com/perrystephenson/refimpact/blob/master/R/ref_get.R#L109

perrystephenson commented 7 years ago

Hi @sckott, excellent question. The reason for this ordering is that the API gives useful error messages for some error conditions, and those messages are worth passing on to the user. So, following the API call, I first check whether the response is JSON (and throw an error if not), then parse the JSON (which I now know I can do safely) to get either the content or the error message. If there is an error, I add the parsed error message to the error string presented to the user.

The alternative is to make the error messages less useful, which I think would be an inferior outcome.
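In code, the flow is roughly as follows (a simplified sketch rather than the actual package source; the parsed$error field name is an assumption):

```r
library(httr)
library(jsonlite)

resp <- GET(url)  # url is built earlier from the query parameters

# 1. Check that the response is JSON before attempting to parse it
if (http_type(resp) != "application/json") {
  stop("API did not return JSON", call. = FALSE)
}

# 2. Parse the JSON (safe now) to get either the payload or the error message
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# 3. Only now check the HTTP status, so the API's own error message can be
#    included in what the user sees
if (http_error(resp)) {
  stop("API request failed: ", parsed$error, call. = FALSE)
}
```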

sckott commented 7 years ago

True, you do check for JSON first, then parse after that is confirmed, so seems okay. - Def. agree to give useful error messages whenever possible

sckott commented 7 years ago

Approved!

perrystephenson commented 7 years ago

Thanks @sckott!

I've done all of those tasks now - is there anything I need to do before submitting this new version to CRAN? Do I need to add any rOpenSci references to DESCRIPTION?

Once again, thanks to everyone for the incredibly detailed feedback - I have learned a lot through the process! I really appreciate the time and effort you've committed to this process.

sckott commented 7 years ago

is there anything I need to do before submitting this new version to CRAN? Do I need to add any rOpenSci references to DESCRIPTION?

Just make sure to update the URLs in your DESCRIPTION file; right now they have the old GitHub username.

thanks for your submission!