stats19 - Githubissues

Robinlovelace commented 6 years ago

Summary

What does this package do? (explain in 50 words or less):

The goal of stats19 is to make it easy to work with road crash data. Specifically it enables access to and processing of the UK’s official road traffic casualty database, which is called STATS19.

Paste the full DESCRIPTION file inside a code block below:

Package: stats19
Title: Work with open road traffic casualty data from Great Britain
Version: 0.1.0.9000
Authors@R: c(
    person("Robin", "Lovelace", email = "rob00x@gmail.com", role = c("aut", "cre"),
    comment = c(ORCID = "0000-0001-5679-6536")),
    person("Malcolm", "Morgan", email = "m.morgan1@leeds.ac.uk", role = c("aut")),
    person("Layik", "Hama", email = "layik.hama@gmail.com", role = c("aut"))
  )
Description: Tools to help process and analyse the UK road collision data also known as Stats19.
Depends: R (>= 3.5.0)
License: GPL-3
Encoding: UTF-8
LazyData: true
Imports: 
    sf,
    foreign,
    readr,
    dplyr,
    lubridate,
    readxl
Suggests: 
    knitr,
    rmarkdown,
    testthat
VignetteBuilder: knitr
RoxygenNote: 6.1.1

URL for the package (the development repository, not a stylized html page): https://github.com/ITSLeeds/stats19
Please indicate which category or categories from our package fit policies this package falls under and why? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):
data retrieval: downloads and formats publicly available road safety data that is difficult (impossible for most people) to use effectively without
Who is the target audience and what are scientific applications of this package?

Academic, industry and public sector researchers investigating road safety, specifically people wanting to perform natural experimental studies to find out how best to make the transport system safer. It will also be of interest for people interested in large point-on-network data for methodological applications.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

Previous packages/code-bases:

code in the bikeR repo underlying an academic paper on collisions involving cyclists
functions in stplanr for downloading Stats19 data
updated functions related to the CyIPT project

but this package takes the best of the preceding packages and adds new features including:
automated and reproducible data formatting code based on DfT's official guidelines (before labels were hard-coded)
continuous integration
integration with sf (the stplanr implementation used sp)
inclusion of much more data (back to 1974 and until 2017, previous work only covered 1985:2015)
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Requirements

Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
[x] I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

[x] Do you intend for this package to go on CRAN?
[x] Do you wish to automatically submit to the Journal of Open Source Software? If so:
- [x] The package has an obvious research application according to JOSS's definition.
- [x] The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
- [ ] The package is deposited in a long-term repository with the DOI:
- (Do not submit your package separately to JOSS)
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

[x] Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
[x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
Currently the style uses = assignment, which differs from the recommended style
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

sckott commented 6 years ago

thanks for your submission @Robinlovelace We are discussing now and will get back to you soon

Robinlovelace commented 6 years ago

Great. It's in active development as you'll see. We have code to make all those millions of crash points sf data frames, we've just not pushed it. @layik has done great work refactoring the download functions so it makes it much easier to access this important dataset. Looking forward to hearing back from you and thanks for the update.

Robinlovelace commented 5 years ago

Heads-up @sckott format_sf() is now in there and the package now has a proper vignette. Are there any other things we should add at this stage for peer review? Thanks.

sckott commented 5 years ago

will dig into it today, i'll let you know

sckott commented 5 years ago

Editor checks:

[x] Fit: The package meets criteria for fit and overlap
[x] Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
[x] License: The package has a CRAN or OSI accepted license
[x] Repository: The repository link resolves correctly
[ ] Archive (JOSS only, may be post-review): The repository DOI resolves correctly
[ ] Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Thanks for your submission @Robinlovelace !

Here's the output from goodpractice. No need to address these now, it's info for reviewers to use to get started.

── GP stats19 ─────────────
It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 95% of code lines are covered by test cases.

    R/dl_stats19.R:82:NA
    R/dl_stats19.R:83:NA
    R/format.R:55:NA
    R/format.R:106:NA
    R/format.R:157:NA
    ... and 12 more lines

  ✖ add a "URL" field to DESCRIPTION. It helps users find information about your package online. If your package does not have a homepage, add an URL
    to GitHub, or the CRAN package package page.
  ✖ add a "BugReports" field to DESCRIPTION, and point it to a bug tracker. Many online code hosting services provide bug trackers for free,
    https://github.com, https://gitlab.com, etc.
  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and developers are used it and it is easier to read your code for them if
    you use '<-'.

    R/dl_stats19.R:28:12
    R/dl_stats19.R:30:9
    R/dl_stats19.R:31:16
    R/dl_stats19.R:32:11
    R/dl_stats19.R:34:18
    ... and 204 more lines

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try make your lines
    shorter than 80 characters

    R/dl_stats19.R:28:1
    R/dl_stats19.R:104:1
    R/dl_stats19.R:105:1
    R/format.R:42:1
    R/format.R:98:1
    ... and 14 more lines

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data. Consider using vapply() instead.

    R/format.R:257:20

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They are error prone and result 1:0 if the expression on
    the right hand side is zero. Use seq_len() or seq_along() instead.

    R/format.R:247:11

  ✖ fix this R CMD check NOTE: Namespaces in Imports field not imported from: ‘dplyr’ ‘foreign’ ‘lubridate’ All declared Imports should be used.
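For reviewers unfamiliar with the sapply() and 1:length() notes above, here is a minimal base R sketch of why goodpractice flags them. The variable names are made up for illustration and nothing here is taken from the stats19 code.

```r
# sapply() changes its return type with the input; vapply() returns the declared type.
empty <- character(0)
sapply_result <- sapply(empty, nchar)               # a list for zero-length input
vapply_result <- vapply(empty, nchar, integer(1))   # an integer vector, as declared

# 1:length(x) counts down to zero when x is empty; seq_along(x) is empty.
x <- character(0)
loop_bad <- 0L
for (i in 1:length(x)) loop_bad <- loop_bad + 1L    # iterates over c(1, 0)
loop_ok <- 0L
for (i in seq_along(x)) loop_ok <- loop_ok + 1L     # never iterates
```

With empty input the first loop body runs twice and the second not at all, which is the failure mode the note warns about.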

Seeking reviewers now 🕐

Reviewers:

@adamhsparks due date: 2019-01-02
@daranzolin due date: 2019-01-02

Robinlovelace commented 5 years ago

Thanks @sckott. I'd seen goodpractice but never used it - that is very useful. After a few commits stats19 is improved. Latest results:

── GP stats19 ───────────────────────────────────────────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code in
    general. 95% of code lines are covered by test cases.

    R/dl_stats19.R:85:NA
    R/dl_stats19.R:86:NA
    R/format.R:43:NA
    R/format.R:88:NA
    R/format.R:133:NA
    ... and 12 more lines

  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and
    R users and developers are used it and it is easier to read your code for
    them if you use '<-'.

    R/dl_stats19.R:28:12
    R/dl_stats19.R:33:9
    R/dl_stats19.R:34:16
    R/dl_stats19.R:35:11
    R/dl_stats19.R:37:18
    ... and 203 more lines

sckott commented 5 years ago

Reviewers assigned:

@adamhsparks due date: 2019-01-02
@daranzolin due date: 2019-01-02

daranzolin commented 5 years ago

Prolegomena

Thanks to rOpenSci for the invitation to review, and special thanks to Robin, Malcolm, Layik, and Mark for this important package. I'm already a huge fan.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need clearly stating problems the software is designed to solve and its target audience in README

I would tease a bit more of what's in these data sets. I wasn't entirely sure until I downloaded and opened the supporting documentation. If I were searching for this kind of data, and I didn't know what STATS19 was, I'd like to know I'm in the right place after scanning the README. Maybe a map?

[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally

I couldn't load the vignette from the console:

vignette(package = "stats19")
#> no vignettes found

Admittedly, I also couldn't load the vignette for my own rOpenSci package, so I'm not sure what's missing.

[X] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally

Several of the examples failed:

read_casualties -- argument for dl_stats19 should be year instead of years
read_vehicles -- I think year must be specified and presupposes an existing download of Vehicles data, e.g. ve = read_vehicles(year = 2017)

stats19::read_vehicles()
#> Error in if (year %in% 1980:2003) {: argument is of length zero

check_input_file -- no example included
format_accidents -- same as above, argument for dl_stats19 should be year instead of years
format_casualties -- same as above, argument for dl_stats19 should be year instead of years
format_vehicles -- same as above, argument for dl_stats19 should be year instead of years
format_sf -- read_accidents requires a year argument
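The "argument is of length zero" error above is what if() reports when its condition is empty, which happens when year is NULL. A minimal sketch of an input check that would give a clearer message follows; check_year() is a hypothetical helper for illustration, not a function in stats19.

```r
# Hypothetical helper (not part of stats19): validate `year` before branching on it.
check_year <- function(year = NULL) {
  if (is.null(year) || length(year) != 1L || is.na(year)) {
    stop("`year` must be a single non-missing value, e.g. year = 2017",
         call. = FALSE)
  }
  as.integer(year)
}

# With no year supplied the membership test is empty, so if() cannot evaluate it.
year <- NULL
condition_length <- length(year %in% 1980:2003)

# The helper turns that into an explicit, readable error instead.
msg <- tryCatch(check_year(year), error = function(e) conditionMessage(e))
ok <- check_year(2017)
```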
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

I couldn't find any explicit contributing guidelines in the README, and there is no CONTRIBUTING document.

For packages co-submitting to JOSS

[X] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[x] A short summary describing the high-level functionality of the software
[x] Authors: A list of authors with their affiliations
[x] A statement of need clearly stating problems the software is designed to solve and its target audience.
[x] References: with DOIs for all those that have one (e.g. papers, datasets, software).

There is no paper.md.

Functionality

[X] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software been confirmed.
[X] Performance: Any performance claims of the software been confirmed.
[X] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

Test results:


══ Results ═════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 24.2 s

OK: 37 Failed: 0 Warnings: 0 Skipped: 0


[X] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3.5

Review Comments

A superb and essential package--we need this data and we need it in these formats. The download-format-read-explore workflow is intuitive and relatively frictionless. I have only some brief comments:

I wonder if you could possibly merge the formatting and reading step with a raw = TRUE or format = TRUE argument in the read_* functions. But perhaps that's my tendency towards abstraction. Something like ac = read_accidents(year = 2017, format = TRUE)

My personal preference would be to have the schema from dl_schema lazily loaded with the package.

According to the vignette (https://github.com/ITSLeeds/stats19/blob/master/vignettes/stats19.Rmd), the dl_* functions are interactive, although the interactivity is commented out in the code. Will the interactivity be returning? Or does the vignette need to be updated?

Out of curiosity, what's happening with https://github.com/cyipt/stats19? It was updated recently.

I confess I wish the package name was more expressive--stats19 sounds like an introductory statistics class.

This data will be used to make many maps. I personally would love a nudge in that direction in either the README or the vignette.

sckott commented 5 years ago

thanks very much for your review @daranzolin

mpadge commented 5 years ago

Yeah, thanks @daranzolin for an impressively prompt review. I'll leave the response proper to @layik and @Robinlovelace, but will confess guilty here to having commented out the interactivity in the dl_* functions. That's all my doing and I should already have reinstated it. Apologies to all, and yes, it will be returning.

Robinlovelace commented 5 years ago

Many thanks for the quick yet detailed review @daranzolin. I'm working on it this week and will get back to you on each of those constructive comments asap. More soon...

layik commented 5 years ago

Great review @daranzolin. Appreciate it. Thanks @Robinlovelace and @mpadge. Apologies I just missed this thread for some reason.

Robinlovelace commented 5 years ago

Heads-up all (especially @daranzolin, @mpadge and @layik - the originator of interactive downloads in stats19), I've re-added interactivity, this time using menu() which is best practice for this kind of thing and fewer lines of code.

Please check out the video to see how it works here:

And if you're interested in the code used to do this, see here: https://github.com/ITSLeeds/stats19/pull/36

Feedback on behaviour or code welcome before I merge this. Many thanks @layik for implementing this 1st time around.

layik commented 5 years ago

Just going through the review @daranzolin, RE vignette, I think what you needed to do was

devtools::install(build_vignettes = TRUE)

then you would be able to load the vignette.

Robinlovelace commented 5 years ago

Heads-up we've addressed most of the issues raised by @daranzolin I think. Many thanks for a great set of comments. We're documenting them here https://github.com/ITSLeeds/stats19/issues/49 and plan to put the review responses here https://github.com/ITSLeeds/stats19/blob/master/responses1.Rmd

Not quite finished but plan to properly respond in a comment here by Saturday this week - sorry it's taken a while.

layik commented 5 years ago

Hi @daranzolin, as promised by @Robinlovelace find our responses and thank you very much for taking time to do the review and please feel free to let us know if you have any further feedback.

=====

Responses to review 1 of stats19

Thanks for the review. After making some changes and fixes to the package, we've had a chance to take in and act on each of the comments. The code-base has evolved substantially since the review, but the fundamental design of the package, with its three-stage API mirroring workflows that happened before the package was developed, remains unchanged. That is:

dl_stats19() downloads files from the DfT. Good news: we have spoken to the relevant people at the Department for Transport and they assured us that the endpoints are stable. The function now uses menu() to provide a menu of download options for any year/type combinations and now finds files outside those explicitly mentioned in the file names. E.g.:

dl_stats19(year = 2017)
# Multiple matches. Which do you want to download?
# 
# 1: dftRoadSafetyData_Vehicles_2017.zip
# 2: dftRoadSafetyData_Casualties_2017.zip
# 3: dftRoadSafetyData_Accidents_2017.zip
dl_stats19(year = 2017, type = "ac")
# Files identified: dftRoadSafetyData_Accidents_2017.zip
# 
# Wanna do it (y = enter, n = esc)? 
dl_stats19(year = 1985)
# Year not in range, changing to match 1979:2004 data
# This file is over 240 MB in size.
# Once unzipped it is over 1.8 GB.
# Files identified: Stats19-Data1979-2004.zip
# 
# Wanna do it (y = enter, n = esc)?

read_*() functions remain unchanged, except that the order of arguments has changed. As with dl_stats19(), year is now the first argument, which is more intuitive.
format_*() functions have been refactored. Each now uses format_stats19() behind the scenes reducing duplication. The results are now better: more variables are now labelled.

Note: the last two stages are now combined by default, as suggested in this review: read_*() now uses format = TRUE by default.
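For readers curious about the menu() approach mentioned for dl_stats19(), here is a minimal sketch of how a selection step of that kind could look. select_file() and its choice argument are hypothetical and written for this comment only; they are not the package's implementation. menu() itself only works in an interactive session, hence the guard.

```r
# Hypothetical helper, not the stats19 implementation.
select_file <- function(files, choice = NULL) {
  if (length(files) == 0L) stop("No matching files", call. = FALSE)
  if (length(files) == 1L) return(files)
  if (is.null(choice)) {
    if (!interactive()) {
      stop("Multiple matches: supply `choice` when not interactive",
           call. = FALSE)
    }
    choice <- utils::menu(
      files,
      title = "Multiple matches. Which do you want to download?"
    )
    # menu() returns 0 when the user exits without choosing.
    if (choice == 0L) return(NULL)
  }
  files[[choice]]
}

files <- c(
  "dftRoadSafetyData_Vehicles_2017.zip",
  "dftRoadSafetyData_Casualties_2017.zip",
  "dftRoadSafetyData_Accidents_2017.zip"
)
picked <- select_file(files, choice = 3)
```

Passing choice explicitly keeps the function usable in scripts and tests, where no prompt can be shown.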

We'll focus on areas flagged in the review for the rest of this response:

I would tease a bit more of what's in these data sets. I wasn't entirely sure until I downloaded and opened the supporting documentation. If I were searching for this kind of data, and I didn't know what STATS19 was, I'd like to know I'm in the right place after scanning the README. Maybe a map?

We have added a map (well technically 9 maps!) and a couple of time series plots showing the scale of the data. A sample of the additional casualty and vehicle tables has also been added to show more clearly the richness of data provided.

I couldn't load the vignette from the console:

We also could not see the vignette when installing using devtools::install_github(build_vignettes = TRUE). But we can see the vignette if we install locally.

This was the code we ran:

devtools::install(build_vignettes = TRUE)

## Installing stats19

vignette(package = "stats19")

Several of the examples failed:

These have now been fixed - thanks for testing and reporting.

I couldn't find any explicit contributing guidelines in the README, and there is no CONTRIBUTING document.

A CONTRIBUTING file has now been added. Thank you.

The package has an obvious research application according to JOSS's definition

There is no paper.md.

One has now been added, with:

A short summary describing the high-level functionality of the software
Authors: A list of authors with their affiliations
A statement of need clearly stating problems the software is designed to solve and its target audience.
References: with DOIs for all those that have one (e.g. papers, datasets, software).

Review Comments

A superb and essential package--we need this data and we need it in these formats. The download-format-read-explore workflow is intuitive and relatively frictionless. I have only some brief comments:

Thank you.

I wonder if you could possibly merge the formatting and reading step with a raw = TRUE or format = TRUE argument in the read_* functions. But perhaps that's my tendency towards abstraction. Something like ac = read_accidents(year = 2017, format = TRUE)

Done, appreciate your input.
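To illustrate the general pattern of a reader that formats by default, here is a minimal base R sketch. read_demo() and format_demo() are hypothetical stand-ins, the two-row CSV is made up, and the severity labels are used for illustration only; the real read_*() and format_*() functions do considerably more.

```r
# Hypothetical stand-ins for the package's readers and formatters.
format_demo <- function(x) {
  names(x) <- tolower(names(x))
  # Replace integer codes with labels (illustrative labels only).
  x$accident_severity <- factor(
    x$accident_severity,
    levels = 1:3,
    labels = c("Fatal", "Serious", "Slight")
  )
  x
}

read_demo <- function(file, format = TRUE) {
  # Keep the index as text so that it is never parsed as a number.
  raw <- utils::read.csv(file, colClasses = c(Accident_Index = "character"))
  if (format) format_demo(raw) else raw
}

# A tiny made-up file, just to exercise both code paths.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Accident_Index,Accident_Severity", "2017010001708,1", "2017010009342,3"),
  tmp
)
ac <- read_demo(tmp)                      # formatted by default
ac_raw <- read_demo(tmp, format = FALSE)  # raw codes and original names
```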

My personal preference would be to have the schema from dl_schema lazily loaded with the package.

DESCRIPTION has the line LazyData: true, which means stats19_schema is lazy loaded.

According to the vignette, the dl_* functions are interactive, although the interactivity is commented out in the code. Will the interactivity be returning? Or does the vignette need to be updated?

Back in, as stated above.

Out of curiosity, what's happening with https://github.com/cyipt/stats19? It was updated recently.

@mem48 answered this: cyipt/stats19 is not actually a proper R package. It is a repo containing scripts for the CyIPT project; it has different sources (UK DS) and usage, so there is no current need to adapt it to this package. Malcolm is one of the contributors to this package.

I confess I wish the package name was more expressive--stats19 sounds like an introductory statistics class.

This is a reasonable point that we have thought of and discussed. We are open minded about changing the name but, as with so many things, there are +s and -s (outlined for some options below):

stats19data
- clarifies that it's about data
- longer, suffers from some of the same issues that stats19 suffers from, the package is more about data formatting than data provision
roadcrashesUK
- explicit, makes region of data access transparent
- there are other types of road crash data, also the data currently provided is technically for Great Britain, but roadcrashesGB doesn't work so well, and we may want to add data access options for Northern Ireland at some point also
roadSafetyData
- Matches DfT's webpage title on the topic
- longer and, again, is less specific.

The main benefit we can see of changing the name would be making the package easier to find. We think good documentation, a clear description and some write-ups of the package and what it does could address these issues. We've explored the stats19 name and it links directly to (and is almost synonymous with) road crash data. See https://en.wikipedia.org/wiki/STATS19 for an excellent example (we plan to add this link to the README).

So we think the name is OK, but we're open minded about the alternative names mentioned above and perhaps names we've not thought of.

This data will be used to make many maps. I personally would love a nudge in that direction in either the README or the vignette.

Definitely. Thank you very much for your input.

Robinlovelace commented 5 years ago

Quick question for @sckott, what is your policy regarding CRAN submission while it's under review? We may change this in response to the answer (or we submit tomorrow!): https://github.com/ITSLeeds/stats19/milestone/1.

Look forward to your response @adamhsparks.

daranzolin commented 5 years ago

Thanks @Robinlovelace @layik @mpadge -- the new vignette and README look great. @sckott, all of my comments have been addressed.

One last observation: I'm getting parsing failures on the Accident_Index column with both read_vehicles and read_casualties:

> vehicles = read_vehicles()
Warning: 298687 parsing failures.
row       col               expected  actual                                                                                                                       file
  1 Acc_Index no trailing characters BS70001 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
  2 Acc_Index no trailing characters BS70001 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
  3 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
  4 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
  5 Acc_Index no trailing characters BS70003 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Vehicles_2009/ [... truncated]
Warning message:
The following named parsers don't match the column names: Accident_Index

> casualties = read_casualties()
Warning: 222146 parsing failures.
row       col               expected  actual                                                                                                                           file
  1 Acc_Index no trailing characters BS70001 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
  2 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
  3 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
  4 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
  5 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadS [... truncated]
Warning message:
The following named parsers don't match the column names: Accident_Index

Is it because the actual column name is acc_index instead of Accident_Index?
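If the column name is indeed the cause, one possible fix is to normalise the header before applying column types. The following is a minimal base R sketch of that idea; normalise_index_name() is a hypothetical helper, and the file contents are made up to resemble the failing rows rather than copied from the 2009 data.

```r
# Hypothetical helper: map known variants of the index column to one name.
normalise_index_name <- function(nms) {
  nms[nms %in% c("Acc_Index", "acc_index")] <- "Accident_Index"
  nms
}

# Made-up file that uses the older header spelling.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Acc_Index,Vehicle_Reference", "200901BS70001,1", "200901BS70002,1"),
  tmp
)

# Read the header alone, fix the names, then read the body with explicit types.
header <- names(utils::read.csv(tmp, nrows = 1, check.names = FALSE))
cols <- normalise_index_name(header)
types <- ifelse(cols == "Accident_Index", "character", NA)  # NA = guess the type
ve <- utils::read.csv(
  tmp,
  skip = 1,
  header = FALSE,
  col.names = cols,
  colClasses = types
)
```

Reading the index as character avoids the "no trailing characters" failures, which arise when a value that mixes digits and letters is parsed as a number.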

Robinlovelace commented 5 years ago

Thanks @daranzolin, which years do you get parsing failures for? Cannot reproduce for 2017 data, as illustrated in the reprex below:

devtools::install_github("ITSLeeds/stats19")
#> Skipping install of 'stats19' from a github remote, the SHA1 (225bce03) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(stats19)
#> Data provided under the conditions of the Open Government License.
#> If you use data from this package, mention the source
#> (Department for Transport), cite the package and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/.

# casualties
dl_stats19(year = 2017, type = "casualties")
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Data saved at /tmp/Rtmpw3fpFE/dftRoadSafetyData_Casualties_2017/Cas.csv
ca = read_casualties(year = 2017)
head(ca$accident_index)
#> [1] "2017010001708" "2017010001708" "2017010001708" "2017010009342"
#> [5] "2017010009344" "2017010009348"

# vehicles
dl_stats19(year = 2017, type = "vehicles")
#> Files identified: dftRoadSafetyData_Vehicles_2017.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Vehicles_2017.zip
#> Data saved at /tmp/Rtmpw3fpFE/dftRoadSafetyData_Vehicles_2017/Veh.csv
ve = read_vehicles(year = 2017)
head(ve$accident_index)
#> [1] "2017010001708" "2017010001708" "2017010009342" "2017010009342"
#> [5] "2017010009344" "2017010009344"

Created on 2018-12-21 by the reprex package (v0.2.1)

daranzolin commented 5 years ago

It was the 2009 datasets for vehicles and casualties. Apologies for the missing reprex.

devtools::install_github("ITSLeeds/stats19")
#> Skipping install of 'stats19' from a github remote, the SHA1 (53a82fe9) has not changed since last install.
#>   Use `force = TRUE` to force installation
#> Skipping install of 'stats19' from a github remote, the SHA1 (225bce03) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(stats19)
#> Data provided under the conditions of the Open Government License.
#> If you use data from this package, mention the source
#> (UK Department for Transport), cite the package, and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/.
#> Data provided under the conditions of the Open Government License.
#> If you use data from this package, mention the source
#> (Department for Transport), cite the package and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/.

# casualties
dl_stats19(year = 2009, type = "casualties")
#> Files identified: DfTRoadSafety_Casualties_2009.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2009.zip
#> Data saved at /var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Data saved at /tmp/Rtmpw3fpFE/dftRoadSafetyData_Casualties_2017/Cas.csv
ca = read_casualties(year = 2009)
#> Warning: The following named parsers don't match the column names:
#> Accident_Index
#> Warning: 222146 parsing failures.
#> row       col               expected  actual                                                                                                                           file
#>   1 Acc_Index no trailing characters BS70001 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
#>   2 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
#>   3 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
#>   4 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
#>   5 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
#> ... ......... ...................... ....... ..............................................................................................................................
#> See problems(...) for more details.
head(ca$accident_index)
#> Warning: Unknown or uninitialised column: 'accident_index'.
#> NULL

# vehicles
dl_stats19(year = 2009, type = "vehicles")
#> Files identified: DfTRoadSafety_Vehicles_2009.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Vehicles_2009.zip
#> Data saved at /var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv
#> Files identified: dftRoadSafetyData_Vehicles_2017.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Vehicles_2017.zip
#> Data saved at /tmp/Rtmpw3fpFE/dftRoadSafetyData_Vehicles_2017/Veh.csv
ve = read_vehicles(year = 2009)
#> Warning: The following named parsers don't match the column names:
#> Accident_Index
#> Warning: 298687 parsing failures.
#> row       col               expected  actual                                                                                                                       file
#>   1 Acc_Index no trailing characters BS70001 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
#>   2 Acc_Index no trailing characters BS70001 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
#>   3 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
#>   4 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
#>   5 Acc_Index no trailing characters BS70003 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
#> ... ......... ...................... ....... ..........................................................................................................................
#> See problems(...) for more details.
head(ve$accident_index)
#> Warning: Unknown or uninitialised column: 'accident_index'.
#> NULL

Created on 2018-12-21 by the reprex package (v0.2.1)

layik commented 5 years ago

Hi David. I will have a look at 2009 and see what the issue is and report.

Robinlovelace commented 5 years ago

Hi @daranzolin thanks loads for reporting. I think it's fixed on this branch which I'm planning to merge, as demonstrated in this reprex:

devtools::install_github("ITSLeeds/stats19", ref = "index")
library(stats19)
#> Data provided under the conditions of the Open Government License.
#> If you use data from this package, mention the source
#> (UK Department for Transport), cite the package, and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/.
dl_stats19(year = 2009, type = "casualties")
#> Files identified: DfTRoadSafety_Casualties_2009.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2009.zip
#> Data saved at /tmp/RtmpFsdiJv/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv
ca = read_casualties(year = 2009)
head(ca$accident_index)
#> [1] "200901BS70001" "200901BS70002" "200901BS70002" "200901BS70002"
#> [5] "200901BS70002" "200901BS70002"
dl_stats19(year = 2009, type = "vehicles")
#> Files identified: DfTRoadSafety_Vehicles_2009.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Vehicles_2009.zip
#> Data saved at /tmp/RtmpFsdiJv/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv
ve = read_vehicles(year = 2009)
head(ve$accident_index)
#> [1] "200901BS70001" "200901BS70001" "200901BS70002" "200901BS70002"
#> [5] "200901BS70003" "200901BS70003"

Created on 2018-12-23 by the reprex package (v0.2.1)
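The parsing failures in the original report are consistent with the index column being read as a number: values such as 200901BS70001 start with digits, so a numeric parser stops at "BS70001" and reports trailing characters. The following is a minimal sketch of one way to avoid that, and not necessarily what the index branch does. The toy CSV, the column specification and the renaming step are all assumptions made for illustration.

```r
library(readr)

# Toy file mimicking the 2009 layout, where the index column is
# called Acc_Index rather than Accident_Index (values from the reprex)
csv_file = tempfile(fileext = ".csv")
writeLines(c(
  "Acc_Index,Vehicle_Reference",
  "200901BS70001,1",
  "200901BS70002,1"
), csv_file)

# Read the index as character so IDs such as 200901BS70001 are kept whole
ve_toy = read_csv(
  csv_file,
  col_types = cols(
    Acc_Index = col_character(),
    Vehicle_Reference = col_integer()
  )
)

# Harmonise the name so downstream code can rely on accident_index
names(ve_toy)[names(ve_toy) == "Acc_Index"] = "accident_index"
head(ve_toy$accident_index)
```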

Robinlovelace commented 5 years ago

Update: index branch has now been merged.

sckott commented 5 years ago

@adamhsparks your review is due in 1 week, thanks very much

sckott commented 5 years ago

@Robinlovelace we usually don't have maintainers submit to CRAN during the review, but if it's already on CRAN and there is a bug fix patch, then it seems justifiable to submit to CRAN during review

adamhsparks commented 5 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be auto-generated via Authors@R).

For packages co-submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[x] A short summary describing the high-level functionality of the software

[x] Authors: A list of authors with their affiliations

[x] A statement of need clearly stating problems the software is designed to solve and its target audience.

[x] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 7.5

Review Comments

Clearly much effort has gone into this package. I greatly support the sentiment behind making these data available in R having done the same for other data sets, myself. This package should be a great benefit to researchers in this area. I really appreciate the slim dependencies. It will make this package much easier to maintain into the future.

I found the package to be well documented, the vignette is helpful in illustrating some use cases for the data along with how to access it and the code is clear.

Some of the functionality I find to be mildly confusing like downloading and then importing the files into the R session and then formatting. As a user I'd prefer it all in one step, but there are likely use cases I'm not aware of that mean that this is useful so some examples of this would be useful I think.

My general comments on the code follow, with sections for the DESCRIPTION and Vignette as well. I've commented quite a bit on grammar and spelling as I think that the polish of a package is important as it lends to the perception of the quality.

Thanks for inviting me to review this. It's been fun and I've learned some new things.

- Per rOpenSci policy, avoid start-up messages, rely on the documentation for citation information: https://ropensci.github.io/dev_guide/building.html#recommended-scaffolding.

- Avoid long lines >80 chars

- Inconsistent use of white spaces in code, see `find_file_name()` if statements for examples.

- The package does not pass R CMD check. curl, readxl and tibble are all listed as an Imports in DESCRIPTION but not imported from. With curl being used in tests, this means it should be in Suggests, I think. The others should be removed.

- I don't think it's good form to have an example that won't work on Windows in the help file for `stats19_schema`, from data.R - line 17? Most of what I see there would be better served in a data_raw folder showing how the data were created with the documentation actually documenting what the variables are not how they were created, see http://r-pkgs.had.co.nz/data.html and for an example, https://github.com/ropensci/GSODR/tree/master/data-raw.

- I would suggest using proper formatting in help files, when naming packages, e.g. `\pkg{stats19}` and when referring to documented functions or data, e.g. `\code{\link{stats19_schema}}`, or with single quotes around abbreviations, e.g. 'DfT'. @ColinFay has an excellent page that outlines the formatting options and when/how to use them, https://colinfay.me/writing-r-extensions/writing-r-documentation-files.html. This will greatly enhance the users' experience when using the help files by making them more readable.

- I also would suggest making use of `@seealso` in documentation. For example, the `dl_stats19()` example works great in the help files, but from there I have the data but it's not in R. Using the `@seealso` you can let the user know about the `read_*()` functions.

- I downloaded files using `dl_stats19()`, selecting "Casualties", and then ran `read_accidents()` and got
```
Error in check_input_file(filename = filename, type = "accidents", data_dir = data_dir,  : 
Change data_dir, filename, year or run dl_stats19() first.
```
Is it possible to be more descriptive and say that I've used the wrong `read_*()` based on the file/data found and offer to import it?

- Missing "." after "e.g." in dl.R on lines 8 and 9, there may be others that I didn't spy.

- Capitalisation in help files is inconsistent, e.g. lines 123-125 of read.R, parameter descriptions are mixed upper and lower case for first word after parameter itself. This applies to other functions where the descriptions are given in all lower case for other functions or upper case.

- Testing the functionality, I get this, when I expect it to tell me that deaths is not a valid input. But then when I hit escape, I expect it to simply exit, not provide a warning message on the way out as well.
```
dl_stats19(year = 1979, type = "deaths")
No files of that type found for that year.
This will download 240 MB+ (1.8 GB unzipped).
Files identified: Stats19-Data1979-2004.zip

Download now (y = enter, n = esc)?

Warning message:
In find_file_name(years = year, type = type) :
  Coordinates unreliable in this data.
```

- I got caught out when using the interactive features. I read "y = enter" but hit "y" thinking that would work as well as hitting "enter", but R cancelled the operation anyway just as if I'd hit "esc"

- Per a recent conversation with CRAN, you should use `donttest()` rather than `dontrun()` for examples you don't want to be run on CRAN. Then set .travis.yml to run them by using `r_check_args: --as-cran --run-donttest`. **This may not be appropriate in all cases, e.g. interactive functions.**

- When validating user inputs and using `stop()` it's nice to use `call. = FALSE` to simplify the error message that the user receives.

- Consider using [`hoardr`](https://ropensci.github.io/hoardr/) for managing user-saved files on disk that aren't in `tempdir()`?

- When using `utils::download.file()`, you should use `mode = "wb"` or Windows users may end up with corrupted downloads in my experience. `curl::curl_download()` does the same thing but uses more updated ways of doing it and defaults to using a binary mode (wb).

- I don't think that there is much need for the `Attempting download from` or `Reading in: ` message. If it takes that long, I would suggest to use a progress bar to show progress. But this is just a personal observation.

- Consider setting up a `pkgdown` site? It's easy to do and you can automate deployment with your Travis-CI so it's less to remember.

#### Tests

- I'm unclear how the interactive portion of the package functions is handled in testing? There are ways to handle this, but I don't see any implemented and when I run `devtools::test()` I'm asked to provide my own input.

- Suggest using `skip_on_cran()` since some of the tests can take some time to execute due to download times.

#### DESCRIPTION File

- In the DESCRIPTION file, Mark's author entry is missing his ORCID.

- More information in the DESCRIPTION's Description field would be desirable, a link to the data's website or other information to give more background perhaps.

- STATS19 should be in "'" in DESCRIPTION for CRAN, i.e., 'STATS19', I think.

- Check spelling in DESCRIPTION file, see: "analysie".

- The Description should include a link to the DfT website.

- Language field should be set, `Language: en-GB`.

#### README File(s)

- Use `remotes::install_github()` in place of `devtools::install_github()` in README.

- The code style is inconsistent in the README.Rmd file in the code chunks, e.g. line 85 is missing space around `=`.

- The example in the README showing two steps seems unnecessarily confusing to new users. If there is a good reason for having the raw data in R, document in a vignette why this is useful and show the two-step process, but if normal users won't do this, I wouldn't show it in the quick-start.

- Line 43 of README uses inconsistent "(" around the phrases with the other two `read_*` function description.

#### Vignette

- Run spell-check on it.

- The term "attach" has a specific meaning in R. Suggest rewording the portion about installation and loading the package to omit the use of "attach", since you're not using `attach()` in the R sense (and really shouldn't use it anyway).

- I would describe why a user might want or need to install the Development version from GitHub in the vignette. Presumably if they are reading the vignette, they've already installed the package from CRAN (in the future).

- Try to consistently use `function()` to identify functions in the vignette text. This also means that if/when you use pkgdown to build a site, the functions are linked to the help file.

- In the introduction, the description of why there are `read_*()` and `format_*()` functions is confusing. To me, it reads as if `format` is only a parameter for `read_*()` in the introduction. I was left wondering why it's documented there or why the `format_*()`s even exist until I reached the end of the vignette.

- There is a comma out of place in Vignette,

>Format: Each of the read_*() functions has a format parameter which, when TRUE, adds

should be 

>Format: Each of the read_*() functions has a format parameter, which, when TRUE, adds 

- I'm unsure about including a package that's not on CRAN in the vignette (`ukboundaries`), something like this should be listed in Suggests, but it's not on CRAN, @sckott do you have any thoughts?

- The first figures in the `sf` section after the join aren't immediately clear to me. The axis lack labels, I'm not really sure what I'm looking at.

#### Meta

- The contributing guidelines mention a `pkgdown` website, this does not exist.

- The JOSS paper needs spelling check run on it to correct spelling errors.

This review is also hosted as a repository on my GitHub, https://github.com/adamhsparks/stats19-review

adamhsparks commented 5 years ago

Also, as I've been working on this, I've modified the DESCRIPTION file to fix some errors, and address most of the points I raised. I've also run spellcheck on the vignette if you'd accept a PR.

sckott commented 5 years ago

thanks for your review @adamhsparks !

layik commented 5 years ago

That is a great review @adamhsparks. Thank you. Thanks @sckott appreciate your follow up!

adamhsparks commented 5 years ago

Sorry, I missed the paper.md in the root dir. I've updated my review to reflect this.

Robinlovelace commented 5 years ago

Hi @adamhsparks no problem, and just saw the PR. Will look now. Meanwhile, I've put our work-in-progress response to your review here: https://github.com/ITSLeeds/stats19/issues/63 - any questions, please ask. Very grateful for the level of detail and knowledge in the review. Thank you!

Robinlovelace commented 5 years ago

One question, probably more for @sckott: does JOSS accept figures and code chunks? Thinking that could liven it up a bit...

Robinlovelace commented 5 years ago

OK just merged those 2 mini PRs. That raises another question: are both reviewers happy to be in the author list with the new rev tag? More info here: https://ropensci.org/blog/2018/03/16/thanking-reviewers-in-metadata/

adamhsparks commented 5 years ago

Hi @Robinlovelace, happy to be listed as a reviewer, "rev", and contributor, "ctb", for the pull requests you've accepted. My ORCID: 0000-0002-0061-8359.

daranzolin commented 5 years ago

@Robinlovelace rev is great for me! Thanks for asking.

Robinlovelace commented 5 years ago

Great, I've added both reviewers in https://github.com/ITSLeeds/stats19/pull/67/files

sckott commented 5 years ago

@Robinlovelace will ask, not sure about the JOSS question

sckott commented 5 years ago

AFAICT figures can be included https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain - it doesn't say anything about code chunks, but you could try

Robinlovelace commented 5 years ago

Happy to say, we've finally responded to all the points made by @adamhsparks. We've learned a lot in the process, and have added an auto deploy website (see https://itsleeds.github.io/stats19/ ), plus various current best practice things like donttest that we didn't know about. Thanks loads!

Any further comments/suggestions: very welcome.

Overall, we found all the suggestions sensible, linking to previous discussions about combining the 3 stage (dl, read, format) process into a single function call, which we were thinking of calling get_stats19(). The reason for splitting the process up is to ensure maximum transparency and to give the user control over what the package is doing. However, as long as it is properly documented, we think the benefits of a get_stats19() function will outweigh any possible negatives we can think of. get_stats19() has now been added!

[x] Clearly much effort has gone into this package. I greatly support the sentiment behind making these data available in R having done the same for other data sets, myself. This package should be a great benefit to researchers in this area. I really appreciate the slim dependencies. It will make this package much easier to maintain into the future.

Thanks for the comments, we have indeed tried to keep dependencies to a minimum but consider readr worthwhile. readxl and tibble have been removed. curl has been demoted to Suggests, as detailed in another comment.

[x] I found the package to be well documented, the vignette is helpful in illustrating some use cases for the data along with how to access it and the code is clear.

Thanks. If you think of other ways we can communicate the value of the data, do let us know (I think the second mapping figure could be improved...).

[x] Some of the functionality I find to be mildly confusing like downloading and then importing the files into the R session and then formatting. As a user I'd prefer it all in one step, but there are likely use cases I'm not aware of that mean that this is useful so some examples of this would be useful I think.

We have long been planning to add a get_stats19() function, as per https://github.com/ITSLeeds/stats19/issues/11. The review comment, combined with further discussion, has prompted us to re-prioritise it. It has been beneficial to polish each of the component functions first, however, and it is good to document each stage for maximum transparency, so we plan to keep the dl, read and format functions exported.
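To illustrate the design being discussed, here is a minimal sketch of a one-call wrapper layered on top of three separate stages. It is not the package's get_stats19() implementation: `dl_stub()`, `read_stub()`, `format_stub()` and `get_sketch()` are hypothetical stand-ins that touch no real data, and the `format` argument is an assumption.

```r
# Hypothetical stand-ins for the download, read and format stages
dl_stub = function(year, type) {
  file.path(tempdir(), paste0(type, "_", year, ".csv"))
}
read_stub = function(path) {
  data.frame(Accident_Index = "200901BS70001", path = path,
             stringsAsFactors = FALSE)
}
format_stub = function(x) {
  names(x) = tolower(names(x))
  x
}

# One-call wrapper: each stage stays a separate function, so it can
# still be called on its own when more control is needed
get_sketch = function(year, type, format = TRUE) {
  path = dl_stub(year = year, type = type)
  x = read_stub(path)
  if (format) x = format_stub(x)
  x
}

ca_toy = get_sketch(year = 2009, type = "casualties")
names(ca_toy)
```

Keeping the stages exported means the wrapper adds convenience without hiding what happens at each step.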

[x] My general comments on the code follow, with sections for the DESCRIPTION and Vignette as well. I've commented quite a bit on grammar and spelling as I think that the polish of a package is important as it lends to the perception of the quality.

Agreed.

[x] Per rOpenSci policy, avoid start-up messages, rely on the documentation for citation information: https://ropensci.github.io/dev_guide/building.html#recommended-scaffolding.

The guidance is to 'Only use package startup messages when necessary'. A case can be made that this is necessary. As with osmdata, the package provides access to data that has a license that requires it to be cited. The osmdata load message is as follows:

library(osmdata)

We fully agree with the reasoning behind removing package startup messages, however. As a compromise, we've shortened the startup message from 4 lines to 2:

# before:
# Data provided under the conditions of the Open Government License.
# If you use data from this package, mention the source
# (UK Department for Transport), cite the package, and link to:
# www.nationalarchives.gov.uk/doc/open-government-licence/version/3/.

# after:
library(stats19)

Data provided under OGL v3.0. Cite the source and link to:
www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
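For readers wondering how a short load message like this can be kept unobtrusive, the sketch below shows the usual base R mechanism. It is an assumption that the package does something similar. The point is that `packageStartupMessage()`, unlike `message()`, can be silenced by the user with `suppressPackageStartupMessages()`.

```r
# Sketch of what could live in a package's zzz.R file
.onAttach = function(libname, pkgname) {
  packageStartupMessage(
    "Data provided under OGL v3.0. Cite the source and link to:\n",
    "www.nationalarchives.gov.uk/doc/open-government-licence/version/3/"
  )
}

# Users who do not want the note can then load the package with:
# suppressPackageStartupMessages(library(stats19))
```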

[x] Avoid long lines >80 chars

Running goodpractice::gp() found the following lines with > 80 characters:

    R/format.R:62:1
    R/format.R:67:1
    R/read.R:141:1
    R/utils.R:167:1

All these have been fixed.
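A quick base R check for the same thing, useful when goodpractice is not installed, might look like the sketch below. `long_lines()` is a hypothetical helper and the toy file is made up for the example.

```r
# Return the line numbers in a file that exceed a width limit
long_lines = function(path, limit = 80) {
  which(nchar(readLines(path, warn = FALSE)) > limit)
}

# Toy file: only the second line is longer than 80 characters
toy_file = tempfile(fileext = ".R")
writeLines(c("x = 1", strrep("#", 81), "y = 2"), toy_file)
long_lines(toy_file)
```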

[x] Inconsistent use of white spaces in code, see find_file_name() if statements for examples.
[x] The package does not pass R CMD check. curl, readxl and tibble are all listed as an Imports in DESCRIPTION but not imported from. With curl being used in tests, this means it should be in Suggests, I think. The others should be removed.

Kindly make sure you build the package before running check on the resulting `.tar.gz` archive.

curl is used in the tests and readxl is used in the examples. These have been demoted to Suggests. tibble has been removed from the DESCRIPTION file.

[x] I don't think it's good form to have an example that won't work on Windows in the help file for stats19_schema, from data.R - line 17? Most of what I see there would be better served in a data_raw folder showing how the data were created with the documentation actually documenting what the variables are not how they were created, see http://r-pkgs.had.co.nz/data.html and for an example, https://github.com/ropensci/GSODR/tree/master/data-raw.

See https://github.com/ITSLeeds/stats19/pull/68

[x] I would suggest to use proper formatting in help files, when naming packages, e.g. \pkg{stats19} and when referring to documented functions or data, e.g. \code{\link{stats19_schema}}, or with single quotes around abbreviations, e.g. 'DfT'. @ColinFay has an excellent page that outlines the formatting options and when/how to use them, https://colinfay.me/writing-r-extensions/writing-r-documentation-files.html. This will greatly enhance the users' experience when using the help files by making them more readable.

Acronyms have been explained and better formatting of 'STATS19' (note, it's capitalised when referring to the data, but lower case when referring to the package), and a few other improvements to the docs, have been made. See https://github.com/ITSLeeds/stats19/commit/e4ab09aa716630322aadd26dee3465258a2e58a0

[x] I also would suggest making use of @seealso in documentation. For example, the dl_stats19() example works great in the help files, but from there I have the data but it's not in R. Using the @seealso you can let the user know about the read_*() functions.

@seealso tags have now been added to a few key documentation pages.

[x] I downloaded files using dl_stats19(), selecting "Casualties", and then ran read_accidents() and got Is it possible to be more descriptive and say that I've used the wrong read_*() based on the file/data found and offer to import it?

Yes, but I think the costs of doing this would outweigh the benefits (RL)
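For reference, a more descriptive message could in principle be built from the file name alone. The sketch below is a hypothetical helper, not part of the package, and it assumes file names contain the table type, which holds for names such as DfTRoadSafety_Casualties_2009.csv but not for every file seen in this thread (e.g. Veh.csv).

```r
# Hypothetical helper: guess the matching read_*() function from a file name
suggest_reader = function(filename) {
  types = c("accidents", "casualties", "vehicles")
  hit = types[vapply(types, grepl, logical(1), x = filename, ignore.case = TRUE)]
  # Give up when zero or several types match
  if (length(hit) != 1) return(NA_character_)
  paste0("read_", hit, "()")
}

suggest_reader("DfTRoadSafety_Casualties_2009.csv")
suggest_reader("Veh.csv")
```

The second call returns NA, which hints at the maintenance cost mentioned above: abbreviated file names would each need their own rule.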

[x] Missing "." after "e.g." in dl.R on lines 8 and 9, there may be others that I didn't spy.

Fixed in https://github.com/ITSLeeds/stats19/commit/777479b9234be893a9946c787ef23cc08ff6a41a

[x] Capitalisation in help files is inconsistent, e.g. lines 123-125 of read.R, parameter descriptions are mixed upper and lower case for first word after parameter itself. This applies to other functions where the descriptions are given in all lower case for other functions or upper case.

Fixed in https://github.com/ITSLeeds/stats19/commit/aec13fc0673717d68a3eb159c066d76a8d0645ca

[x] Testing the functionality, I get this, when I expect it to tell me that deaths is not a valid input. But then when I hit escape, I expect it to simply exit, not provide a warning message on the way out as well.

dl_stats19(year = 1979, type = "deaths")
No files of that type found for that year.
This will download 240 MB+ (1.8 GB unzipped).
Files identified: Stats19-Data1979-2004.zip

Download now (y = enter, n = esc)? 

Warning message:
In find_file_name(years = year, type = type) :
  Coordinates unreliable in this data.

Now produces the following output: Error in dl_stats19(year = 1979, type = "deaths") : Stopping as requested

[x] I got caught out when using the interactive features. I read "y = enter" but hit "y" thinking that would work as well as hitting "enter", but R cancelled the operation anyway just as if I'd hit "esc"

Agreed, that was an issue. Fixed in the latest version (tested with > stats19::dl_stats19(year = 1979))
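One way to make such a prompt forgiving is to normalise the reply before testing it. This is a sketch only. `is_yes()` is a hypothetical helper and the accepted replies are an assumption, not a description of the fix in the package.

```r
# Hypothetical helper: treat an empty reply (enter), "y" and "yes" as consent
is_yes = function(reply) {
  tolower(trimws(reply)) %in% c("", "y", "yes")
}

# In an interactive session the reply would come from readline(), e.g.
# reply = readline("Download now (y = enter, n = esc)? ")
is_yes("y")
is_yes("")
is_yes("n")
```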

[x] Per a recent conversation with CRAN, you should use donttest() rather than dontrun() for examples you don't want to be run on CRAN. Then set .travis.yml to run them by using r_check_args: --as-cran --run-donttest. This may not be appropriate in all cases, e.g. interactive functions.

Trying this, as per: https://github.com/ITSLeeds/stats19/pull/69

[x] When validating user inputs and using stop() it's nice to use call. = FALSE to simplify the error message that the user receives.

Added as requested. See https://github.com/ITSLeeds/stats19/pull/70
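A small illustration of what `call. = FALSE` changes, using a hypothetical validator (`check_year()` is not a function in the package):

```r
# Hypothetical validator, used only to show the effect of call. = FALSE
check_year = function(year) {
  if (!is.numeric(year)) {
    stop("year must be numeric, e.g. 2009", call. = FALSE)
  }
  invisible(year)
}

# With call. = FALSE the condition carries no call, so the message is
# printed as "Error: ..." rather than "Error in check_year(...) : ..."
err = tryCatch(check_year("2009"), error = function(e) e)
is.null(conditionCall(err))
```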

[x] Consider using hoardr for managing user-saved files on disk that aren't in tempdir()?

We have considered this and have taken a good look at the package. In this case, however, I don't think it's appropriate: saving the 1/4 GB .zip file associated with the 1979:2005 data, for example, does not make sense when it can be extracted and saved in a more useful format. We assume the user will extract and save the data they use. Furthermore, the DfT is trialling an API which should make the desire to save intermediate files even rarer. Users are told where the files are saved, so are free to move them where they like.

[x] When using utils::download.file(), you should use mode = "wb" or Windows users may end up with corrupted downloads in my experience. curl::curl_download() does the same thing but uses more updated ways of doing it and defaults to using a binary mode (wb).

We have tested on Windows and have not found any issues with the dl_stats19() function - so treating this as a 'wontfix', unless a strong reason can be found for adding this.
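For anyone who does want to be explicit about it, requesting binary mode is a one-argument change. The sketch below uses a hypothetical wrapper, `download_binary()`, and checks it against a local file so that no network access is needed. The `file://` URL construction assumes a Unix-like absolute path.

```r
# Hypothetical wrapper: always request binary mode so the transfer does
# not depend on platform defaults
download_binary = function(url, destfile) {
  utils::download.file(url = url, destfile = destfile, mode = "wb", quiet = TRUE)
  invisible(destfile)
}

# Local check: copy a few raw bytes (including CR and LF) via a file:// URL
src = tempfile(fileext = ".bin")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x0d, 0x0a, 0x1a)), src)
src_url = paste0("file://", normalizePath(src, winslash = "/"))
dest = download_binary(src_url, tempfile(fileext = ".bin"))

# The copy should be byte-for-byte identical to the source
identical(readBin(src, "raw", 10), readBin(dest, "raw", 10))
```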

[x] I don't think that there is much need for the Attempting download from or Reading in: message. If it takes that long, I would suggest to use a progress bar to show progress. But this is just a personal observation.

There are read and download progress bars for large datasets. Try this, for example:

dl_stats19(year = 1979)

[x] Consider setting up a pkgdown site? It's easy to do and you can automate deployment with your Travis-CI so it's less to remember.

A site has been added: https://itsleeds.github.io/stats19/

I'm testing auto-deploy on travis and will happily merge this branch if it works (RL): https://github.com/ITSLeeds/stats19/pull/71

Update: auto deploy failed, so we are reverting that change to .travis.yml. It was pkgdown that was making it fail. See https://travis-ci.org/ITSLeeds/stats19/builds/475415270#L1622 And: https://ropenscilabs.github.io/travis/

Is this an rOpenSci package issue, and is it worth asking for support?

Tests

[x] I'm unclear how the interactive portion of the package functions is handled in testing? There are ways to handle this, but I don't see any implemented and when I run devtools::test() I'm asked to provide my own input.

When R is running non-interactively it auto-downloads the necessary files or produces an error. All tests involving downloads work on our computers and Travis. They can be turned off for the duration of your session with:

Sys.setenv(DONT_DOWNLOAD_ANYTHING = "true")

Not sure if that answers the question - any further feedback on this: welcome.
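As a rough sketch of the logic described here (the helper names are hypothetical and the package's internals may differ), the decision could be expressed as two small functions: one that honours the opt-out variable and one that only prompts in interactive sessions.

```r
# Hypothetical helper: downloads are allowed unless the user opted out
downloads_allowed = function() {
  !identical(Sys.getenv("DONT_DOWNLOAD_ANYTHING"), "true")
}

# Hypothetical helper: only ask for confirmation when someone can answer
should_prompt = function() {
  interactive() && downloads_allowed()
}

Sys.setenv(DONT_DOWNLOAD_ANYTHING = "true")
downloads_allowed()   # FALSE while the variable is set to "true"
Sys.unsetenv("DONT_DOWNLOAD_ANYTHING")
downloads_allowed()   # TRUE once it is unset
```

Under this arrangement a test run never waits for input: it either downloads without prompting or, with the variable set, skips downloading.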

[x] Suggest using skip_on_cran() since some of the tests can take some time to execute due to download times.

Added in https://github.com/ITSLeeds/stats19/pull/72

DESCRIPTION File

See fixes here: https://github.com/ITSLeeds/stats19/commit/fdeb634dbab5a39ab599b35693623d0b1c94f71d

[x] In the DESCRIPTION file, Mark's author entry is missing his ORCID.
[x] More information in the DESCRIPTION's Description field would be desirable, a link to the data's website or other information to give more background perhaps.
[x] STATS19 should be in "'" in DESCRIPTION for CRAN, i.e., 'STATS19', I think.
[x] Check spelling in DESCRIPTION file, see: "analysie"
[x] The Description should include a link to the DfT website.
[x] Language field should be set, Language: en-GB

README File(s)

[x] Use remotes::install_github() in place of devtools::install_github() in README.
[x] The code style is inconsistent in the README.Rmd file in the code chunks, e.g. line 85 is missing space around =.
[x] The example in the README showing two steps seems unnecessarily confusing to new users. If there is a good reason for having the raw data in R, document in a vignette why this is useful and show the two-step process, but if normal users won't do this, I wouldn't show it in the quick-start.
[x] Line 43 of README uses inconsistent "(" around the phrases with the other two read_* function description.

Vignette

[x] Run spell-check on it.
[x] The term "attach" has a specific meaning in R. Suggest rewording the portion about installation and loading the package to omit the use of "attach", since you're not using attach() in the R sense (and really shouldn't use it anyway).
[x] I would describe why a user might want or need to install the Development version from GitHub in the vignette. Presumably if they are reading the vignette, they've already installed the package from CRAN (in the future).
[x] Try to consistently use function() to identify functions in the vignette text. This also means that if/when you use pkgdown to build a site, the functions are linked to the help file.
[x] In the introduction, the description of why there are read_*() and format_*() functions is confusing. To me, it reads as if format is only a parameter for read_*() in the introduction. I was left wondering why it's documented there or why the format_*()s even exist until I reached the end of the vignette.
[x] There is a comma out of place in Vignette,

>Format: Each of the read_*() functions has a format parameter which, when TRUE, adds

should be

>Format: Each of the read_*() functions has a format parameter, which, when TRUE, adds
[x] I'm unsure about including a package that's not on CRAN in the vignette (ukboundaries), something like this should be listed in Suggests, but it's not on CRAN, @sckott do you have any thoughts?

This is a good point. Fixed, by adding a much more useful dataset, representing the jurisdictions of police forces across England and Wales: police_boundaries.

[x] The first figures in the sf section after the join aren't immediately clear to me. The axis lack labels, I'm not really sure what I'm looking at.

Agreed. The maps were not particularly clear. An overview map has been added to the vignette. For context, and to ensure README-vignette harmony, we also changed the README's mapping section. It now also contains an overview map so people know what they are looking at, but not the facetted map, which was confusing.

Response to 2nd round of reviews

Many thanks for providing follow-up comments, they have been really useful and led to further improvements in the package. All the changes in response to them can be found here: https://github.com/ITSLeeds/stats19/compare/3f38d4cd9144497a0884c1c49c06f680b674acb7...master

Specifically:

Use spelling::spell_check_package() and correct the errors
- See https://github.com/ITSLeeds/stats19/issues/74
- See https://github.com/ITSLeeds/stats19/commit/1d83af97d12c148d2ddd07d6753a0bebdfaff7e8
In the data.R file there are several examples that say something like: "The file names were generated as follows:" IMO any example like this needs to be in the data-raw in Rmd files and only the data in the final package with proper documentation as to what the data are and a link to where they were obtained from. A link to the data-raw files would be appropriate in the documentation for each data set.
- The examples have been changed to examples of using the data rather than getting the data.
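For the spell-check item above, here is a minimal sketch of what the check looks like on a single string, assuming the spelling package and its default en_US dictionary are installed. The text is shortened from the package DESCRIPTION, and the object names txt and typos are only illustrative.

```r
library(spelling)

# Text shortened from the package DESCRIPTION, which contains the typo "analysie"
txt = "Tools to help process and analysie the data"

# spell_check_text() returns a data frame of words not found in the dictionary
typos = spell_check_text(txt, lang = "en_US")
print(typos$word)

# For a whole package, run from the package root (not executed here):
# spelling::spell_check_package(".")
# spelling::update_wordlist(".")  # accept the remaining words into inst/WORDLIST
```

Running the check on a string keeps the example independent of any package directory; the commented calls are the package-level equivalents.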

The code to get police_boundaries (now in data-raw) has been updated to be more robust. It worked on my Ubuntu machine, but I suspect it failed on Mac. If so, that may be one for the sf issue tracker. In any case, here's a reprex from the new code that should work on any OS:

library(sf)
#> Linking to GEOS 3.7.0, GDAL 2.3.2, PROJ 5.2.0
u = "https://opendata.arcgis.com/datasets/3e5a096a8c7c456fb6d3164a3f44b005_3.geojson"
police_boundaries_wgs = sf::st_read(u)
#> Reading layer `3e5a096a8c7c456fb6d3164a3f44b005_3' from data source `https://opendata.arcgis.com/datasets/3e5a096a8c7c456fb6d3164a3f44b005_3.geojson' using driver `GeoJSON'
#> Simple feature collection with 43 features and 9 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -6.402838 ymin: 49.86709 xmax: 1.761138 ymax: 55.81109
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
names(police_boundaries_wgs)
#>  [1] "objectid"       "pfa16cd"        "pfa16nm"        "bng_e"         
#>  [5] "bng_n"          "long"           "lat"            "st_areashape"  
#>  [9] "st_lengthshape" "geometry"
police_boundaries = st_transform(police_boundaries_wgs, 27700)
names(police_boundaries)
#>  [1] "objectid"       "pfa16cd"        "pfa16nm"        "bng_e"         
#>  [5] "bng_n"          "long"           "lat"            "st_areashape"  
#>  [9] "st_lengthshape" "geometry"
police_boundaries = police_boundaries[c("pfa16cd", "pfa16nm")]

Created on 2019-01-09 by the reprex package (v0.2.1)
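The two processing steps in the reprex above (reprojecting to EPSG:27700 and keeping only the code and name columns) can also be tried offline. This is a small sketch on made-up data, not the real boundaries: the object names toy_wgs and toy_bng, the attribute values, and the approximate point coordinates are all hypothetical stand-ins for the downloaded polygons.

```r
library(sf)

# Hypothetical toy data in WGS84 (EPSG:4326): two points with made-up
# attribute values, standing in for the police force polygons
toy_wgs = st_sf(
  objectid = 1:2,
  pfa16cd = c("X1", "X2"),
  pfa16nm = c("Force A", "Force B"),
  geometry = st_sfc(
    st_point(c(-0.13, 51.51)),
    st_point(c(-1.55, 53.80)),
    crs = 4326
  )
)

# Reproject to the British National Grid, as in the reprex
toy_bng = st_transform(toy_wgs, 27700)

# Subsetting by column name keeps the geometry column, which is "sticky" in sf
toy_bng = toy_bng[c("pfa16cd", "pfa16nm")]

names(toy_bng)
st_crs(toy_bng) == st_crs(27700)
```

Because the geometry column survives the subset, there is no need to list it explicitly, which is why the reprex selects only pfa16cd and pfa16nm.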

I think that's everything. Any further feedback is very welcome.

layik commented 5 years ago

Thanks @Robinlovelace. Thanks @adamhsparks, @sckott and @daranzolin one more time from me.

adamhsparks commented 5 years ago

Looks great! I've amended my review and recommend accepting this package to rOpenSci now.

ropensci / software-review

stats19 #266
