ropensci / software-review

rOpenSci Software Peer Review.
292 stars 104 forks source link

stats19 #266

Closed Robinlovelace closed 5 years ago

Robinlovelace commented 5 years ago

Summary

The goal of stats19 is to make it easy to work with road crash data. Specifically it enables access to and processing of the UK’s official road traffic casualty database, which is called STATS19.

Package: stats19
Title: Work with open road traffic casualty data from Great Britain
Version: 0.1.0.9000
Authors@R: c(
    person("Robin", "Lovelace", email = "rob00x@gmail.com", role = c("aut", "cre"),
    comment = c(ORCID = "0000-0001-5679-6536")),
    person("Malcolm", "Morgan", email = "m.morgan1@leeds.ac.uk", role = c("aut")),
    person("Layik", "Hama", email = "layik.hama@gmail.com", role = c("aut"))
  )
Description: Tools to help process and analysie the UK road collision data also known as Stats19.
Depends: R (>= 3.5.0)
License: GPL-3
Encoding: UTF-8
LazyData: true
Imports: 
    sf,
    foreign,
    readr,
    dplyr,
    lubridate,
    readxl
Suggests: 
    knitr,
    rmarkdown,
    testthat
VignetteBuilder: knitr
RoxygenNote: 6.1.1

Academic, industry and public sector researchers investigating road safety, specifically people wanting to perform natural experimental studies to find out how best to make the transport system safer. It will also be of interest for people interested in large point-on-network data for methodological applications.

Previous packages/code-bases:

Requirements

Confirm each of the following by checking the box. This package:

Publication options

Detail

sckott commented 5 years ago

thanks for your submission @Robinlovelace We are discussing now and will get back to you soon

Robinlovelace commented 5 years ago

Great. It's in active development as you'll see. We have code to make all those millions of crash points sf data frames, we've just not pushed it. @layik has done great work refactoring the download functions so it makes it much easier to access this important dataset. Looking forward to hearing back from you and thanks for the update.

Robinlovelace commented 5 years ago

Heads-up @sckott format_sf() is now in there and the package now has a proper vignette. Are there any other things we should add at this stage for peer review? Thanks.

sckott commented 5 years ago

will dig into it today, i'll let you know

sckott commented 5 years ago

Editor checks:


Editor comments

Thanks for your submission @Robinlovelace !

Here's the output from goodpractice. No need to address these now, it's info for reviewers to use to get started.

── GP stats19 ─────────────
It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 95% of code lines are covered by test cases.

    R/dl_stats19.R:82:NA
    R/dl_stats19.R:83:NA
    R/format.R:55:NA
    R/format.R:106:NA
    R/format.R:157:NA
    ... and 12 more lines

  ✖ add a "URL" field to DESCRIPTION. It helps users find information about your package online. If your package does not have a homepage, add an URL
    to GitHub, or the CRAN package package page.
  ✖ add a "BugReports" field to DESCRIPTION, and point it to a bug tracker. Many online code hosting services provide bug trackers for free,
    https://github.com, https://gitlab.com, etc.
  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and developers are used it and it is easier to read your code for them if
    you use '<-'.

    R/dl_stats19.R:28:12
    R/dl_stats19.R:30:9
    R/dl_stats19.R:31:16
    R/dl_stats19.R:32:11
    R/dl_stats19.R:34:18
    ... and 204 more lines

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try make your lines
    shorter than 80 characters

    R/dl_stats19.R:28:1
    R/dl_stats19.R:104:1
    R/dl_stats19.R:105:1
    R/format.R:42:1
    R/format.R:98:1
    ... and 14 more lines

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data. Consider using vapply() instead.

    R/format.R:257:20

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They are error prone and result 1:0 if the expression on
    the right hand side is zero. Use seq_len() or seq_along() instead.

    R/format.R:247:11

  ✖ fix this R CMD check NOTE: Namespaces in Imports field not imported from: ‘dplyr’ ‘foreign’ ‘lubridate’ All declared Imports should be used.

Seeking reviewers now 🕐


Reviewers:

Robinlovelace commented 5 years ago

Thanks @sckott. I'd seen goodpractice but never used it - that is very useful. After a few commits stats19 is improved. Latest results:

── GP stats19 ───────────────────────────────────────────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code in
    general. 95% of code lines are covered by test cases.

    R/dl_stats19.R:85:NA
    R/dl_stats19.R:86:NA
    R/format.R:43:NA
    R/format.R:88:NA
    R/format.R:133:NA
    ... and 12 more lines

  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and
    R users and developers are used it and it is easier to read your code for
    them if you use '<-'.

    R/dl_stats19.R:28:12
    R/dl_stats19.R:33:9
    R/dl_stats19.R:34:16
    R/dl_stats19.R:35:11
    R/dl_stats19.R:37:18
    ... and 203 more lines
sckott commented 5 years ago

Reviewers assigned:

daranzolin commented 5 years ago

Prolegomena

Thanks to rOpenSci for the invitation to review, and special thanks to Robin, Malcolm, Layik, and Mark for this important package. I'm already a huge fan.


Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

I would tease a bit more of what's in these data sets. I wasn't entirely sure until I downloaded and opened the supporting documentation. If I were searching for this kind of data, and I didn't know what STATS19 was, I'd like to know I'm in the right place after scanning the README. Maybe a map?

I couldn't load the vignette from the console:

vignette(package = "stats19")
#> no vignettes found

Admittedly, I also couldn't load the vignette for my own rOpenSci package, so I'm not sure what's missing.

Several of the examples failed:

stats19::read_vehicles()
#> Error in if (year %in% 1980:2003) {: argument is of length zero

I couldn't find any explicit contributing guidelines in the README, and there is no CONTRIBUTING document.

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

There is no paper.md.

Functionality

Test results:


══ Results ═════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 24.2 s

OK: 37 Failed: 0 Warnings: 0 Skipped: 0


- [X] **Packaging guidelines**: The package conforms to the rOpenSci packaging guidelines

#### Final approval (post-review)

- [x] **The author has responded to my review and made changes to my satisfaction. I recommend approving this package.**

Estimated hours spent reviewing: 3.5

---

### Review Comments

> A superb and essential package--we need this data and we need it in these formats. The download-format-read-explore workflow is intuitive and relatively frictionless. I have only some brief comments:

* I wonder you could possibly merge the formatting and reading step with a `raw = TRUE` or `format = TRUE` argument in the `read_*` functions. But perhaps that's my tendency towards abstraction. Something like `ac = read_accidents(year = 2017, format = TRUE)`

* My personal preference would be to have the schema from `dl_schema` lazily loaded with the package.

* [According to the vignette,](https://github.com/ITSLeeds/stats19/blob/master/vignettes/stats19.Rmd) the `dl_*` functions are interactive, although the interactivity is commented out in the code. Will the interactivity be returning? Or does the vignette need to be updated?

* Out of curiosity, what's happening with https://github.com/cyipt/stats19? It was updated recently.

* I confess I wish the package name was more expressive--stats19 sounds like an introductory statistics class. 

* This data will be used to make many maps. I personally would love a nudge in that direction in either the README or the vignette. 
sckott commented 5 years ago

thanks very much for your review @daranzolin

mpadge commented 5 years ago

Yeah, thanks @daranzolin for an impressively prompt review. I'll leave the response proper to @layik and @Robinlovelace, but will confess guilty here to having commented out the interactivity in the dl_* functions. That's all my doing and I should already have reinstated it. Apologies to all, and yes, it will be returning.

Robinlovelace commented 5 years ago

Many thanks for the quick yet detailed review @daranzolin. I'm working on it this week and will get back to you on each of those constructive comments asap. More soon...

layik commented 5 years ago

Great review @daranzolin. Appreciate it. Thanks @Robinlovelace and @mpadge. Apologies I just missed this thread for some reason.

Robinlovelace commented 5 years ago

Heads-up all (especially @daranzolin, @mpadge and @layik - the originator of interactive downloads in stats19), I've re-added interactivity, this time using menu() which is best practice for this kind of thing and fewer lines of code.

Please check out the video to see how it works here:

asciicast

And if you're interested in the code used to do this, see here: https://github.com/ITSLeeds/stats19/pull/36

Feedback on behaviour or code welcome before I merge this. Many thanks @layik for implementing this 1st time around.

layik commented 5 years ago

Just going through the review @daranzolin, RE vignette, I think what you needed to do was

devtools::install(build_vignettes = TRUE)

then you would be able to load the vignette.

Robinlovelace commented 5 years ago

Heads-up we've addressed most of the issues raised by @daranzolin I think. Many thanks for a great set of comments. We're documenting them here https://github.com/ITSLeeds/stats19/issues/49 and plan to put the review responses here https://github.com/ITSLeeds/stats19/blob/master/responses1.Rmd

Not quite finished but plan to properly respond in a comment here by Saturday this week - sorry it's taken a while.

layik commented 5 years ago

Hi @daranzolin, as promised by @Robinlovelace find our responses and thank you very much for taking time to do the review and please feel free to let us know if you have any further feedback.

=====

Responses to review 1 of stats19

Thanks for the review. We've had a chance, after making some changes and fixes to the package, to take-in and act on each of the comments. The code-base has evolved substantially since the review, but the fundamental design of the package, with its 3 stage API mirroring workflows that happened before the package was developed, remains unchanged. That is:

dl_stats19(year = 2017)
# Multiple matches. Which do you want to download?
# 
# 1: dftRoadSafetyData_Vehicles_2017.zip
# 2: dftRoadSafetyData_Casualties_2017.zip
# 3: dftRoadSafetyData_Accidents_2017.zip
dl_stats19(year = 2017, type = "ac")
# Files identified: dftRoadSafetyData_Accidents_2017.zip
# 
# Wanna do it (y = enter, n = esc)? 
dl_stats19(year = 1985)
# Year not in range, changing to match 1979:2004 data
# This file is over 240 MB in size.
# Once unzipped it is over 1.8 GB.
# Files identified: Stats19-Data1979-2004.zip
# 
# Wanna do it (y = enter, n = esc)?

Note: the last two stages is now combined by default as per this review read_*(format = TRUE) by default.

We'll focus on areas flagged in the review for the rest of this response:

I would tease a bit more of what's in these data sets. I wasn't entirely sure until I downloaded and opened the supporting documentation. If I were searching for this kind of data, and I didn't know what STATS19 was, I'd like to know I'm in the right place after scanning the README. Maybe a map?

We have added a map (well technically 9 maps!) and a couple of time series plots showing the scale of the data. Also show a sample of the additional casualty and vehicle tables has been added to show more clearly the richness of data provided.

I couldn't load the vignette from the console:

We also could not see the vignette when installing using devtools::install_github(build_vignettes = TRUE. But we can see the vignette if we install locally.

This was the code we ran:

devtools::install(build_vignettes = TRUE)
## Installing stats19
vignette(package = "stats19")

Several of the examples failed:

These have now been fixed - thanks for testing and reporting.

I couldn't find any explicit contributing guidelines in the README, and there is no CONTRIBUTING document.

A CONTRIBUTING is added now. Thank you.

The package has an obvious research application according to JOSS's definition

There is no paper.md.

One is added with:

Review Comments

A superb and essential package--we need this data and we need it in these formats. The download-format-read-explore workflow is intuitive and relatively frictionless. I have only some brief comments:

Thank you.

I wonder you could possibly merge the formatting and reading step with a raw = TRUE or format = TRUE argument in the read_* functions. But perhaps that's my tendency towards abstraction. Something like ac = read_accidents(year = 2017, format = TRUE)

Done, appreciate your input.

My personal preference would be to have the schema from dl_schema lazily loaded with the package.

DESCRIPTION: has the line LazyData which means stats19_schema is lazy loaded.

According to the vignette, the dl_* functions are interactive, although the interactivity is commented out in the code. Will the interactivity be returning? Or does the vignette need to be updated?

Back in, as stated above.

Out of curiosity, what's happening with https://github.com/cyipt/stats19? It was updated recently.

@mem48 answered this: cyipt/stats19 is not actually a proper R package. It is a repo containing scripts for CyIPT project, it has different sources (UK DS), and usage so there is no current need to adapt the use to this package. Malcolm is one of the contributors to this package.

I confess I wish the package name was more expressive--stats19 sounds like an introductory statistics class.

This a reasonable point that we have thought of and discussed. We are open minded about changing the name but, as with so many things, there are +s and -s (outlined for some options below):

The main benefit we can see of changing the name would be making the package easier to find. We think good documentation and clear description and some write-ups of the package and what it does could address these issues. We've explored stat19 name and it links directly to (and is almost synonymous with) road crash data. See https://en.wikipedia.org/wiki/STATS19 for an excellent example (we plan to add this link to the README)

so the name is OK for we think, but we're open minded to alternative names mentioned above and perhaps names we've not thought of.

This data will be used to make many maps. I personally would love a nudge in that direction in either the README or the vignette.

Definitely. Thank you very much for your input.

Robinlovelace commented 5 years ago

Quick question for @sckott, what is you policy regarding CRAN submission while it's under review? We may change this in response the the answer (or we submit tomorrow!): https://github.com/ITSLeeds/stats19/milestone/1.

Look forward to your response @adamhsparks.

daranzolin commented 5 years ago

Thanks @Robinlovelace @layik @mpadge -- the new vignette and README looks great. @sckott, all of my comments have been addressed.

One last observation: I'm getting parsing failures on the Accident_Index column with both read_vehicles and read_casualties:

> vehicles = read_vehicles()
Warning: 298687 parsing failures.
row       col               expected  actual                                                                                                                       file
  1 Acc_Index no trailing characters BS70001 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
  2 Acc_Index no trailing characters BS70001 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
  3 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
  4 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
  5 Acc_Index no trailing characters BS70003 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Vehicles_2009/ [... truncated]
Warning message:
The following named parsers don't match the column names: Accident_Index 
> casualties = read_casualties()
Warning: 222146 parsing failures.
row       col               expected  actual                                                                                                                           file
  1 Acc_Index no trailing characters BS70001 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
  2 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
  3 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
  4 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
  5 Acc_Index no trailing characters BS70002 '/var/folders/j4/lb613tmj3_xd5jj5n11dprz04kdcr4/T//RtmpWy104K/DfTRoadS [... truncated]
Warning message:
The following named parsers don't match the column names: Accident_Index 

Is it because the actual column name is acc_index instead of Accident_Index?

Robinlovelace commented 5 years ago

Thanks @daranzolin, which years do you get parsing failures for? Cannot reproduce for 2017 data, as illustrated in the reprex below:

devtools::install_github("ITSLeeds/stats19")
#> Skipping install of 'stats19' from a github remote, the SHA1 (225bce03) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(stats19)
#> Data provided under the conditions of the Open Government License.
#> If you use data from this package, mention the source
#> (Department for Transport), cite the package and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/.

# casualties
dl_stats19(year = 2017, type = "casualties")
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Data saved at /tmp/Rtmpw3fpFE/dftRoadSafetyData_Casualties_2017/Cas.csv
ca = read_casualties(year = 2017)
head(ca$accident_index)
#> [1] "2017010001708" "2017010001708" "2017010001708" "2017010009342"
#> [5] "2017010009344" "2017010009348"

# vehicles
dl_stats19(year = 2017, type = "vehicles")
#> Files identified: dftRoadSafetyData_Vehicles_2017.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Vehicles_2017.zip
#> Data saved at /tmp/Rtmpw3fpFE/dftRoadSafetyData_Vehicles_2017/Veh.csv
ve = read_vehicles(year = 2017)
head(ve$accident_index)
#> [1] "2017010001708" "2017010001708" "2017010009342" "2017010009342"
#> [5] "2017010009344" "2017010009344"

Created on 2018-12-21 by the reprex package (v0.2.1)

daranzolin commented 5 years ago

It was the 2009 datasets for vehicles and casualties. Apologies for the missing reprex.

devtools::install_github("ITSLeeds/stats19")
#> Skipping install of 'stats19' from a github remote, the SHA1 (53a82fe9) has not changed since last install.
#>   Use `force = TRUE` to force installation
#> Skipping install of 'stats19' from a github remote, the SHA1 (225bce03) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(stats19)
#> Data provided under the conditions of the Open Government License.
#> If you use data from this package, mention the source
#> (UK Department for Transport), cite the package, and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/.
#> Data provided under the conditions of the Open Government License.
#> If you use data from this package, mention the source
#> (Department for Transport), cite the package and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/.

# casualties
dl_stats19(year = 2009, type = "casualties")
#> Files identified: DfTRoadSafety_Casualties_2009.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2009.zip
#> Data saved at /var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Data saved at /tmp/Rtmpw3fpFE/dftRoadSafetyData_Casualties_2017/Cas.csv
ca = read_casualties(year = 2009)
#> Warning: The following named parsers don't match the column names:
#> Accident_Index
#> Warning: 222146 parsing failures.
#> row       col               expected  actual                                                                                                                           file
#>   1 Acc_Index no trailing characters BS70001 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
#>   2 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
#>   3 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
#>   4 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
#>   5 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv'
#> ... ......... ...................... ....... ..............................................................................................................................
#> See problems(...) for more details.
head(ca$accident_index)
#> Warning: Unknown or uninitialised column: 'accident_index'.
#> NULL

# vehicles
dl_stats19(year = 2009, type = "vehicles")
#> Files identified: DfTRoadSafety_Vehicles_2009.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Vehicles_2009.zip
#> Data saved at /var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv
#> Files identified: dftRoadSafetyData_Vehicles_2017.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Vehicles_2017.zip
#> Data saved at /tmp/Rtmpw3fpFE/dftRoadSafetyData_Vehicles_2017/Veh.csv
ve = read_vehicles(year = 2009)
#> Warning: The following named parsers don't match the column names:
#> Accident_Index
#> Warning: 298687 parsing failures.
#> row       col               expected  actual                                                                                                                       file
#>   1 Acc_Index no trailing characters BS70001 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
#>   2 Acc_Index no trailing characters BS70001 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
#>   3 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
#>   4 Acc_Index no trailing characters BS70002 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
#>   5 Acc_Index no trailing characters BS70003 '/var/folders/1f/2ds51cyj353cmqcy419df_6m0000gn/T//Rtmp2IlBw0/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv'
#> ... ......... ...................... ....... ..........................................................................................................................
#> See problems(...) for more details.
head(ve$accident_index)
#> Warning: Unknown or uninitialised column: 'accident_index'.
#> NULL

Created on 2018-12-21 by the reprex package (v0.2.1)

layik commented 5 years ago

Hi David. I will have a look at 2009 and see what the issue is and report.

Robinlovelace commented 5 years ago

Hi @daranzolin thanks loads for reporting. I think it's fixed on this branch which I'm planning to merge, as demonstrated in this reprex:

devtools::install_github("ITSLeeds/stats19", ref = "index")
library(stats19)
#> Data provided under the conditions of the Open Government License.
#> If you use data from this package, mention the source
#> (UK Department for Transport), cite the package, and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/.
dl_stats19(year = 2009, type = "casualties")
#> Files identified: DfTRoadSafety_Casualties_2009.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2009.zip
#> Data saved at /tmp/RtmpFsdiJv/DfTRoadSafety_Casualties_2009/DfTRoadSafety_Casualties_2009.csv
ca = read_casualties(year = 2009)
head(ca$accident_index)
#> [1] "200901BS70001" "200901BS70002" "200901BS70002" "200901BS70002"
#> [5] "200901BS70002" "200901BS70002"
dl_stats19(year = 2009, type = "vehicles")
#> Files identified: DfTRoadSafety_Vehicles_2009.zip
#> Attempt downloading from:
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Vehicles_2009.zip
#> Data saved at /tmp/RtmpFsdiJv/DfTRoadSafety_Vehicles_2009/DfTRoadSafety_Vehicles_2009.csv
ve = read_vehicles(year = 2009)
head(ve$accident_index)
#> [1] "200901BS70001" "200901BS70001" "200901BS70002" "200901BS70002"
#> [5] "200901BS70003" "200901BS70003"

Created on 2018-12-23 by the reprex package (v0.2.1)

Robinlovelace commented 5 years ago

Update: index branch has now been merged.

sckott commented 5 years ago

@adamhsparks your review is due in 1 week, thanks very much

sckott commented 5 years ago

@Robinlovelace we usually don't have maintainers submit to CRAN during the review but if its already on CRAN and there is a bug fix patch, then seems justifiable to submit to CRAN during review

adamhsparks commented 5 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [x] A short summary describing the high-level functionality of the software
  • [x] Authors: A list of authors with their affiliations
  • [x] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [x] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 7.5


Review Comments

Clearly much effort has gone into this package. I greatly support the sentiment behind making these data available in R having done the same for other data sets, myself. This package should be a great benefit to researchers in this area. I really appreciate the slim dependencies. It will make this package much easier to maintain into the future.

I found the package to be well documented, the vignette is helpful in illustrating some use cases for the data along with how to access it and the code is clear.

Some of the functionality I find to be mildly confusing like downloading and then importing the files into the R session and then formatting. As a user I'd prefer it all in one step, but there are likely use cases I'm not aware of that mean that this is useful so some examples of this would be useful I think.

My general comments on the code follow and sections for the DESCRTIPTION and Vignette as well. I've commented quite a bit on grammar and spelling as I think that the polish of a package is important as it lends to the perception of the quality.

Thanks for inviting me to review this. It's been fun and I've learned some new things.


Download now (y = enter, n = esc)?

Warning message: In find_file_name(years = year, type = type) : Coordinates unreliable in this data.



- I got caught out when using the interactive features. I read "y = enter" but hit "y" thinking that would work as well as hitting "enter", but R cancelled the operation anyway just as if I'd hit "esc"

- Per a recent conversation with CRAN, you should use `donttest()` rather than `dontrun()` for examples you don't want to be run on CRAN. Then set .travis.yml to run them by using `r_check_args: --as-cran --run-donttest`. **This may not be appropriate in all cases, e.g. interactive functions.**

- When validating user inputs and using `stop()` it's nice to use `call. = FALSE` to simplify the error message that the user receives.

- Consider using [`hoardr`](https://ropensci.github.io/hoardr/) for managing user-saved files on disk that aren't in `tempdir()`?

- When using `utils::download.file()`, you should use `mode = "wb"` or Windows users may end up with corrupted downloads in my experience. `curl::curl_download()` does the same thing but uses more updated ways of doing it and defaults to using a binary mode (wb).

- I don't think that there is much need for the `Attempting download from` or `Reading in: ` message. If it takes that long, I would suggest to use a progress bar to show progress. But this is just a personal observation.

- Consider setting up a `pkgdown` site? It's easy to do and you can automate deployment with your Travis-CI so it's less to remember.

#### Tests

- I'm unclear how the interactive portion of the package functions is handled in testing? There are ways to handle this, but I don't see any implemented and when I run `devtools::test()` I'm asked to provide my own input.

- Suggest using `skip_on_cran()` since some of the tests can take some time to execute due to download times.

#### DESCRIPTION File

- In the DESCRIPTION file, Mark's author entry is missing his ORCID.

- More information in the DESCRIPTION's Description field would be desirable, a link to the data's website or other information to give more background perhaps.

- STATS19 should be in "'" in DESCRIPTION for CRAN, i.e., 'STATS19', I think.

- Check spelling in DESCRIPTION file, see: "analysie".

- The Description should include a link to the DfT website.

- Language field should be set, `Language: en-GB`.

#### README File(s)

- Use `remotes::install_github()` in place of `devtools::install_github()` in README.

- The code style is inconsistent in the README.Rmd file in the code chunks, e.g. line 85 is missing space around `=`.

- The example in the README showing two steps seems necessarily confusing to new users. If there is a good reason for having the raw data in R, document in a vignette why this is useful and show the two-step process, but if normal users won't do this, I wouldn't show it in the quick-start.

- Line 43 of README uses inconsistent "(" around the phrases with the other two `read_*` function description.

#### Vignette

- Run spell-check on it.

- The term "attach"" has a specific meaning in R. Suggest rewording the portion about installation and loading the package to omit the use of "attach", since you're not using `attach()` in the R sense (and really shouldn't use it anyway).

- I would describe why a user might want or need to install the Development version from GitHub in the vignette. Presumably if they are reading the vignette, they've already installed the package from CRAN (in the future).

- Try to consistently use `function()` to identify functions in the vignette text. This also means that if/when you use pkgdown to build a site, the functions are linked to the help file.

- In the introduction, the description of why there are `read_*()` and `format_*()` functions is confusing. To me, it reads as if `format` is only a parameter for `read_*()` in the introduction. I was left wondering why it's documented there or why the `format_*()`s even exist until I reached the end of the vignette.

- There is a comma out of place in Vignette,

>Format: Each of the read_*() functions has a format parameter which, when TRUE, adds

should be 

>Format: Each of the read_*() functions has a format parameter, which, when TRUE, adds 

- I'm unsure about including a package that's not on CRAN in the vignette (`ukboundaries`), something like this should be listed in Suggests, but it's not on CRAN, @sckott do you have any thoughts?

- The first figures in the `sf` section after the join aren't immediately clear to me. The axis lack labels, I'm not really sure what I'm looking at.

#### Meta

- The contributing guidelines mention a `pkgdown` website, this does not exist.

- The JOSS paper needs spelling check run on it to correct spelling errors.

This review is also hosted as a repository on my GitHub, https://github.com/adamhsparks/stats19-review
adamhsparks commented 5 years ago

Also, as I've been working on this, I've modified the DESCRIPTION file to fix some errors, and address most of the points I raised. I've also run spellcheck on the vignette if you'd accept a PR.

sckott commented 5 years ago

thanks for your review @adamhsparks !

layik commented 5 years ago

That is a great review @adamhsparks. Thank you. Thanks @sckott appreciate your follow up!

adamhsparks commented 5 years ago

Sorry, I missed the paper.md in the root dir. I've updated my review to reflect this.

Robinlovelace commented 5 years ago

Hi @adamhsparks no problem, and just saw the PR. Will look now. Meanwhile, I've put our work-in-progress response to your review here: https://github.com/ITSLeeds/stats19/issues/63 - any questions, please ask. Very grateful for the level of detail and knowledge in the review. Thank you!

Robinlovelace commented 5 years ago

One question, probably more for @sckott: does JOSS accept figures and code chunks? Thinking that could liven it up a bit...

Robinlovelace commented 5 years ago

OK just merged those 2 mini PRs. That raises another question: are both reviewers happy to be in the author list with the new rev tag? More info here: https://ropensci.org/blog/2018/03/16/thanking-reviewers-in-metadata/

adamhsparks commented 5 years ago

Hi @Robinlovelace, happy to be listed as a reviewer, "rev", and contributor, "ctb", for the pull requests you've accepted. My ORCID: 0000-0002-0061-8359.

daranzolin commented 5 years ago

@Robinlovelace rev is great for me! Thanks for asking.

Robinlovelace commented 5 years ago

Great, I've add both reviewers in https://github.com/ITSLeeds/stats19/pull/67/files

sckott commented 5 years ago

@Robinlovelace will ask, not sure about the JOSS question

sckott commented 5 years ago

AFAICT figures can be included https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain - it doesn't say anything about code chunks, but you could try

Robinlovelace commented 5 years ago

Happy to say, we've finally responded to all the points made by @adamhsparks. We've learned a lot in the process, and have added an auto deploy website (see https://itsleeds.github.io/stats19/ ), plus various current best practice things like donttest that we didn't know about. Thanks loads!

Any further comments/suggestions: very welcome.

Overall, we found all the suggestions sensible, linking to previous discussions about combining the 3 stage (dl, read, format) process into a single function call, which we were thinking of calling get_stats19(). The reason for splitting the process up is to ensure maximum transparency and to give the user control over what the package is doing. However, as long as it is properly documented, we think the benefits of a get_stats19() function will outweigh any possible negatives we can think of. get_stats19() has now been added!

Thanks for the comments, we have indeed tried to keep dependencies to a minimum but consider readr worthwhile. readxl and tibble have been removed. curl has been demoted to Suggests, as detailed in another comment.

Thanks. If you think of other was we can communicate the value of the data, do let us know (I think the second mapping figure could be improved...).

We have long been planning to add a get_stats19() function as per https://github.com/ITSLeeds/stats19/issues/11 The review comment, combined with further discussion, has triggered us to re-prioritise it. It's been beneficial to polish each of the component functions first, however, and good to document each stage for maximum transparency, however, so we plan to keep the dl, read and format functions exported.

Agreed.


The guidance is to 'Only use package startup messages when necessary'. A case can be made that this is necessary. As with osmdata, the package provides access to data that has a license that requires it to be cited. The osmdata load message is as follows:

library(osmdata)

We fully agree with the reasoning behind remove package startup messages however. As a compromise, we've shortened the startup from 4 lines to 2:

# before:
# Data provided under the conditions of the Open Government License.
# If you use data from this package, mention the source
# (UK Department for Transport), cite the package, and link to:
# www.nationalarchives.gov.uk/doc/open-government-licence/version/3/.
# after:
library(stats19)
Data provided under OGL v3.0. Cite the source and link to:
www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

running goodpractice::gp() found the following lines with > 80 lines:

    R/format.R:62:1
    R/format.R:67:1
    R/read.R:141:1
    R/utils.R:167:1

All these have been fixed.

Kindly make sure you build before running check the resultinggz` archive.

curl is used in the tests and readxl is used in the examples. These have been demoted to Suggests. tibble has been removed from the DESCRIPTION file.

See https://github.com/ITSLeeds/stats19/pull/68

Acronyms have been explained and better formatting of 'STATS19' (note, it's capitalised when referring to the data, but lower case when referring to the package), and a few other improvements to the docs, have been made. See https://github.com/ITSLeeds/stats19/commit/e4ab09aa716630322aadd26dee3465258a2e58a0

@seealso tags have now been added to a few key documentation pages.

Yes, but I think the costs of doing this would outweigh the benefits (RL)

Fixed in https://github.com/ITSLeeds/stats19/commit/777479b9234be893a9946c787ef23cc08ff6a41a

Fixed in https://github.com/ITSLeeds/stats19/commit/aec13fc0673717d68a3eb159c066d76a8d0645ca

dl_stats19(year = 1979, type = "deaths")
No files of that type found for that year.
This will download 240 MB+ (1.8 GB unzipped).
Files identified: Stats19-Data1979-2004.zip

Download now (y = enter, n = esc)? 

Warning message:
In find_file_name(years = year, type = type) :
  Coordinates unreliable in this data.

Now produces the following output: Error in dl_stats19(year = 1979, type = "deaths") : Stopping as requested

Agreed, that was an issue. Fixed in the latest version (tested with > stats19::dl_stats19(year = 1979))

Trying this, as per: https://github.com/ITSLeeds/stats19/pull/69

Add as requested. See https://github.com/ITSLeeds/stats19/pull/70

We have considered this and have taken a good look at the package. In this case, however, I don't think it's appropriate: saving the 1/4 GB .zip file associated with the 1979:2005 data, for example, does not make sense when it can be extracted and saved in a more useful format. We assume the user will extract and save the data they use. Furthermore, the DfT is trialling an API which should make the desire to save intermediate files even rarer. Users are told where the files are saved, so are free to move them where they like.

We have tested on Windows and have not found any issues with the dl_stats19() function - so treating this as a 'wontfix', unless a strong reason can be found for adding this.

There are read and download progress bars for large datasets. Try this, for example:

dl_stats19(year = 1979)

A site has been added: https://itsleeds.github.io/stats19/

I'm testing auto-deploy on travis and will happily merge this branch if it works (RL): https://github.com/ITSLeeds/stats19/pull/71

Update: auto deploy failed so reverting that change to .travis.yml. pkgdown that was making it fail. See https://travis-ci.org/ITSLeeds/stats19/builds/475415270#L1622 And: https://ropenscilabs.github.io/travis/

Is this is an ropensci pkg issue and is it worth asking for support?

Tests

When R is running non-interactively it auto-downloads the necessary files or produces an error. All tests involving downloads work on our computers and Travis. They can be turned off for the duration of your session with:

Sys.setenv(DONT_DOWNLOAD_ANYTHING = "true")

Not sure if that answers the question - any further feedback on this: welcome.

Added in https://github.com/ITSLeeds/stats19/pull/72

DESCRIPTION File

See fixes here: https://github.com/ITSLeeds/stats19/commit/fdeb634dbab5a39ab599b35693623d0b1c94f71d

README File(s)

Vignette

should be

This is a good point. Fixed, by adding a much more useful dataset, representing the juristictions of police forces across England and Wales: police_boundaries.

Agreed. The maps were not particularly clear. An overview map has been added to the vignette. For context, and to ensure README-vignette harmony, we also changed the README's mapping section. It now also contains an overview map so people know what they are looking at, but not the facetted map, which was confusing.

Meta

layik commented 5 years ago

I too, learned a lot from this review @sckott, @daranzolin and @adamhsparks. I really appreciate all your contributions. Fantastic open source vibes.

Thanks @Robinlovelace and @mpadge awesome work.

adamhsparks commented 5 years ago

Thanks, @Robinlovelace, @layik and @mpadge. I'll have a look at the changes shortly and respond.

adamhsparks commented 5 years ago

Thank you @Robinlovelace, @layik and @mpadge for your consideration of my comments and the incorporation of many of them as you see fit. However, there are still a few issues (most of which you did attempt to previously address but seem to have slipped through) that I think should be fixed before I sign off on it.

  1. Use spelling::spell_check_package() and correct the errors.

  2. There's a line >80 chars. Not a hard-fast rule, but nice to observe and this one can be easily split over two lines. It is Line 149 of read.R.

  3. In the data.R file there are several examples that say something like: "The file names were generated as follows:" IMO any example like this needs to be in the data-raw in Rmd files and only the data in the final package with proper documentation as to what the data are and a link to where they were obtained from. A link to the data-raw files would be appropriate in the documentation for each data set.

From Hadley's R packages - Data page: http://r-pkgs.had.co.nz/data.html

Often, the data you include in data/ is a cleaned up version of raw data you’ve gathered from elsewhere. I highly recommend taking the time to include the code used to do this in the source version of your package. This will make it easy for you to update or reproduce your version of the data. I suggest that you put this code in data-raw/. You don’t need it in the bundled version of your package,...

Emphasis mine.

There is also an error in the examples for police_boundaries. I cannot reproduce the data set as you have in your example.

  > library(sf)
  Linking to GEOS 3.7.1, GDAL 2.3.2, PROJ 5.2.0
  > u = paste0("https://opendata.arcgis.com/",
  +   "datasets/3e5a096a8c7c456fb6d3164a3f44b005_3.kml"
  +   )
  > police_boundaries_wgs = read_sf(u)
  Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
    GDAL Error 1: JSON parsing error: unexpected character (at offset 0)
  > police_boundaries = st_transform(police_boundaries_wgs, 27700)
  > police_boundaries = police_boundaries[c("pfa16cd", "pfa16nm")]
  Error: Can't find columns `pfa16cd`, `pfa16nm` in `.data`.
  Execution halted

The warning with GDAL can be circumvented if I first download the file then import it? Not that it matters, the error still occurs since there are no columns named pfa16cd or pfa16nm in the data frame.

adamhsparks commented 5 years ago

One other suggestion for you is to add a CITATION file so that citation("stats19") can be used to generate a reference for citing it from R.

Robinlovelace commented 5 years ago

These all sound reasonable @adamhsparks. We'll get on it and hope to 'resubmit' by early next week. Note to self: add a link to show changes that have happened between now and then: https://help.github.com/articles/comparing-commits-across-time/

adamhsparks commented 5 years ago

If it helps, here's the CITATION file for nasapower that has a JOSS article and a manual (to give the release version) both in it as well as info on citing the POWER data properly.

https://github.com/ropensci/nasapower/blob/master/inst/CITATION

layik commented 5 years ago

Thanks @adamhsparks! Timely followup from @Robinlovelace as always.

Robinlovelace commented 5 years ago

Great, thanks for showing us this example. I imagine the citation will change if/when it goes into JOSS.

Robinlovelace commented 5 years ago

Response to 2nd round of reviews

Many thanks for providing follow-up comments, they have been really useful and led to further improvements in the package. All the changes in response to them can be found here: https://github.com/ITSLeeds/stats19/compare/3f38d4cd9144497a0884c1c49c06f680b674acb7...master

Specifically:

  1. Use spelling::spell_check_package() and correct the errors

  2. See https://github.com/ITSLeeds/stats19/commit/1d83af97d12c148d2ddd07d6753a0bebdfaff7e8

  3. In the data.R file there are several examples that say something like: "The file names were generated as follows:" IMO any example like this needs to be in the data-raw in Rmd files and only the data in the final package with proper documentation as to what the data are and a link to where they were obtained from. A link to the data-raw files would be appropriate in the documentation for each data set.

    • The examples have been changed to examples of using the data rather than getting the data.

The code to get police_boundaries (now in data-raw) has been updated to be more robust. It worked on my Ubuntu machine but suspect it failed on Mac. Maybe one for the sf issue tracker if so. In any case here's a reprex from the new code that should work on any OS:

library(sf)
#> Linking to GEOS 3.7.0, GDAL 2.3.2, PROJ 5.2.0
u = "https://opendata.arcgis.com/datasets/3e5a096a8c7c456fb6d3164a3f44b005_3.geojson"
police_boundaries_wgs = sf::st_read(u)
#> Reading layer `3e5a096a8c7c456fb6d3164a3f44b005_3' from data source `https://opendata.arcgis.com/datasets/3e5a096a8c7c456fb6d3164a3f44b005_3.geojson' using driver `GeoJSON'
#> Simple feature collection with 43 features and 9 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -6.402838 ymin: 49.86709 xmax: 1.761138 ymax: 55.81109
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
names(police_boundaries_wgs)
#>  [1] "objectid"       "pfa16cd"        "pfa16nm"        "bng_e"         
#>  [5] "bng_n"          "long"           "lat"            "st_areashape"  
#>  [9] "st_lengthshape" "geometry"
police_boundaries = st_transform(police_boundaries_wgs, 27700)
names(police_boundaries)
#>  [1] "objectid"       "pfa16cd"        "pfa16nm"        "bng_e"         
#>  [5] "bng_n"          "long"           "lat"            "st_areashape"  
#>  [9] "st_lengthshape" "geometry"
police_boundaries = police_boundaries[c("pfa16cd", "pfa16nm")]

Created on 2019-01-09 by the reprex package (v0.2.1)

I think that's everything. Any further feedback, very welcome.

layik commented 5 years ago

Thanks @Robinlovelace. Thanks @adamhsparks, @sckott and @daranzolin one more time from me.

adamhsparks commented 5 years ago

Looks great! I've amended my review and recommend accepting this package to rOpenSci now.