Closed adamhsparks closed 7 years ago
@sckott, I've found a bug that I need to fix.
Not sure how it's crept in. I will report back when it's corrected.
Well that was embarrassing. It's all fixed now, @sckott.
Glad I checked the package build again this morning.
Thanks for your submission @adamhsparks - no worries about eth bug
Will be seeking reviewers, but in the meantime: I ran goodpractice::gp()
on your package. Here are the results, which it may be helpful to get started with before review (you can run it yourself and dig into the results more):
It is good practice to
✖ write unit tests for all functions, and all package code
in general. At least some lines are not covered in this package.
R/get_GSOD.R:230:NA
R/get_GSOD.R:231:NA
R/get_GSOD.R:234:NA
R/get_GSOD.R:235:NA
R/get_GSOD.R:238:NA
... and 359 more lines
✖ omit "Date" in DESCRIPTION. It is not required and it
gets invalid quite often. A build date will be added to the package
when you perform `R CMD build` on it.
✖ avoid long code lines, it is bad for readability. Also,
many people prefer editor windows that are about 80 characters
wide. Try make your lines shorter than 80 characters
R/get_GSOD.R:283:1
R/get_GSOD.R:284:1
R/get_GSOD.R:287:1
R/get_GSOD.R:306:1
R/get_GSOD.R:456:1
... and 9 more lines
Description
field in the DESCRIPTION
file is too detailed. Please make it more concise. Reviewers: @jeffreyhanson @deniederhut Due date: 2016-11-08
Thanks for the valuable feedback. This is exactly the type of suggestions I was looking for here.
I'll get as much of this done as I can shortly.
I've updated the DESCRIPTION file in the package and here in this issue as suggested.
Vignette builds faster now, as suggested as well.
Tomislav Hengl has contributed to the package and is now an author. I'll be merging his commits as I have time. However, I'll be on travel this week. So it may take me a while to address this.
Thanks for the updates @adamhsparks - seeking reviewers
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
URL
, Maintainer
and BugReports
fields in DESCRIPTION The package contains a paper.md
with:
Estimated hours spent reviewing: 3
The Global Summary Of The Day dataset contains weather data collected from meteorological stations all over the planet. However, accessing and formatting this data for use in R can be rather tricky. The GSODR R package will make this a dream. This package will be incredibly useful for new R users who want easy access to this data set. It will also be useful for experienced R users who want to download data over large extents and do post-processing later in R. I very much look forward to using this package in the future.
I have several major concerns before I can recommend this package for release. First, the main function get_GSOD
fails when the user wants to download data for an entire country. Personally, I think this is the biggest feature of this package. Second, the package could benefit from more automated testing. None of the examples are actually tested, and the unit tests only cover cases that are expected to fail. Third, the package only downloads data. It would be brilliant if the get_GSOD
function returned data after completion. I have outlined these main issues and other issues below.
get_GSOD
function cannot download all data for an entire country. For instance, running the code get_GSOD(year=2015, country='NZ', dsn=tempdir())
returns this error:Error in col[[i, exact = exact]] :
attempt to select more than one element in vectorIndex
\dontrun{}
so they are not actually tested.get_GSOD
just saves data to disk and does not actually return data. It would be brilliant if this package returned GSOD data so the user does not have to then load it back into R. To that end, I would recommend the function returns a single tibble
or SpatialPointsDataFrame
object (perhaps depending on an argument?), and let the user then re-save the data where they want it stored.vignette(package='GSODR')
yields 'no vignettes found'
. Additionally, the rmarkdown file for the vignette has an obscure name, so even if it could be opened within R, this would be tricky given its obscure name. I recommend naming the vignette after the package so the following code will open it vignette('GSODR')
.get_GSOD
is expected to fail but it does not actually test cases that are expected to succeed.?GSODR
). It would be really useful to include this, even if it it just a copy-paste from the README. See here for more details.nearest_station
function will be incredibly useful for a user. It would be even more useful if--instead of printing the stations closest to a user-specified point--it returned the stations. Then this function could be used programmatically to find the closest stations to a series of points, and download data accordingly.devtools::document()
before resubmission. When I ran devtools::run_examples()
, this automatically re-compiled some Rd files.foreach
iterators are used here. It would be great to do offer some parallelisation and options for progress bars here. If you wrapped this in a plyr::**ply
function, you could easily make it such that: (1) years can easily be processed in parallel, (2) if not processed in parallel, then a progress-bar can be shown.orginal_options <- options()
at the start of the function, and at the end of the function use options(original_options)
.threads
argument in the get_GSOD
function and defaulting it to 1.return(0)
is used after a call to stop()
. Please consider removing the return(0)
as this is unnecessary .dists <- fields::rdist.earth(as.matrix(stations[c('LON', 'LAT')]), matrix(c(LON, LAT), ncol=2), miles=FALSE)
nearby <- which(dists[,1] < distance)
utils::glob2rx
, this could be used instead pattern='^.*\\.gz$'
!vs %in% stations[[12]]
for clarity.Thanks, Jeff. I'm rather embarrassed that it didn't work for you.
I'll see what needs to be done to get it fixed and report back here when/as I do.
thanks for the review @jeffreyhanson ! a few notes:
\dontrun{}
is what we suggest in a package like this where examples make web requests, so I'd keep that - though if there's any examples that don't make web requests, definitely do put those outside of \dontrun
2nd reviewer @deniederhut assigned
@sckott Ok, thanks for correcting me on that. I just did a quick check and the nearest_stations
function also has its example wrapped in a \dontrun{}
. This function doesn't make web requests.
@jeffreyhanson, I've fixed the bug with the country request in the devel branch. Good catch, THANK YOU!
Your comment about the nearest_stations
is incorrect. It does make a web query with the .fetch_stations
function, thus is wrapped in a \dontrun{}
.
I'll work on some of the other issue's you've raised while I wait on my plane now.
Add threads option to get_GSOD
function for parallel processing …
https://github.com/adamhsparks/GSODR/commit/d404cb4b43cd0fc40365472df5c7a95674e7aeb8
Fix Documentation contains '<' which is not rendered … https://github.com/adamhsparks/GSODR/commit/84c210f1079b9e10409d4536c537e993f443d265
Replace pattern call on line 292 … https://github.com/adamhsparks/GSODR/commit/5c622afaae19618d8213f32ec7aab338247231dc
Fix bug when ISO2c/ISO3c is set, query fails https://github.com/adamhsparks/GSODR/commit/434f71e26a5bef87335e145a2d631f329dd6688e
Thanks, @jeffreyhanson, boarding now.
Update authors in paper.md https://github.com/adamhsparks/GSODR/commit/12f8683a9147b41aa7bbf76f88dffdfd90c3631c
Edit code for clarity
get_GOSD.R:631: This could rewritten be as !vs %in% stations[[12]] for clarity.
https://github.com/adamhsparks/GSODR/commit/83728444ac456da8b021032696ed469b0b208ac7
Check and reset original options()
get_GSOD.R:392--393: Options are restored to the default state but this is not necessarily the previous state. These values could be cached at the beginning of the function. For example: use orginal_options <- options() at the start of the function, and at the end of the function use options(original_options).
https://github.com/adamhsparks/GSODR/commit/960bb82e657b33a175cdac2e10d73fd755885bae
Renamed vignette to "GSODR"
https://github.com/adamhsparks/GSODR/commit/a0713f6602ec31981773c7a3d675c0082b5412c7
Additionally, @jeffreyhanson, If you install from github using devtools, devtools::install(build_vignettes = TRUE)
, will build the vignette, so then using vignette(package="GSODR")
a vignette is returned.
Added GSODR.R to create helpfile for package
There is no help page for the package (ie. ?GSODR). It would be really useful to include this, even if it it just a copy-paste from the README.
https://github.com/adamhsparks/GSODR/commit/a4044943d284e0cd44f53670b21995c33ee4a1fb
Removed return(0)
after stop
command
get_GSOD.R:561: Here and elsewhere return(0) is used after a call to stop(). Please consider removing the return(0) as this is unnecessary .
https://github.com/adamhsparks/GSODR/commit/9c4c24ea4d1221768b2083d86404fb29ce744959
Updated nearest_stations
function to return a list of stations for use in the get_GSOD
function's stations
parameter.
The nearest_station function will be incredibly useful for a user. It would be even more useful if--instead of printing the stations closest to a user-specified point--it returned the stations. Then this function could be used programmatically to find the closest stations to a series of points, and download data accordingly.
Also rewrote the function for computational efficiency.
nearest_stations.R:34--44: This code could be rewritten for clarity and computational efficiency.
https://github.com/adamhsparks/GSODR/commit/3a077575aaee54543ee4ee3aa93b9950c5355d80
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
URL
, Maintainer
and BugReports
fields in DESCRIPTIONEstimated hours spent reviewing: Maybe 2? This was conducted in ~20 minute chunks over the last two weeks, which makes it a little hard to guesstimate.
vignettes
and browseVignettes
function called on 'GSODR' returns no vignettes found
> system.time(get_GSOD(years = 2010, country = "Philippines", dsn = "~/", filename = "PHL",
+ GPKG = TRUE, CSV = FALSE, threads = 2))
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2010/gsod_2010.tar'
Content type 'unknown' length 98222080 bytes (93.7 MB)
==================================================
downloaded 93.7 MB
Finished downloading file.
Parsing the indivdual station files now.
user system elapsed
4.391 11.714 147.535
>
> system.time(get_GSOD(years = 2010, country = "Philippines", dsn = "~/", filename = "PHL",
+ GPKG = TRUE, CSV = FALSE, threads = 1))
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2010/gsod_2010.tar'
Content type 'unknown' length 98222080 bytes (93.7 MB)
==================================================
downloaded 93.7 MB
Finished downloading file.
Parsing the indivdual station files now.
user system elapsed
4.350 18.334 147.915
Thanks @jeffreyhanson and @deniederhut for your reviews! both reviews in now, @adamhsparks continue on, and let us know if you have any questions, want any feedback
Thanks folks. Excellent reviews and points raised in them.
I've been working through the review of @jeffreyhanson and will work on @deniederhut's as well.
I think this process is well worth it, especially for someone like me that is just learning how to put a package together. This will make this a much better package than it otherwise would have been.
@deniederhut, I did some checking on the parallelisation just today. It seems to be much faster in my testing. This is for two years, 2010:2011 and all stations in the data.
Using plyr
system.time(GSOD_XY <- plyr::ldply(.data = GSOD_list, .fun = .process_gz,
stations = stations))
user system elapsed
1055.835 71.492 2921.688
In parallel using foreach dopar loop (4 cores):
ity <- iterators::iter(GSOD_list)
system.time(GSODY_XY2 <- foreach::foreach(i = ity) %dopar% {
.process_gz(i, stations = stations)
}
)
user system elapsed
24.395 9.373 380.866
This is strictly the data formatting, cleaning, etc. not the whole process, so I'll do more checking through the whole package, but for now I think it's useful.
I've spent quite a bit of time working on this, I think it's ready for another round of review.
I've decided to release this as 1.0.0 when the review is complete. I think it's finally a suitable product that meets the needs of researchers. Hopefully I've addressed the reviewers comments sufficiently.
In addition to addressing the reviewer comments, which I've addressed below:
get_station_list
so that a user can use it to find stations of interest, and included a vignette that details how to use this functionality;reformat_GSOD
. This way users can download files using a FTP client, rather than the get_GSOD
function, and still take advantage of GSODR's ability to clean and reformat the data;
- All examples are wrapped in \dontrun{} so they are not actually tested.
Yes, since they require an active Internet connection and take longer to run than CRAN permits, they are wrapped in a \dontrun{} tag.
- The main function get_GSOD just saves data to disk and does not actually return data. It would be brilliant if this package returned GSOD data so the user does not have to then load it back into R. To that end, I would recommend the function returns a single tibble or SpatialPointsDataFrame object (perhaps depending on an argument?), and let the user then re-save the data where they want it stored.
A data.frame object is now returned as well as the option within the function to define a filename and where to store the file with two options: CSV and a spatial format GeoPackage file.
- Unit testing covers cases where get_GSOD is expected to fail but it does not actually test cases that are expected to succeed.
I've revised the unit testing to cover more of the functions within the package accordingly.
- There is no help page for the package (ie. ?GSODR). It would be really useful to include this, even if it it just a copy-paste from the README. See here for more details.
I've updated the documentation to include a help file that includes the main "description" and links to the GSODR functions and data documentation.
- The nearest_station function will be incredibly useful for a user. It would be even more useful if--instead of printing the stations closest to a user-specified point--it returned the stations. Then this function could be used programmatically to find the closest stations to a series of points, and download data accordingly.
I've updated this accordingly, I've included a more detailed vignette on how someone could use this function to find stations in a specific part of a country of interest.
- The accompanying paper does not list authors in the html document.
All of the author details are complete in the md file, however they don't appear in the HTML version when it is knit.
- This package needs a rerun of devtools::document() before resubmission. When I ran devtools::run_examples(), this automatically re-compiled some Rd files.
The documentation is updated and expanded.
- get_GSOD.R:20 & data.R:29: Both lines refer to a document available on the github repository. I recommend making this document an additional vignette.
There are two Rmd files that document data sources in this package. Technically this isn't any functionality offered by this package or related to using the package. It's documentation that is provided to detail how these elevation data were derived, so they are in the "data-raw" directory, per Hadley's R packages book/webpage. I do not feel that it is necessary to include these as vignettes since the end user does not need to run these functions and they are not functions provided by GSODR.
- As far as I can tell, the vignette cannot be accessed within R. After installation, the following code vignette(package='GSODR') yields 'no vignettes found'. Additionally, the rmarkdown file for the vignette has an obscure name, so even if it could be opened within R, this would be tricky given its obscure name. I recommend naming the vignette after the package so the following code will open it vignette('GSODR').
I've renamed the main vignette to "GSODR" and added two more, one detailing working with GeoPackage files and one with a more detailed look at how to find and download stations for a specific area of interest.
Additionally, when installing from GitHub use, devtools::install_github("adamhsparks/GSODR", build_vignettes = TRUE)
10a. get_GSOD.R:279: I am unsure why foreach iterators are used here. It would be great to do offer some parallelisation and options for progress bars here. If you wrapped this in a plyr::**ply function, you could easily make it such that: (1) years can easily be processed in parallel, (2) if not processed in parallel, then a progress-bar can be shown.
and
10b. get_GSOD.R:345: the user should have explicit control over thread use to avoid accidently fork-bombing their own computer. I recommend providing a threads argument in the get_GSOD function and defaulting it to 1.
Per @deniederhut's comment about parallelisation not improving the speed, I've tested and agree, parallelisation does not render the process significantly faster. I've removed this capability and wrapped these functions in plyr
functions, which provides progress bars.
11a. get_GSOD.R:392--393: Options are restored to the default state but this is not necessarily the previous state. These values could be cached at the beginning of the function. For example: use orginal_options <- options() at the start of the function, and at the end of the function use options(original_options).
and
11b. nearest_stations.R:28--29: This function changes options() and does not reset them after it finishes. This is against recommended CRAN practices. Please see previous comment on options for a suggestion on how to fix this.
This has been fixed
- get_GSOD.R:322: ROpensci guidelines recommend using the httr package instead of the RCurl package.
I've updated this with assistance from @hrbrmstr, RCurl is no longer used.
- get_GSOD.R:561: Here and elsewhere return(0) is used after a call to stop(). Please consider removing the return(0) as this is unnecessary .
These have been removed
- nearest_stations.R:34--44: This code could be rewritten for clarity and computational efficiency. I suggest something like this:
This was rewritten as suggested
- get_GSOD.R:83--85: Documentation contains '<' which is not rendered as a less than symbol in the html documentation.
This has been corrected
- get_GSOD.R:292: Instead of a call to utils::glob2rx, this could be used instead pattern='^.*\.gz$'
This has been changed as suggested
- get_GOSD.R:631: This could rewritten be as !vs %in% stations[[12]] for clarity.
This has been changed as suggested
- vignettes and browseVignettes function called on 'GSODR' returns no vignettes found
This has been fixed, when installing from GitHub use, devtools::install_github("adamhsparks/GSODR", build_vignettes = TRUE)
- it seems odd to me the enforce a missingness limit at download time -- I think I would either let the user do this in memory, or set the default limit to 0
This has been changed to not enforce any limit with the user having the option to set missing values as they wish.
- the test suite only seems to be concerned with handling bad input - am I missing the blocks where it is testing that data are returned correctly?
The unit tests have been expanded
- allowing multiple threads for calculation doesn't speed up the execution time as much as I would expect:
I tested this and found the same as well. Parallelisation has been removed to simplify the package dependencies and structure
Thanks to the reviewers for their excellent contributions and comments. It's helped to make this package a much more robust and useful package.
Thanks @adamhsparks for the changes and the responses to reviewers
@jeffreyhanson @deniederhut do either or both of you have time to take another look? if not, i'll do so. let me know
@sckott Yes, but not until after Christmas. Is that timeline okay?
On Fri, Dec 2, 2016, 03:22 Scott Chamberlain notifications@github.com wrote:
Thanks @adamhsparks https://github.com/adamhsparks for the changes and the responses to reviewers
@jeffreyhanson https://github.com/jeffreyhanson @deniederhut https://github.com/deniederhut do either or both of you have time to take another look? if not, i'll do so. let me know
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ropensci/onboarding/issues/79#issuecomment-264433061, or mute the thread https://github.com/notifications/unsubscribe-auth/AGjNrEGh_C7fEX9pYjukX5J7fAG6TSuKks5rD_9XgaJpZM4KWnlq .
I'll probably not be available to do much until after Christmas either. So I'm in no hurry at this point.
thanks @deniederhut
I'm pretty busy until Christmas too, so I'll take a look early in the new years.
thanks @jeffreyhanson
@jeffreyhanson @deniederhut if you get a chance soon have a look at submitter replies above and let us know if there's anything else you'd like changed/improved
@sckott It's on my to-do list for this weekend.
Update. I've made a minor change to the package that does not affect the functionality.
Due to the size of the initially added climate data and the addition of yet more data by myself just this week, the package size was 5.1Mb. I've moved the additional climate data to its own R package that's available only from GitHub, https://github.com/adamhsparks/GSODR.data.
This is referenced in the README and a vignette.
okay, you do plan to put it on CRAN though since I imagine this will go to CRAN.
GSODR is on CRAN and I'll update it once this review is done. The other package, GSODR.data is too large for CRAN unless it's split into two packages or I remove some of the data sets.
I'm happy to just leave it on GitHub since it's not essential for the use of GSODR.
Thoughts?
Ah, if not required for GSODR, then all good,
I guess maybe I wasn't initially clear. It's four sources of climate data that are formatted for use with the data GSODR provides. You can link these data to GSOD data by station ID and LAT/LON.
I have one example in a vignette using a one of the data sets, but for normal use it's not required. It just makes this data more available for use with GSOD data output produced by GSODR.
So since it's mentioned in a vignette, does that require a CRAN version?
edit I could put just the dataset that's referenced in the GSODR package itself and leave the other three as the optional download from GitHub. Is that a good option?
Okay, vignettes look good! I think I would still like to see some explicit correctness check in the unit tests somewhere, perhaps with a cached result, but I am recommending this package for approval. A few bugs I ran into while testing:
get_GSOD(years = 2010, station = "955510-99999")
raises namespace error -- Error: 'safely' is not an exported object from 'namespace:purrr'
... probably in line 490 or 507Error in fetch(key) :
lazy-load database '/Library/Frameworks/R.framework/Versions/3.2/Resources/library/GSODR/help/GSODR.rdb' is corrupt
max_missing
seems to accept negative values and NA
Thanks, I have fixed # 3 easily. The other two I'm not so sure of, I've never run into either one with all my testing.
Hi, thanks for the reminder. I'll take a look at this sometime this coming week.
its possible you have @deniederhut purrr
v0.1
, since safely
is not in that version https://github.com/hadley/purrr/blob/v0.1.0/NAMESPACE - @adamhsparks plz add a minimum version of purrr
to your DESCRIPTION file
lazy-load database ... is corrupt
this is solved by restarting R usually
thanks @jeffreyhanson
@sckott, I've added the minimum version of purrr to DESCRIPTION. Thanks for the tip.
I'm still on holiday in New Zealand and won't be back till the 18th. I was hoping that the local library's internet would be good enough to let me test the package, but that doesn't seem to be the case.
Would you mind if I submitted my review when I'm back home?
Also, I noticed that the DESCRIPTION is missing a Maintainer field.
@jeffreyhanson,
I'm following Hadley's preference, using Authors@R
You can also use separate Maintainer and Author fields. I prefer not to use these fields because Authors@R offers richer metadata.
@adamhsparks Ok, sounds good.
Summary
Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md
with a high-level description in the package root or ininst/
.Detail
R CMD check
(ordevtools::check()
) succeed? Paste and describe any errors or warnings: