Thanks for your submission @azizka! :grin: Initial checks revealed a few issues that you should solve before I assign reviewers. Do not hesitate to ask any question you might have!
Your package doesn't use roxygen2 yet for documentation. Our packaging guide says we strongly encourage it, but we'd actually like to start enforcing it, partly to help future contributors. To convert your existing documentation to roxygen2 documentation you can use the Rd2roxygen package. When using roxygen2 you won't have to update NAMESPACE by hand.
What is the licence on extra_gazetteers? Could these files live in a separate data package? (question, not request for change)
Why keep the tar.gz and zip of your package in the repo, especially since you already make them available via GitHub releases?
Here is the `goodpractice::gp()` output with comments.
It is good practice to

✖ write unit tests for all functions, and all package code in general. 62% of code lines are covered by test cases.

    R/cc_cap.R:18:NA
    R/cc_cap.R:19:NA
    R/cc_cap.R:29:NA
    R/cc_cen.R:19:NA
    R/cc_cen.R:21:NA
    ... and 580 more lines

You do not need to reach 100% code coverage but you can increase it a bit.
✖ not use "Depends" in DESCRIPTION, as it can cause name clashes, and poor interaction with other packages. Use "Imports" instead.

Feel free to ignore this if you need the sp Depends. By the way, is there any reason why you don't use the more modern sf? (question, not request for change)
✖ omit "Date" in DESCRIPTION. It is not required and it gets invalid quite often. A build date will be added to the package when you perform `R CMD build` on it.

Please remove Date from there.
✖ add a "URL" field to DESCRIPTION. It helps users find information about your package online. If your package does not have a homepage, add an URL to GitHub, or the CRAN package page.

✖ add a "BugReports" field to DESCRIPTION, and point it to a bug tracker. Many online code hosting services provide bug trackers for free, e.g. https://github.com, https://gitlab.com, etc.

Make it easy to find your GitHub repo by adding these links.
✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try to make your lines shorter than 80 characters.

    R\cc_cap.R:1:1
    R\cc_cap.R:24:1
    R\cc_cen.R:1:1
    R\cc_coun.R:1:1
    R\cc_coun.R:17:1
    ... and 210 more lines

Maybe styler can help with this.
✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data. Consider using vapply() instead.

    R\WritePyRate.R:88:68

Please change sapply in favor of vapply here.
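For illustration, a minimal sketch of the difference on hypothetical data (not the actual code in WritePyRate.R):

```r
taxa <- c("Abies alba", "Picea abies", "Pinus sylvestris")

# sapply() may return a vector or a list depending on its input,
# whereas vapply() declares the expected type and length of each result.
n_words <- vapply(strsplit(taxa, " "), length, FUN.VALUE = integer(1))
n_words
#> [1] 2 2 2
```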
✖ not import packages as a whole, as this can cause name clashes between the imported packages. Instead, import only the specific functions you need.

See this part of Hadley Wickham's R package book.
✖ not use exportPattern in NAMESPACE. It can lead to exporting functions unintentionally. Instead, export functions that constitute the external API of your package.

When using roxygen2 syntax you'll use the @export tag before each function to be exported.
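For example, a minimal roxygen2 sketch (the function below is hypothetical, not one of the package's actual tests); roxygen2 then writes export(flag_equal) into NAMESPACE instead of an exportPattern():

```r
#' Flag records with identical longitude and latitude
#'
#' @param x a data.frame with the columns decimallongitude and decimallatitude.
#' @return A logical vector, TRUE for records that pass the test.
#' @export
flag_equal <- function(x) {
  x$decimallongitude != x$decimallatitude
}
```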
✖ fix this R CMD check NOTE: found 1 marked Latin-1 string

I'm actually not sure what this means.
✖ fix this R CMD check NOTE: Found the following hidden files and directories: .travis.yml. These were most likely included in error. See section 'Package structure' in the 'Writing R Extensions' manual.

Please put such files and folders in .Rbuildignore.
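One way to do that, assuming the usethis package is available:

```r
# Adds an anchored, escaped regular expression for the file to .Rbuildignore,
# so that R CMD build leaves .travis.yml out of the built package.
usethis::use_build_ignore(".travis.yml")
```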
Please update this thread once you've done the changes, or as soon as you have a question!
Reviewers: @isteves @pakillo Due date: 2018-06-07
@azizka do you have any question or comment? ☺
Hi @maelle, sounds all good, thanks. I am working to include most of your suggestions. I'll defend my PhD next week, so I'm a little slow working on the package at the moment, but I will tackle this full on afterwards.
Oh good luck with your defence!
@azizka I hope your defence went well! I'm taking the risk to say congrats! 🎊
Did you have time to work on the package a bit?
@maelle yes, it worked out fine, thanks for the congratulations and your patience! I am back to work this week and am working on the package now.
🎊 Grattis! 🎊
and cool!
Hi @maelle,
Please find below my replies to your comments. I hope everything is covered; please let me know what else I can do. Thanks again for the patience :-).
Note: the most recent version is working; I'll update the badges as soon as the problems with Travis are solved (https://github.com/travis-ci/travis-build/pull/1382)
Replies:
Your package doesn't use roxygen2 yet for documentation
Done - Migrated documentation and namespace generation to roxygen2
What is the licence on extra_gazetteers? Could these files live in a separate data package?
The extra gazetteers are open source, they could live in a separate data package, but I am not sure that this is necessary/useful since they have very specialized applicability.
Why have the tar.gz and zip of your package in the repo especially since you already make them available via the use of GitHub releases?
Done - Removed the .tar.gz and zip
You do not need to reach 100% code coverage but you can increase it a bit.
It is at 74% now
Feel free to ignore this if you need the sp Depends. By the way, is there any reason why you don't use the more modern sf? (question, not request for change)
Done - moved sp from 'Depends' to 'Imports'. Yes, sf would be great since it's faster and tidy, but unfortunately some of the dependencies do not yet support sf formats (e.g. raster, geosphere).
Please remove Date from there
Done - Removed the date from the description file
Make it easy to find your GitHub repo by adding these links.
Done - Added the urls to the description file
Avoid long code lines, it is bad for readability.
Done – cut at 80
Please change sapply in favor of vapply here
Done
not import packages as a whole, instead, import only the specific functions you need
Done
When using roxygen2 syntax you'll use the export tag before each function to be exported.
Done – switched to roxygen2
fix this R CMD check NOTE: found 1 marked Latin-1 string
I could not reproduce this note, do you have additional information on the operating system and check settings, etc? I suspect it might be related to a non-ASCII character in one of the datasets in the package, but I am not entirely sure either.
Please put such files and folders in .Rbuildignore.
Done, the .travis.yml is part of .Rbuildignore now
@azizka thanks for all your work! Am going to assign reviewers in a bit. Two last comments before the reviews:
could you please add a code of conduct to your repo? cf our guidelines
could you please respond to the two issues opened in your repo? If these are bugs, please solve them before the reviews and update this thread. :-)
:wave: @isteves @Pakillo! Thanks for accepting to review this package! Your reviews are due 2018-06-07 but @pakillo I've already noted that yours might be late (@azizka the reviewers of your package were lined up a while ago :wink:)
Here is our reviewer template and our reviewer guide.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation: URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Estimated hours spent reviewing: 7
Hi @azizka - great, comprehensive package! I'm impressed! Here are some comments related to the sections above:
- Examples are currently included in the Tutorials folder & wiki; consider moving them to a vignettes folder so that users can access them locally using browseVignettes()
- urbanareas: perhaps specify that the file path is not the one shown: instead of `load("extra_gazetteers/urbanareas.rda")`, `load("your/path/urbanareas.rda")`
- WritePyRate: got an error when I tried `WritePyRate(x = exmpl, fname = "test", status = status)`: `Error in [.data.frame(x, , taxon) : undefined columns selected`
- A plot call is not present in the example for plot.spatialvalid
- library(CoordinateCleaner) calls not needed within individual test scripts
- Test_fossillevel_cleaning.R filename should be all lower case for consistency
- I suggest using more descriptive names for context()
- Using devtools::test(), I got the following warning - `encoding` is deprecated; all files now assumed to be UTF-8 - as well as these:

        test_wrapper_functions.R:16: warning: (unknown) running GBIF test, flagging records around Copenhagen
        test_wrapper_functions.R:28: warning: CleanCoordinates countries argument produces correct output running GBIF test, flagging records around Copenhagen
        test_wrapper_functions.R:96: warning: fossil wrapper cleaning works countries missing, countrycheck set to FALSE
        test_wrapper_functions.R:96: warning: fossil wrapper cleaning works running GBIF test, flagging records around Copenhagen
        test_wrapper_functions.R:97: warning: fossil wrapper cleaning works countries missing, countrycheck set to FALSE
        test_wrapper_functions.R:97: warning: fossil wrapper cleaning works running GBIF test, flagging records around Copenhagen
Here are some additional comments. Let me know if anything is unclear!
All the text/documentation looks great on the whole! The devtools::spell_check() caught two typos that should be fixed:

    Herbariorum    institutions.Rd:13
    THe            CleanCoordinates.Rd:101

(The other "spelling errors" that it caught seem to be fine.)
For the most part, the package is both clean and organized. I especially like the cc_* functions, which each tackle a specific type of error and have consistent styles.

Among the other files, there was a mix of style/naming conventions. Is there a reason for that? (asking honestly here, since I may be missing something) I understand that for methods.spatialvalid you must use the . convention to work with classes, but I'm unsure why you switched to capitals for CleanCoordinates* and WritePyRate. These functions (as well as tc_* and the tutorials) also include a lot of variables with . rather than _. Changing the variable names so that they use a consistent style would enhance the readability/usability of the code.
In the CleanCoordinates* functions, it seems that you'd be able to greatly reduce the amount of code through some of these structural changes:

- use `...` to pass arguments specified in CleanCoordinates* to the cc_* functions called within it
- rather than writing out TRUE/FALSE for each test, perhaps use a single argument (`tests`) that can take a vector of desired tests (`c("sea", "gbif", "equ")`)
- set argument defaults to NULL directly in the argument list (`capitals.ref = NULL`), instead of in an extra step within the function
- currently, a variable is returned for each test. If FALSE is chosen, the returned variable contains only NA's; if TRUE is chosen, the specified test is run and the flagged output is saved. This results in code repetition that may be avoided by (a) specifying the NA output as the default at the beginning of the code and (b) using a data frame/list structure to store the results. To clarify:

Current code:
```r
if (capitals) {
  cap <- cc_cap(x,
    lon = lon, lat = lat, buffer = capitals.rad, ref = capitals.ref,
    value = "flags", verbose = verbose
  )
} else {
  cap <- rep(NA, dim(x)[1])
}
```

Proposed change:

```r
df <- data.frame(matrix(NA, nrow = nrow(x), ncol = 12))
colnames(df) <- c("val", "equ", "zer", "cap", "cen", "sea", "urb", "con",
                  "otl", "gbf", "inst", "dpl")

if (capitals) {
  df$cap <- cc_cap(x, value = "flags", ...)
}
# and so on for the other tests
# no `else {cap <- rep(NA, dim(x)[1])}` required
```
- the CleanCoordinates* functions also have warnings/checks that perhaps are more appropriate within the cc_* functions (example)
- if `is.null(countries)` is TRUE, then countries does not need to be set to NULL again:

```r
if (is.null(countries)) {
  countries <- NULL
  if (countrycheck) {
    countrycheck <- FALSE
    warning("countries missing, countrycheck set to FALSE")
  }
}
```
- as far as I can tell, cc_dupl can be used in place of this code
- In WritePyRate, since `fname` and `status` are required, consider moving them earlier in the argument list (positions 2, 3) and not setting a default value (currently NULL)
In cc_val (starting here), there is redundancy in as.numeric/as.character/suppressWarnings that could be reduced. For example, like this:

```r
x[[lon]] <- suppressWarnings(as.numeric(as.character(x[[lon]])))
x[[lat]] <- suppressWarnings(as.numeric(as.character(x[[lat]])))

out <- list(is.na(x[[lon]]),
            is.na(x[[lat]]),
            abs(x[[lon]]) > 180,
            abs(x[[lat]]) > 90)
```
Currently, the map of the different tests distinguishes clean and flagged points by color, and distinguishes tests by shape. Are the different tests important? If not, perhaps they can be grouped into a "flagged" category. If yes, then perhaps color could be used to better distinguish them.
The description of `value` in the Rd files (for example, here) can be clarified by moving ("flags") up in the sentence:

> Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flags”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”

Also, it may help to stick with options for `value` that are both verbs or both adjectives (but this is just a minor suggestion). For example: clean/flag, clean/flagged, cleaned/flagged.
`??CoordinateCleaner` has strange output:

    c("\Sexpr[results=rd,stage=build]tools:::Rd_package_title(\"#1\")", "CoordinateCleaner") Automated Cleaning of Occurrence Records from Biological Collections

I'm not sure where this error comes from.
Many thanks for your review @isteves, great points! 😸
No maintainer specified in DESCRIPTION, but maybe ok? It is ok because it's generated automatically from Authors@R when building the package; it's actually better not to have it.
@azizka You can wait for @Pakillo's review before responding, or respond before that, since @Pakillo's review was going to be a bit late, so as you prefer.
Thanks again @isteves!
@isteves I forgot to ask you to fill the "Estimated hours spent reviewing:" field in the review template, thanks in advance 🙏
@Pakillo :wave: will you soon have time to review the package? 😺
Hej @maelle and @isteves,
thanks for the comprehensive, helpful and very constructive review. I managed to address most of the issues, and argued back or asked @maelle for help on some of them. Please find a point-by-point reply below (my replies in italics). Please let me know if you have any comments/suggestions.
Thanks again!
Examples are currently included in the Tutorials folder & wiki; consider moving them to a vignettes folder so that users can access them locally using browseVignettes()
Done, moved tutorials to vignettes
urbanareas: perhaps specify that file path is not the one shown: instead of load("extra_gazetteers/urbanareas.rda"), load("your/path/urbanareas.rda")
Done, changed as suggested
WritePyRate: got an error when I tried WritePyRate(x = exmpl,fname = "test", status = status): Error in [.data.frame(x, , taxon) : undefined columns selected
Done, fixed the code of the example (replaced "identified_name" by "accepted_name")
A plot call is not present in the example for plot.spatialvalid.
Done, added a plot call to the example.
No maintainer specified in DESCRIPTION, but maybe ok?
No changes, see @maelle's comment.
No contribution guidelines in the README or CONTRIBUTING
Done, added a CONTRIBUTING.md
library(CoordinateCleaner) calls not needed within individual test scripts
Done, removed the library() calls
Test_fossillevel_cleaning.R filename should be all lower case for consistency
Done, changed to lowercase
I suggest using more descriptive names for context()
Done
Using devtools::test(), I got the following warning.
Done, switched from a warning to a message for cc_gbif
according to the guidelines, badges in README.md should be below the package name (unsure how important this is)
Done, moved badges downwards
include "how the package compares to other similar packages and/or how it relates to other packages" in README.md
I am not sure what exactly is needed here. I could add the table comparing CoordinateCleaner with scrubr, from the presubmission query (https://github.com/ropensci/onboarding/issues/199), but this seems excessive. Is this really necessary @maelle?
add citation info to end of in README.md
Done, added a suggested citation, will change to final once the corresponding manuscript is published
"Avoid starting the description with the package name or This package ..." - not sure if this also applies to the README package description, but you may want to change it to match the description in the DESCRIPTION file.
Done, synchronized the first sentences among DESCRIPTION and Readme.md
All the text/documentation looks great on the whole! The devtools::spell_check() caught two typos that should be fixed: Herbariorum institutions.Rd:13 THe CleanCoordinates.Rd:101 (The other "spelling errors" that it caught seem to be fine.)
Done, fixed number 2; number 1 is not a typo but the correct Latin spelling of the name of the institution "Index Herbariorum"
For the most part, the package is both clean and organized. I especially like the cc functions, which each tackle a specific type of error and have consistent styles. Among the other files, there was a mix of style/naming conventions. Is there a reason for that? (asking honestly here, since I may be missing something) I understand that for methods.spatialvalid you must use the . convention to work with classes, but I'm unsure why you switched to capitals for CleanCoordinates* and WritePyRate. These functions (as well as tc* and the tutorials) also include a lot of variables with . rather than _. Changing the variable names so that they use a consistent style would enhance the readability/usability of the code.
Yes, that is a good point. The wrapper functions originally used CamelCase to show that they differed from the individual tests. I agree that this was inconsistent, and switched to a consistent underscore_case naming scheme. Since this might cause some compatibility problems, I increased the version number to 2.x with the latest version.
use ... to pass arguments specified in CleanCoordinates to the cc_ functions called within it
I'd prefer not to do this because (i) it would mean changing the argument names in all the cc_* functions and somewhat break their consistency; for instance the "buffer" argument in cc_cap would need to become cap_buffer and in cc_cen cen_buffer, and (ii) while it is true that the list of arguments is long, it is currently obvious even to inexperienced users of CleanCoordinates that the buffers can be changed and should be thought about.
rather than writing out TRUE/FALSE for each test, perhaps use a single argument (tests) that can take a vector of desired tests (c("sea", "gbif", "equ"))
Done, switched to a character vector for all wrapper functions
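A hypothetical call illustrating the new interface (a tiny made-up data set; the exact test names and defaults here are assumptions, see `?clean_coordinates` for the real ones):

```r
library(CoordinateCleaner)

# a small hypothetical set of occurrence records
exmpl <- data.frame(species          = c("A", "B", "C"),
                    decimallongitude = c(12.58, 0, 30.2),
                    decimallatitude  = c(55.68, 0, 30.2))

# run only the selected tests, passed as a character vector
flags <- clean_coordinates(x = exmpl,
                           tests = c("capitals", "equal", "seas", "zeros"))
```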
set argument defaults to NULL directly in the argument list (capitals.ref = NULL), instead of in an extra step within the function
Done for all wrapper functions
currently, a variable is returned for each test. If FALSE is chosen, then the returned variable contains only NA's; if TRUE is chosen, the specified test is run and the flagged output is saved. This results in code repetition that may be avoided by (a) specifying the NA output as the default in the beginning of the code and (b) using a data frame/list structure to store the results. To clarify:
Done for all wrapper functions
the CleanCoordinates functions also have warnings/checks that perhaps are more appropriate within the cc_ functions (example)
I'd prefer to keep them where they are, since this way the error message can be more informative for the user of the CleanCoordinates wrapper function, i.e. it can state directly how to fix the problem and which tests to drop if they are causing the error. For instance, if no country information is available, just omitting the countries test from clean_coordinates might be a good solution, whereas cc_coun absolutely requires the country information.
as far as I can tell, cc_dupl can be used in place of this code
Done, replaced by cc_dupl
In WritePyRate, since fname and status are required, consider moving them earlier in the argument list (positions 2, 3) and not setting a default value (currently NULL)
Done, moved the arguments up in the list and removed the defaults.
In cc_val (starting here), there is redundancy in as.numeric/as.character/suppressWarnings that could be reduced.
Done, changed code as proposed
Currently, the map of the different tests distinguishes clean and flagged points by color, and distinguishes tests by shape. Are the different tests important? If not, perhaps they can grouped into a "flagged" category. If yes, then perhaps color could be used to better distinguish them.
Done. The marking of the individual tests can be switched on or off with the details option of plot.spatialvalid. I think it can be interesting to visualize which tests failed, for instance to see if there are spatial patterns in data quality, but I agree that it might also be confusing. I switched the default for details to FALSE, so that by default there is now only a colour difference between clean and flagged records.
The description of value in the Rd files (for example, here) can be clarified by moving ("flags") up in the sentence: Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flags”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”
Done, moved the part in the brackets as suggested
Also, it may help to stick with options for value that are both verbs or both adjectives (but this is just a minor suggestion). For example: clean/flag, clean/flagged, cleaned/flagged.
Done, changed to clean/flagged
??CoordinateCleaner has strange output: c("\Sexpr[results=rd,stage=build]tools:::Rd_package_title(\"#1\")", "CoordinateCleaner") Automated Cleaning of Occurrence Records from Biological Collections. I'm not sure where this error comes from.
I have no clue what is going on here. Any ideas @maelle?
An additional question:
The latest build passes without problems, but fails on Travis (see below). I assume the vignettes are too RAM-heavy. I tried to solve this, but couldn't. Do you have any experience with this problem? Thanks!
Warning in system(command) : system call failed: Cannot allocate memory Warning in system(command) : error in running command Error: processing vignette 'Cleaning_PBDB_fossils_with_CoordinateCleaner.Rmd' failed with diagnostics: pandoc document conversion failed with error 127
Thanks @azizka!
The table comparing CoordinateCleaner with scrubr might be excessive, but adding a few lines about how CoordinateCleaner fits into the occurrence data R tooling, and where scrubr might be used in combination, would be good.

Did you generate the package level documentation using usethis::use_package_doc()?
Could you ask the Travis question in Slack to see whether someone had the same issue? I'm afraid I can't help directly.
Alright, I added the following text to the Readme.md (including links): CoordinateCleaner can be particularly useful to ensure geographic data quality when using data from GBIF (e.g. obtained with rgbif) for historical biogeography (e.g. with BioGeoBEARS or phytools), automated conservation assessment (e.g. with speciesgeocodeR or conR) or species distribution modelling (e.g. with dismo or sdm). See scrubr and taxize for complementary taxonomic cleaning.
No, I migrated the documentation to Roxygen2 with Rd2roxygen as you suggested earlier.
Great, I'll ask on slack.
Thanks :-)
:wave: @pakillo could you please get your review in soon? Thank you! 😺
I'm very sorry for the delay -- the end of term workload is being rather crazy. I expect to have it ready this week. Apologies
Thanks for the update @Pakillo!
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation: URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS:

- [ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Estimated hours spent reviewing: 8
Hi all,
Many thanks @azizka for developing CoordinateCleaner, @maelle for the invitation to review (and the patience!), and @isteves for sharing the review and breaking the ice :)
I love CoordinateCleaner. I have already used it for my own research and it really fills a gap in terms of cleaning occurrence data (or any coordinate data, really). It brings many new functionalities, it is well thought out, relatively easy to use, and it works. So thanks a lot.
I agree with all comments by @isteves above. I think they have greatly contributed to making the package even nicer and more consistent. Some remaining issues or comments:
Installation of current version from GitHub takes a 93 MB download, which seems maybe too much? Apparently ~70 MB of it is git history and files (.git folder). I wonder if there is some big file hidden in history, which could perhaps be removed (e.g. using https://rtyley.github.io/bfg-repo-cleaner/). That package size was not a problem for me but could be for some users with slow connections. I think rOpenSci experts can advise more on this, or confirm this is actually OK.
Apart from git history, there are some big files in the current version of the repo. First, two largish PDF files of rendered vignettes in the Tutorials folder, which can probably be removed? I am not sure about rOpenSci conventions here, but maybe you can store just HTML rendered versions of the vignettes (which weigh much less, ~110 KB for the GBIF vignette) in a gh-pages branch so they can be seen nicely. Or, even better, use pkgdown to create a website for your package, including not only rendered vignettes but also rendered examples, etc.
The second set of large files are data files (particularly GIS datasets like landmass.rda, >2 MB, or urbanareas.rda, >10 MB). As most of these datasets actually come from Natural Earth, I think they could perhaps be downloaded on purpose when needed, probably using the rnaturalearth package. Also, some of these datasets are directly available in the rnaturalearthdata and rnaturalearthhires packages (all already part of rOpenSci), so you could maybe just use them so as not to duplicate files.
The Readme is good, just a few typos and minor comments. I have also made a pull request fixing some typos I found along the way.
Typo in badge linking to Travis (fixed in pull-request). Also, Travis is giving build error, but I see you are already on track.
Provide hyperlinks to NEWS, vignettes, and contributing files, so the reader just has to click from the README to see them?
When discussing related packages, I would also mention biogeo (https://github.com/cran/biogeo).
Use README.Rmd? So the reader can see the output of the example code in README without having to run it by themselves. Maybe do not show all output, but some of it.
Current example code gives errors. I think it needs updating after the API changes:

- `dsl <- clean_coordinatesDS(exmpl)` gives an error (changed to `dsl <- clean_dataset(exmpl)` in the pull request)
- `clean_coordinatesFOS` must be changed to `clean_fossils`
- `dc_ddmm` does not exist. Change to `cd_ddmm`?
- `tc_range` does not exist. Has it changed to `cf_range`?
As explained above, consider removing large data files from the package? Use rnaturalearth and associated packages instead (as some functions, e.g. cc_coun, already do). Maybe only buffland_1deg (currently in extra_gazetteers) would need to be included.
Looks like data in capitals.rda (and parts of centroids.rda) are already contained in countryref.rda. If possible, avoid duplication.
institutions: If there is a script used to build this data frame, it might be good to include it (e.g. in a data-raw folder) for the sake of reproducibility and the possibility of making further changes in the future.
urbanareas.rda is >10 MB big, and countryborders >2 MB. As all these seem to come from Natural Earth, consider using rnaturalearth to access them when needed?

institutions_utf8 seems duplicated with institutions.rda in the data folder? (only having 1 row more). Keep only one?

I think the documentation (Rd) files in this folder are not available to the user. Move all the data to the data folder, and make sure all are documented?
Documentation (man folder): the Rd file for CoordinateCleaner-package.Rd needs to be revised. Seems the conversion to roxygen hasn't worked well.
I think the help files for CleanCoordinates, CleanCoordinatesDS and CleanCoordinatesFOS should mark them as Deprecated, see https://ropensci.github.io/dev_guide/evolution.html#functions-deprecate-defunct.
Default longitude and latitude arguments for the fossil functions (cf_outl, cf_range...) are "lng" and "lat", respectively, while the params explanation states the defaults are "decimallongitude" and "decimallatitude" (as in the rest of the package). Use the latter throughout?
Several functions using Spatial* objects assume they already have a geographical (latlon) projection (e.g. cc_cap, L67-68). As a suggestion, given they are Spatial objects it might be better to read their projection (proj4string) directly and ensure it is indeed correct. Otherwise, give a warning and optionally reproject on the fly? (e.g. using spTransform). Also, as sf will become more and more popular, consider enabling sf objects as arguments too, not only sp objects (e.g. doing an internal conversion as(sf, "Spatial")).
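A rough sketch of what such a check could look like (the helper below is hypothetical, not the package's implementation; reprojecting Spatial* objects additionally needs rgdal, and the sf-to-sp coercion method is registered by the sf package):

```r
library(sp)

check_ref <- function(ref) {
  # accept sf objects by coercing them to sp
  if (inherits(ref, "sf")) {
    ref <- as(ref, "Spatial")
  }
  wgs84 <- CRS("+proj=longlat +datum=WGS84")
  if (is.na(proj4string(ref))) {
    warning("no CRS found for the reference, assuming WGS84 lon/lat")
    proj4string(ref) <- wgs84
  } else if (!grepl("longlat", proj4string(ref), fixed = TRUE)) {
    # reproject on the fly instead of silently assuming lon/lat coordinates
    ref <- spTransform(ref, wgs84)
  }
  ref
}
```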
cc_coun: Although there exists cc_sea, I think it would be good if this function flagged every coordinate not within the supposed country, even if those coordinates fall on sea.
cc_sea gives TRUE if coordinates fall on land, and FALSE if coordinates fall on sea. I think this is confusing and may induce errors in future users (expecting TRUE if falling on sea, given the function name). To respect the convention across the whole package about TRUE/FALSE, maybe rename the function as cc_land? Which would give TRUE when a location really falls on land.
cc_urb: download urbanareas from Natural Earth on demand?
plot.spatialvalid seems to use landmass.rda for plotting, but it could also use borders from ggplot2 (probably lighter and faster, and useful in case landmass is finally taken out of the package).
clean_dataset: the example `flags <- clean_dataset(test)` gives a warning: `In FUN(X[[i]], ...) : Geographic spann to small, check 'min_span'`
The tidyverse package is listed in Suggests and actually loaded in the vignettes, but I think only a small subset of tidyverse packages is needed? Maybe better to use only those?

Is the Tutorials folder with rendered PDF vignettes necessary? Could you use lighter HTML vignettes instead of PDF? Use pkgdown?
The PBDB vignette doesn't run for me. It actually seems to be a problem with the paleobioDB package?

```r
dat <- pbdb_occurrences(base_name = "Magnoliopsida", vocab = "pbdb", limit = 5000,
                        show = c("coords", "phylo", "attr", "loc", "time", "rem"))
#> Error in !sapply(data, function(l) is.null(l) | (ncol(l) == 0) | (nrow(l) == :
#>   invalid argument type
```

The vignettes need a further revision to update links, pkg versions, typos, etc.
Thanks a lot for your thorough review @Pakillo! 😀
@azizka now both reviews are in!
@azizka with regards to @Pakillo's tidyverse comment, I saw a post about this exactly on Twitter recently: https://twitter.com/dataandme/status/1009412422476713984?lang=en
The post it links to: https://www.tidyverse.org/articles/2018/06/tidyverse-not-for-packages/
Thanks @isteves ! Always good to be backed by Mara and Hadley ;-)
Regarding my comment about cc_sea:

> cc_sea gives TRUE if coordinates fall on land, and FALSE if coordinates fall on sea. I think this is confusing and may induce errors in future users (expecting TRUE if falling on sea, given the function name). To respect the convention across the whole package about TRUE/FALSE, maybe then rename the function as cc_land? Which would give TRUE when a location really falls on land.

I later realised that all functions in the package follow the same scheme, e.g. cc_urb also gives TRUE if coordinates are OK and FALSE if coordinates fall in an urban area. This scheme might still confuse some quick users at first, because a TRUE result actually means a negative response to the question asked by the function (does it fall on sea? does it fall in an urban area? etc. - TRUE means NO). But as it is consistent across the package, it's probably OK.

In other words, please ignore my comment above about cc_sea!
Thanks
:wave: @azizka do you have an idea of when you'll be able to respond to @Pakillo's review? Thank you!
Hi @Pakillo, thanks for the great review. @maelle I am currently without stable internet, but will work on this next week. Cheers!
Thanks for the update @azizka, enjoy your time off!
Hej,
I have now addressed most comments of the second review. The last thing remaining is the error reported by @Pakillo with paleobioDB
The PBDB vignette PBDB doesn't run for me. It actually seems a problem with the paleobioDB package? dat <- pbdb_occurrences(base_name = "Magnoliopsida", vocab = "pbdb", limit = 5000, show = c("coords", "phylo", "attr", "loc", "time", "rem")) Error in !sapply(data, function(l) is.null(l) | (ncol(l) == 0) | (nrow(l) == : invalid argument type
I could not reproduce this error, but it also seems to be the problem that crashes travis. @Pakillo which operating system and R version did you use?
I'm looking at that vignette. There shouldn't be a chunk with eval=TRUE containing install_github("ropensci/paleobioDB"), since the package is in Suggests.
I get the error too, here it is below with session info.
```r
dat <- paleobioDB::pbdb_occurrences(base_name = "Magnoliopsida", vocab = "pbdb", limit = 5000,
                                    show = c("coords", "phylo", "attr", "loc", "time", "rem"))
#> Error in !sapply(data, function(l) is.null(l) | (ncol(l) == 0) | (nrow(l) == : invalid argument type
rownames(dat) <- NULL
#> Error in rownames(dat) <- NULL: object 'dat' not found
```

Created on 2018-07-30 by the reprex package (v0.2.0).
Hi,
I have updated all my packages and still get this error.
This is my session info:
Session info -----------------------------------------------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
system x86_64, mingw32
ui RStudio (1.1.453)
language (EN)
collate English_United Kingdom.1252
tz Europe/Paris
date 2018-07-31
Packages ---------------------------------------------------------------------------------------------------------
package version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
base 3.5.1 2018-07-02 local
bindr 0.1.1 2018-03-13 CRAN (R 3.5.0)
bindrcpp 0.2.2 2018-03-29 CRAN (R 3.5.0)
bitops 1.0-6 2013-08-17 CRAN (R 3.5.0)
colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0)
compiler 3.5.1 2018-07-02 local
CoordinateCleaner 2.0-1 2018-07-31 Github (azizka/CoordinateCleaner@1d95007)
countrycode 1.00.0 2018-02-11 CRAN (R 3.5.0)
crayon 1.3.4 2017-09-16 CRAN (R 3.5.0)
curl 3.2 2018-03-28 CRAN (R 3.5.0)
datasets 3.5.1 2018-07-02 local
devtools 1.13.6 2018-06-27 CRAN (R 3.5.1)
digest 0.6.15 2018-01-28 CRAN (R 3.5.0)
dplyr 0.7.6 2018-06-29 CRAN (R 3.5.1)
geosphere 1.5-7 2017-11-05 CRAN (R 3.5.0)
ggplot2 3.0.0 2018-07-03 CRAN (R 3.5.1)
git2r 0.23.0 2018-07-17 CRAN (R 3.5.1)
glue 1.3.0 2018-07-17 CRAN (R 3.5.1)
graphics 3.5.1 2018-07-02 local
grDevices 3.5.1 2018-07-02 local
grid 3.5.1 2018-07-02 local
gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
gtools 3.8.1 2018-06-26 CRAN (R 3.5.0)
httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
knitr 1.20 2018-02-20 CRAN (R 3.5.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.5.1)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
maps 3.3.0 2018-04-03 CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
methods 3.5.1 2018-07-02 local
munsell 0.5.0 2018-06-12 CRAN (R 3.5.1)
paleobioDB 0.5.0 2016-09-02 CRAN (R 3.5.1)
pillar 1.3.0 2018-07-14 CRAN (R 3.5.1)
pkgconfig 2.0.1 2017-03-21 CRAN (R 3.5.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
purrr 0.2.5 2018-05-29 CRAN (R 3.5.0)
R6 2.2.2 2017-06-17 CRAN (R 3.5.0)
raster 2.6-7 2017-11-13 CRAN (R 3.5.0)
Rcpp 0.12.18 2018-07-23 CRAN (R 3.5.1)
RCurl 1.95-4.11 2018-07-15 CRAN (R 3.5.1)
rgeos 0.3-28 2018-06-08 CRAN (R 3.5.1)
rjson 0.2.20 2018-06-08 CRAN (R 3.5.0)
rlang 0.2.1 2018-05-30 CRAN (R 3.5.0)
rstudioapi 0.7 2017-09-07 CRAN (R 3.5.0)
scales 0.5.0 2017-08-24 CRAN (R 3.5.0)
sp 1.3-1 2018-06-05 CRAN (R 3.5.0)
stats 3.5.1 2018-07-02 local
tibble 1.4.2 2018-01-22 CRAN (R 3.5.0)
tidyselect 0.2.4 2018-02-26 CRAN (R 3.5.0)
tools 3.5.1 2018-07-02 local
utils * 3.5.1 2018-07-02 local
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
yaml 2.2.0 2018-07-25 CRAN (R 3.5.1)
I think the problem is in the paleobioDB package, particularly when using gtools::smartbind to make a data frame with records. This is my traceback of the error:

    6: smartbind(df, reg)
    5: .make_data_frame(data_list[["records"]])
    4: .parse_raw_data(raw_data)
    3: .get_data_from_uri(uri)
    2: .pbdb_query("occs/list", query = l)
    1: paleobioDB::pbdb_occurrences(base_name = "Magnoliopsida", vocab = "pbdb", limit = 5000,
           show = c("coords", "phylo", "attr", "loc", "time", "rem"))
Maybe if you file an issue at paleobioDB they can help fix it
Cheers
Alright, thanks!
This seems to be a problem of paleobioDB with R 3.5.0 and upwards. I posted an issue at the paleobioDB repository (https://github.com/ropensci/paleobioDB/issues/34).
@azizka if that's fine with you, I'd suggest we wait for the paleobioDB issue to be solved before asking the reviewers for their final stamp of approval.
sure, that's fine with me, thanks @maelle. There are some time constraints on my side for getting the build to work, though. So in case the issue with paleobioDB cannot be fixed within the next two weeks or so, I'll try to find alternative data for the vignette, and for instance include it in the package. I hope that's also OK.
Hej @maelle,
since it will take a while until the issue with paleobioDB is solved, I have for now included an example fossil dataset in the package, which can be used in the vignette. The Travis build is passing now.
The test coverage of the package dropped slightly again due to the code additions from the review. I will work on this, but am already posting my replies to the second review now, so things can move ahead.
cheers,
Alex
Hej @Pakillo and @maelle ,
thanks so much for the constructive review. I have addressed almost all comments. Below is a point-by-point reply:
Installation of current version from GitHub takes a 93 MB download, which seems maybe too much? Apparently ~70 MB of it is git history and files (.git folder). I wonder if there is some big file hidden in history, which could perhaps be removed (e.g. using https://rtyley.github.io/bfg-repo-cleaner/). That package size was not a problem for me but could be for some users with slow connections. I think rOpenSci experts can advise more on this, or confirm this is actually OK.
Done. Used bfg as suggested, reduced repository size to c. 10 MB.
Apart from git history, there are some big files in the current version of the repo. First, two largish PDF files of rendered vignettes in Tutorials folder which can probably be removed? I am not sure about rOpenSci conventions here, but maybe you can store just HTML rendered versions of the vignettes (which weigh much less, ~110 KB for the GBIF vignette) in a gh-pages branch so they can be seen nicely. Or, even better, use pkgdown to create a website for your package, including not only rendered vignettes but also rendered examples, etc.
Done. Removed the Tutorials folder, and included all tutorials as vignettes in the package. Created a pkgdown page for the repository as suggested (https://azizka.github.io/CoordinateCleaner/). Also added the former wiki as articles to the page.
The other second set of large files are data files (particularly GIS datasets like landmass.rda, >2 MB, or urbanareas.rda, >10MB). As most of these datasets actually come from Natural Earth, I think they could perhaps be downloaded on purpose when needed, probably using rnaturalearth package. Also, some of these datasets are directly available in rnaturalearthdata and rnaturalearthhires packages (all already part of rOpenSci), so you could maybe just use them so as not to duplicate files.
Partly done. Removed urbanareas.rda and switched to data from rnaturalearth::ne_download as the default for cc_urb. I kept landmass.rda, since it is used extensively in the package, and no equivalent is directly available from Natural Earth (to my knowledge).
The Readme is good, just a few typos and minor comments. I have also made a pull request fixing some typos I found along the way.
Done, accepted the pull request.
Typo in badge linking to Travis (fixed in pull-request). Also, Travis is giving build error, but I see you are already on track.
Done. Fixed the typo and solved the Travis build error related to paleobioDB; the Travis build is passing now.
Provide hyperlinks to NEWS, vignettes, contributing files, so the reader just have to click from README to see them?
Done. Added the links.
When discussing related packages, I would also mention biogeo (https://github.com/cran/biogeo).
Done. Added the link.
Use README.Rmd? So the reader can see the output of the example code in README without having to run it by themselves. Maybe do not show all output, but some of it.
Not done. I prefer not to show output in the Readme, since I find this reduces readability and makes it more difficult to find relevant information. The vignettes do show detailed output of most functions.
Current example code gives errors. I think it needs updating after the API changes: dsl <- clean_coordinatesDS(exmpl) gives error (Changed to dsl <- clean_dataset(exmpl) in pull request) clean_coordinatesFOS must be changed to clean_fossils dc_ddmm does not exist. Change to cd_ddmm? tc_range does not exist. Has it changed to cf_range?
Done. Fixed error due to API changes following the first review. Sorry about that.
As explained above, consider removing large data files from the package? Use rnaturalearth and associated packages instead (as some functions, e.g. cc_coun, already do). Maybe only buffland_1deg (currently in extra_gazetteers) would need to be included.
Done. See above.
Looks like data in capitals.rda (and parts of centroids.rda) are already contained in countryref.rda. If possible, avoid duplication.
Done. Combined all relevant data in one dataset (“countryref”)
institutions: If there is an script used to build this data frame, it might be good to include it (e.g. in a data-raw folder) for the sake of reproducibility and possibility of making further changes in the future.
Not done. There were scripts involved for the automated geo-referencing and data wrangling, but since most of the data had to be georeferenced manually, there is no consistent pipeline.
extra-gazetteers
urbanareas.rda is >10 MB big, and countryborders >2 MB. As all these seem to come from Natural Earth, consider using rnaturalearth to access them when needed?
Done. Removed the files from the extra gazetteers folder, they are now fetched using rnaturalearth on demand
institutions_utf8 seems duplicated with institutions.rda in data folder? (only having 1 row more). Keep only one?
Done. Removed institutions_utf8
I think the documentation (Rd) files in this folder are not available to the user. Move all the data to the data folder, and make sure all are documented?
Done. Removed institutions_utf8 and urbanareas (see above), moved buffland_1deg to the data folder and added documentation.
Rd file for CoordinateCleaner-package.Rd needs to be revised. Seems the conversion to ROxygen hasn't worked well.
Done. Fixed the package documentation.
I think the help files for CleanCoordinates, CleanCoordinatesDS, CleanCoordinatesFOS, should mark them as Deprecated, see https://ropensci.github.io/dev_guide/evolution.html#functions-deprecate-defunct.
Done. I created a deprecated help file as suggested. I have a question, however: the file can be accessed without a problem via ??CoordinateCleaner, but ?CoordinateCleaner-deprecated throws an error. Any ideas why? @Pakillo, @maelle
Default longitude and latitude arguments for fossils functions (cf_outl, cf_range...) are "lng" and "lat", respectively, while the params explanation states the default are "decimallongitude" and "decimallatitude" (as in the rest of the package). Use the latter throughout?
Done. The functions now consistently use "decimallongitude" and "decimallatitude". The cf_* functions also accept "lat" and "lng" (albeit issuing a warning), since this is the default in the Paleobiology Database.
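A minimal sketch of such a fallback (hypothetical helper, not the package's actual code):

```r
# Pick the longitude column, falling back to the Paleobiology Database
# name "lng" with a warning when the package default is missing.
pick_lon <- function(x, lon = "decimallongitude") {
  if (!lon %in% names(x) && "lng" %in% names(x)) {
    warning("column '", lon, "' not found, using 'lng' instead")
    lon <- "lng"
  }
  lon
}

fossils <- data.frame(lng = c(12.5, -68.2), lat = c(55.7, -16.5))
pick_lon(fossils)
#> Warning: column 'decimallongitude' not found, using 'lng' instead
#> [1] "lng"
```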
Several functions using Spatial* objects assume they already have a geographical (latlon) projection (e.g. cc_cap, L67-68). As a suggestion, given they are Spatial objects it might be better to read their projection (proj4string) directly and ensure it is indeed correct. Otherwise, give warning and optionally make reprojection on the fly? (e.g. using spTransform).
Done. Projection testing and reprojection for custom references implemented as suggested.
Also, as sf will become more and more popular consider enabling them also as arguments, not only sp objects (e.g. doing an internal conversion as(sf, "Spatial")).
Done. Implemented as suggested, enabling sf-style custom references.
cc_coun: Although there exists cc_sea, I think it would be good if this function flagged every coordinate not within the supposed country, even if those coordinates fall on sea.
Done. The function now flags marine records.
cc_sea gives TRUE if coordinates fall on land, and FALSE if coordinates fall on sea. I think this is confusing and may induce errors in future users (expecting TRUE if falling on sea, given the function name). To respect the convention across the whole package about TRUE/FALSE, maybe then rename the function as cc_land? Which would give TRUE when a location really falls on land.
Not done (see comment below). The naming scheme is such that the name of each function is associated with the problem it tests (e.g. cc_equal, cc_gbif), and since it is consistent, I hope it can stay as is.
cc_urb: download urbanareas from Natural Earth on demand?
Done.
plot.spatialvalid seems to use landmass.rda for plotting, but it could also use borders from ggplot2 (probably lighter and faster, and in case landmass is finally taken out of the package).
Done. Switched to ggplot2::borders for plotting, much faster now.
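For reference, a minimal sketch of the ggplot2::borders() approach with made-up points (borders() pulls its outlines from the maps package, which needs to be installed):

```r
library(ggplot2)

# hypothetical occurrence records
pts <- data.frame(decimallongitude = c(12.58, -68.92, 151.21),
                  decimallatitude  = c(55.68,  10.32, -33.87))

ggplot() +
  borders("world", colour = "grey60", fill = "grey90") +
  geom_point(data = pts,
             aes(x = decimallongitude, y = decimallatitude),
             colour = "darkred") +
  coord_fixed()
```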
clean_dataset: the example flags <- clean_dataset(test) gives warning: In FUN(X[[i]], ...) : Geographic spann to small, check 'min_span'
Done. Changed the geographic span of the example dataset.
Dependencies
The tidyverse package is listed in Suggests and effectively loaded in vignettes, but I think only a small subset of tidyverse packages are needed? Maybe better to use only those?
Done. Removed tidyverse from the vignettes and the DESCRIPTION file and replaced it with dplyr + ggplot2.
Is the Tutorials folder with rendered PDF vignettes necessary? Can use lighter HTML vignettes than PDF? Use pkgdown?
Done. Removed the PDF vignettes. Created a webpage using pkgdown (https://azizka.github.io/CoordinateCleaner/).
The PBDB vignette PBDB doesn't run for me. It actually seems a problem with the paleobioDB package? dat <- pbdb_occurrences(base_name = "Magnoliopsida", vocab = "pbdb", limit = 5000, show = c("coords", "phylo", "attr", "loc", "time", "rem")) Error in !sapply(data, function(l) is.null(l) | (ncol(l) == 0) | (nrow(l) == : invalid argument type
Done. Removed the download using paleobioDB.
The vignettes need a further revision to update links, pkg versions, typos, etc.
Done. Corrected spelling, removed references to the package version and reformatted links.
Thanks @azizka! @Pakillo , can you have a look at @azizka's answer?
Regarding the pkgdown website, I'd recommend grouping functions, cf. https://ropensci.github.io/dev_guide/building.html#function-grouping
Could you ask your question about the deprecation in rOpenSci's forum @azizka if @Pakillo can't answer? discuss.ropensci.org 😺
Thanks @azizka
Re: the problem with consulting deprecated functions, I think there is a typo here: https://github.com/azizka/CoordinateCleaner/blob/master/R/CoordinateCleaner-package.R#L84 (should be capital C)
Then, to access the help file for deprecated functions (for any package), I need to wrap the topic in backticks:

```r
?`CoordinateCleaner-deprecated`
```
Does that work for you?
@azizka Any news?
Hej,
yes it does, I fixed it as suggested. Thanks @Pakillo.
@pakillo @isteves are you both now happy with the package? If so can you check "The author has responded to my review and made changes to my satisfaction. I recommend approving this package." in your review? Thanks!
Hi,
Yes, I'm very happy with the package. Many thanks @azizka for the hard work, @isteves for your comments, and @maelle for your excellent editorial work.
A few final comments:
Travis is still giving an error due to undocumented arguments in cc_inst (see the Travis log). Please fix.
There are a number of broken URL links across function help files, vignettes... (e.g. many pointing to the now deleted wiki, or to the former extra-gazetteers folder). Please revise. (I wonder if there is a tool to check for broken links in R packages, @maelle?)
Not required, but I agree it would be very helpful to group functions by theme (e.g. cleaning records, datasets, fossils...) in the pkgdown website as @maelle recommended (https://ropensci.github.io/dev_guide/building.html#function-grouping)
The help file for deprecated functions in the pkgdown website appears missing (https://azizka.github.io/CoordinateCleaner/reference/CoordinateCleaner-deprecated.html). But it seems to be due to the same typo in the package name we saw before (missing capital C), because https://azizka.github.io/CoordinateCleaner/reference/Coordinatecleaner-deprecated.html does work for me. Please ensure that Coordinatecleaner-deprecated.Rd gets changed to CoordinateCleaner-deprecated after roxygenising.
At the Documentation section in the GitHub repo Readme, maybe point to rendered vignettes (i.e. https://azizka.github.io/CoordinateCleaner/articles/, at website built with pkgdown) rather than Rmd sources on github (https://github.com/azizka/CoordinateCleaner/tree/master/vignettes)?
Also at the Readme, when talking about other related packages, maybe point to the vignette comparing their functionality? (i.e. https://azizka.github.io/CoordinateCleaner/articles/comparison_other_software.html). I think this will particularly help people approaching the package for the first time and having to decide which one to use, or how CoordinateCleaner differs from the others.
Hope this helps.
Cheers!
I wonder if there is a tool to check for broken links in R packages
At least CRAN must have one... will ask in Slack.
Hi,
I went through @Pakillo's latest comments (thanks!):
Travis is still giving error due to undocumented arguments in cc_inst (see Travis log). Please fix.
Done. Travis build is succeeding now.
There are a number of broken URL links across function help files, vignettes... (e.g. many pointing to the now deleted wiki, or to the former extra-gazetteers folder). Please revise. (I wonder if there is a tool to check for broken links in R packages @maelle ?)
Done. Replaced the links to the wiki with links to the documentation page and removed the link to the extra gazetteers.
Not required, but I agree it would be very helpful to group functions by theme (e.g. cleaning records, datasets, fossils...) in the pkgdown website as @maelle recommended (https://ropensci.github.io/dev_guide/building.html#function-grouping)
Done. Grouped the functions according to the cleaning target.
The help file for deprecated functions in the pkgdown website appears missing (https://azizka.github.io/CoordinateCleaner/reference/CoordinateCleaner-deprecated.html). But it seems to be due to the same typo in the package name we saw before (missing capital C), because https://azizka.github.io/CoordinateCleaner/reference/Coordinatecleaner-deprecated.html does work for me. Please ensure that Coordinatecleaner-deprecated.Rd gets changed to CoordinateCleaner-deprecated after Roxygenising.
Done. CoordinateCleaner-deprecated is working now.
At the Documentation section in the GitHub repo Readme, maybe point to rendered vignettes (i.e. https://azizka.github.io/CoordinateCleaner/articles/, at website built with pkgdown) rather than Rmd sources on github (https://github.com/azizka/CoordinateCleaner/tree/master/vignettes)?
Done. Changed as suggested
Also at the Readme, when talking about other related packages, maybe point to the vignette comparing their functionality? (i.e. https://azizka.github.io/CoordinateCleaner/articles/comparison_other_software.html). I think this will particularly help people approaching the package for the first time and having to decide which one to use, or how CoordinateCleaner differs from the others.
Done. Added the links.
How will things proceed now?
Cheers,
Alex
We'll wait for @isteves to chime in when she has time, and then we should be close to the end of this process. 😺
Hey all,
I took another brief look through the package. Rather than getting deep into the code, I checked it out the way I would any new package--I looked through the pkgdown docs, tested out the examples, and checked out bits and pieces of the documentation.
I'm a fan of the pkgdown website. I think it's a great addition, and it made it easy to re-orient myself to the package. Specifically, I liked:
(note--mostly just checked the pkgdown version of these...)
- "In the R-environment the scrubr an biogeo offer cleaning approaches complementary to CoordinateCleaner."
  Sidenote: My personal preference would be to avoid starting a sentence with a lower-case package name, but it's up to you: scrubr combines basic geographic cleaning… -> The scrubr package combines…
- https://azizka.github.io/CoordinateCleaner/articles/Background_the_institutions_database.html#data-compilation Image not showing. It's trying to link to this
- "Identifying erroneous coordinates using"
- "We based the choice of tests on common problems observed in biological collection databases (see for example (???))"
- "poison" 💀
- I like the output of clean_coordinates, but I found it strange that some columns from the initial data were dropped. I was expecting behavior more similar to broom::augment.
- The documentation of clean_dataset confused me. Given the description ("Identifies potentially problematic coordinates..."), I was expecting to get flagged coordinates as an output, rather than the output of a test. Something like this may help: "Tests for problems associated with coordinate conversions and rounding...." Perhaps documenting the output with @return could also help.
- cc_sea downloads something each time I run it. Perhaps you can download the relevant files just the first time the user runs the function, with something along the lines of:

```r
path <- file.path(system.file(package = "CoordinateCleaner"), "sea.EXT")

if (!file.exists(path)) {
  download_sea(path)
}

sea <- load_sea(path)
```
- I like your example of the cc_* functions + the pipe. An example where flags are added as extra columns may also be worth including:

```r
exmpl %>%
  as_tibble() %>%
  mutate(val = cc_val(., value = "flagged"),
         sea = cc_sea(., value = "flagged"))
```

- I recently learned about the @inheritParams/@inherit tags, which was such a game-changer! You can define your parameter(s) once and use `@inheritParams fxn_name` for any other function that uses the parameter(s) (and same for sections). Definitely give it a try at some point if you haven't already 😄

I think it's almost good to go!
-Irene
Many thanks @isteves !
Thanks @isteves!
I have now included almost all suggestions; please see a point-by-point reply below. I am very sorry for the typos; the documentation has become so large now that some always slip through.
Thanks especially for the hint with @inherit. I am also continuously working to increase the test coverage. Cheers,
Alex
In the R-environment the scrubr an biogeo offer cleaning approaches complementary to CoordinateCleaner.
Fixed. Changed to ‘and’.
Sidenote: My personal preference would be to avoid starting a sentence with a lower-case package name, but it’s up to you: scrubr combines basic geographic cleaning… -> The scrubr package combines…
Fixed. Changed as suggested.
Image not showing. It's trying to link to this
Fixed. Removed figure and replaced by a link.
Identifying erroneous coordinates using
Fixed. Changed to ‘with’.
We based the choice of tests on common problems observed in biological collection databases (see for example (???))
Fixed. Fixed citation.
poison :skull:
Fixed. Changed to ‘Poisson’. :fish:
Increase point size? I thought I was looking at a blank plot at first...
Not fixed. I completely agree that the points are hard to see, which is annoying. However, this plot shows the analysis matrix as used during the test. So in this case it was a 1000 x 1000 matrix, hence the resulting diagnostic plot has the same resolution and individual cells are plotted very small. I think it is preferable to live with this visualization to keep the link between the test and the output plot, also in the vignette.
I like the output of clean_coordinates, but I found it strange that some columns from the initial data were dropped. I was expecting behavior more similar to broom::augment.
Fixed. Changed as suggested. The “spatialvalid” output of clean_coordinates and clean_fossils now add the test columns to x, similar to broom::augment.
The documentation of clean_dataset confused me. Given the description ("Identifies potentially problematic coordinates..."), I was expecting to get flagged coordinates as an output, rather the output of a test. Something like this may help: "Tests for problems associated with coordinate conversions and rounding...." Perhaps documenting the output with @return could also help
Fixed. Changed the description as suggested, and improved the documentation of the output with @return.
for the cc_* functions, the default behavior is to filter out "bad" records, but the message says "Flagged XX records." For me, "flagged" ~ "marked." Changing the wording to "removed"/"omitted"/"filtered out" would more clearly indicate to me that the function removes records.
Fixed. Changed the wording in the function title to “Identify” and in the description to “Removes or flags”.
cc_sea downloads something each time I run it. Perhaps you can download the relevant files just the first time the user runs the function with something along the lines of:

```r
path <- file.path(system.file(package = "CoordinateCleaner"), "sea.EXT")
if (!file.exists(path)) { download_sea(path) }
sea <- load_sea(path)
```
Fixed. Implemented as suggested for cc_sea and cc_urb
I like your example of the cc_* functions + the pipe. An example where flags are added as extra columns may also be worth including: exmpl %>% as_tibble() %>% mutate(val = cc_val(., value = "flagged"), sea = cc_sea(., value = "flagged"))
Fixed. Added the suggested example to the “Quickstart_Flagging_problematic_coordinates_in_a_nutshell” and Tutorial_Cleaning_GBIF_data_with_CoordinateCleaner vignettes.
I recently learned about the @inheritParams/@inherit tags, which was such a game-changer! You can define your parameter(s) once and use @inheritParams fxn_name for any other function that uses the parameter(s) (and same for sections). Definitely give it a try at some point if you haven't already
Fixed. Used @inheritParams and @inherit throughout the documentation. Very nice, thanks!
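For reference, a small sketch of the pattern with hypothetical functions (not the package's actual ones):

```r
#' Flag invalid coordinates
#'
#' @param x a data.frame with occurrence records.
#' @param lon,lat names of the longitude and latitude columns in x.
#' @return A logical vector, TRUE for records that pass the test.
#' @export
flag_valid <- function(x, lon = "decimallongitude", lat = "decimallatitude") {
  !is.na(x[[lon]]) & !is.na(x[[lat]])
}

#' Flag records near country capitals
#'
#' @inheritParams flag_valid
#' @param buffer radius around each capital, in degrees.
#' @inherit flag_valid return
#' @export
flag_capitals <- function(x, lon = "decimallongitude", lat = "decimallatitude",
                          buffer = 0.1) {
  # x, lon and lat are documented once in flag_valid() and inherited here
  rep(TRUE, nrow(x))  # placeholder body for illustration
}
```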
Hey @azizka, it looks great! @maelle -- anything else we should do?
Summary
What does this package do? (explain in 50 words or less): Identify problematic records in large databases of biological and palaeontological collections, to improve data quality for analyses in biogeography, ecology and conservation.
Paste the full DESCRIPTION file inside a code block below:
URL for the package (the development repository, not a stylized html page): https://github.com/azizka/CoordinateCleaner
Please indicate which category or categories from our package fit policies this package falls under and why (e.g., data retrieval, reproducibility; if you are unsure, we suggest you make a pre-submission inquiry):
geospatial data, because the package deals with improving data quality of occurrence records of biological specimens
reproducible research, because the package replaces potentially badly documented ad-hoc decisions from GUI GIS with clearly defined functions
data munging, because the package processes commonly used geospatial data (geographic coordinates)
Who is the target audience and what are scientific applications of this package? Anybody using geographic coordinates from biological collections on a large scale, thus mostly researchers in biogeography, (macro-)ecology, evolutionary biology, and conservation practitioners
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? Yes, scrubr, see this pre-submission enquiry: https://github.com/ropensci/onboarding/issues/199
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted. https://github.com/ropensci/onboarding/issues/199, @sckott
Requirements
Confirm each of the following by checking the box. This package:
Publication options

paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.

Detail
- [x] Does `R CMD check` (or `devtools::check()`) succeed? Paste and describe any errors or warnings:
- [x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

Exceptions:
- NEWS file: there is one now, but it only started with the latest version.
- Package name: unfortunately, the name contains capital letters (CoordinateCleaner), but it is on CRAN already.
- Function naming: functions for individual tests are snake_case and pipe compatible, wrapper functions around all tests are CamelCase.
- Documentation: not built with roxygen.
If this is a resubmission following rejection, please explain the change in circumstances: Nope
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
SaraVarela, sckott