Thanks for your submission @azizka! :grin: Initial checks revealed a few issues that you should solve before I assign reviewers. Do not hesitate to ask any question you might have!
Your package doesn't use roxygen2 yet for documentation. Our packaging guide says we strongly encourage it, but we'd actually like to start enforcing it, partly to help future contributors. To convert your existing documentation to roxygen2 documentation you can use the Rd2roxygen package. When using roxygen2 you won't have to update NAMESPACE by hand.
What is the licence on extra_gazetteers? Could these files live in a separate data package? (question, not request for change)
Why keep the tar.gz and zip of your package in the repo, especially since you already make them available via GitHub releases?
Here is the `goodpractice::gp()` output with comments.
It is good practice to

✖ write unit tests for all functions, and all package code in general. 62% of code lines are covered by test cases.

    R/cc_cap.R:18:NA
    R/cc_cap.R:19:NA
    R/cc_cap.R:29:NA
    R/cc_cen.R:19:NA
    R/cc_cen.R:21:NA
    ... and 580 more lines

You do not need to reach 100% code coverage but you can increase it a bit.
✖ not use "Depends" in DESCRIPTION, as it can cause name clashes, and poor interaction with other packages. Use "Imports" instead.

Feel free to ignore this if you need the sp Depends. By the way, is there any reason why you don't use the more modern sf? (question, not request for change)
✖ omit "Date" in DESCRIPTION. It is not required and it gets invalid quite often. A build date will be added to the package when you perform `R CMD build` on it.

Please remove Date from there.
✖ add a "URL" field to DESCRIPTION. It helps users find information about your package online. If your package does not have a homepage, add an URL to GitHub, or the CRAN package page.

✖ add a "BugReports" field to DESCRIPTION, and point it to a bug tracker. Many online code hosting services provide bug trackers for free, e.g. https://github.com, https://gitlab.com, etc.

Make it easy to find your GitHub repo by adding these links.
✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try to make your lines shorter than 80 characters.

    R\cc_cap.R:1:1
    R\cc_cap.R:24:1
    R\cc_cen.R:1:1
    R\cc_coun.R:1:1
    R\cc_coun.R:17:1
    ... and 210 more lines

Maybe styler can help with this.
✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data. Consider using vapply() instead.

    R\WritePyRate.R:88:68

Please change sapply in favor of vapply here.
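For illustration, a minimal sketch of the difference on hypothetical data (not the actual code in WritePyRate.R):

```r
taxa <- c("Abies alba", "Picea abies", "Pinus sylvestris")

# sapply() may return a vector or a list depending on its input,
# whereas vapply() declares the expected type and length of each result.
n_words <- vapply(strsplit(taxa, " "), length, FUN.VALUE = integer(1))
n_words
#> [1] 2 2 2
```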
✖ not import packages as a whole, as this can cause name clashes between the imported packages. Instead, import only the specific functions you need.

See this part of Hadley Wickham's R package book.
✖ not use exportPattern in NAMESPACE. It can lead to exporting functions unintentionally. Instead, export functions that constitute the external API of your package.

When using roxygen2 syntax you'll use the @export tag before each function to be exported.
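For example, a minimal roxygen2 sketch (the function below is hypothetical, not one of the package's actual tests); roxygen2 then writes export(flag_equal) into NAMESPACE instead of an exportPattern():

```r
#' Flag records with identical longitude and latitude
#'
#' @param x a data.frame with the columns decimallongitude and decimallatitude.
#' @return A logical vector, TRUE for records that pass the test.
#' @export
flag_equal <- function(x) {
  x$decimallongitude != x$decimallatitude
}
```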
✖ fix this R CMD check NOTE: found 1 marked Latin-1 string

I'm actually not sure what this means.
✖ fix this R CMD check NOTE: Found the following hidden files and directories: .travis.yml. These were most likely included in error. See section 'Package structure' in the 'Writing R Extensions' manual.

Please put such files and folders in .Rbuildignore.
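One way to do that, assuming the usethis package is available:

```r
# Adds an anchored, escaped regular expression for the file to .Rbuildignore,
# so that R CMD build leaves .travis.yml out of the built package.
usethis::use_build_ignore(".travis.yml")
```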
Please update this thread once you've done the changes, or as soon as you have a question!
Reviewers: @isteves @pakillo Due date: 2018-06-07
@azizka do you have any question or comment? ☺
Hi @maelle, sounds all good, thanks. I am working to include most of your suggestions. I'll defend my PhD next week, so I'm a little slow working on the package at the moment, but I will tackle this full on afterwards.
Oh good luck with your defence!
@azizka I hope your defence went well! I'm taking the risk to say congrats! 🎊
Did you have time to work on the package a bit?
@maelle yes, it worked out fine, thanks for the congratulations and your patience! I am back to work this week and am working on the package now.
🎊 Grattis! 🎊
and cool!
Hi @maelle,
Please find below my replies to your comments. I hope everything is covered; please let me know what else I can do. Thanks again for the patience :-).
Note: the most recent version is working; I'll update the badges as soon as the problems with Travis are solved (https://github.com/travis-ci/travis-build/pull/1382)
Replies:
Your package doesn't use roxygen2 yet for documentation
Done - Migrated documentation and namespace generation to roxygen2
What is the licence on extra_gazetteers? Could these files live in a separate data package?
The extra gazetteers are open source, they could live in a separate data package, but I am not sure that this is necessary/useful since they have very specialized applicability.
Why have the tar.gz and zip of your package in the repo especially since you already make them available via the use of GitHub releases?
Done - Removed the .tar.gz and zip
You do not need to reach 100% code coverage but you can increase it a bit.
It is at 74% now
Feel free to ignore this if you need the sp Depends. By the way, is there any reason why you don't use the more modern sf? (question, not request for change)
Done - moved sp from 'Depends' to 'Imports'. Yes, sf would be great since it's faster and tidy, but unfortunately some of the dependencies do not yet support sf formats (e.g. raster, geosphere).
Please remove Date from there
Done - Removed the date from the description file
Make it easy to find your GitHub repo by adding these links.
Done - Added the urls to the description file
Avoid long code lines, it is bad for readability.
Done – cut at 80
Please change sapply in favor of vapply here
Done
not import packages as a whole, instead, import only the specific functions you need
Done
When using roxygen2 syntax you'll use the export tag before each function to be exported.
Done – switched to roxygen2
fix this R CMD check NOTE: found 1 marked Latin-1 string
I could not reproduce this note, do you have additional information on the operating system and check settings, etc? I suspect it might be related to a non-ASCII character in one of the datasets in the package, but I am not entirely sure either.
Please put such files and folders in .Rbuildignore.
Done, the .travis.yml is part of .Rbuildignore now
@azizka thanks for all your work! Am going to assign reviewers in a bit. Two last comments before the reviews:
could you please add a code of conduct to your repo? cf our guidelines
could you please respond to the two issues opened in your repo? If these are bugs, please solve them before the reviews and update this thread. :-)
:wave: @isteves @Pakillo! Thanks for accepting to review this package! Your reviews are due 2018-06-07 but @pakillo I've already noted that yours might be late (@azizka the reviewers of your package were lined up a while ago :wink:)
Here is our reviewer template and our reviewer guide.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation: URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Estimated hours spent reviewing: 7
Hi @azizka - great, comprehensive package! I'm impressed! Here are some comments related to the sections above:
- Examples are currently included in the Tutorials folder & wiki; consider moving them to a vignettes folder so that users can access them locally using browseVignettes()
- urbanareas: perhaps specify that the file path is not the one shown: instead of `load("extra_gazetteers/urbanareas.rda")`, `load("your/path/urbanareas.rda")`
- WritePyRate: got an error when I tried `WritePyRate(x = exmpl, fname = "test", status = status)`: `Error in [.data.frame(x, , taxon) : undefined columns selected`
- A plot call is not present in the example for plot.spatialvalid
- library(CoordinateCleaner) calls not needed within individual test scripts
- Test_fossillevel_cleaning.R filename should be all lower case for consistency
- I suggest using more descriptive names for context()
- Using devtools::test(), I got the following warning - `encoding` is deprecated; all files now assumed to be UTF-8 - as well as these:

        test_wrapper_functions.R:16: warning: (unknown) running GBIF test, flagging records around Copenhagen
        test_wrapper_functions.R:28: warning: CleanCoordinates countries argument produces correct output running GBIF test, flagging records around Copenhagen
        test_wrapper_functions.R:96: warning: fossil wrapper cleaning works countries missing, countrycheck set to FALSE
        test_wrapper_functions.R:96: warning: fossil wrapper cleaning works running GBIF test, flagging records around Copenhagen
        test_wrapper_functions.R:97: warning: fossil wrapper cleaning works countries missing, countrycheck set to FALSE
        test_wrapper_functions.R:97: warning: fossil wrapper cleaning works running GBIF test, flagging records around Copenhagen
Here are some additional comments. Let me know if anything is unclear!
All the text/documentation looks great on the whole! The devtools::spell_check() caught two typos that should be fixed:

    Herbariorum    institutions.Rd:13
    THe            CleanCoordinates.Rd:101

(The other "spelling errors" that it caught seem to be fine.)
For the most part, the package is both clean and organized. I especially like the cc_* functions, which each tackle a specific type of error and have consistent styles.

Among the other files, there was a mix of style/naming conventions. Is there a reason for that? (asking honestly here, since I may be missing something) I understand that for methods.spatialvalid you must use the . convention to work with classes, but I'm unsure why you switched to capitals for CleanCoordinates* and WritePyRate. These functions (as well as tc_* and the tutorials) also include a lot of variables with . rather than _. Changing the variable names so that they use a consistent style would enhance the readability/usability of the code.
In the CleanCoordinates* functions, it seems that you'd be able to greatly reduce the amount of code through some of these structural changes:

- use `...` to pass arguments specified in CleanCoordinates* to the cc_* functions called within it
- rather than writing out TRUE/FALSE for each test, perhaps use a single argument (`tests`) that can take a vector of desired tests (`c("sea", "gbif", "equ")`)
- set argument defaults to NULL directly in the argument list (`capitals.ref = NULL`), instead of in an extra step within the function
- currently, a variable is returned for each test. If FALSE is chosen, the returned variable contains only NA's; if TRUE is chosen, the specified test is run and the flagged output is saved. This results in code repetition that may be avoided by (a) specifying the NA output as the default at the beginning of the code and (b) using a data frame/list structure to store the results. To clarify:

Current code:
```r
if (capitals) {
  cap <- cc_cap(x,
    lon = lon, lat = lat, buffer = capitals.rad, ref = capitals.ref,
    value = "flags", verbose = verbose
  )
} else {
  cap <- rep(NA, dim(x)[1])
}
```

Proposed change:

```r
df <- data.frame(matrix(NA, nrow = nrow(x), ncol = 12))
colnames(df) <- c("val", "equ", "zer", "cap", "cen", "sea", "urb", "con",
                  "otl", "gbf", "inst", "dpl")

if (capitals) {
  df$cap <- cc_cap(x, value = "flags", ...)
}
# and so on for the other tests
# no `else {cap <- rep(NA, dim(x)[1])}` required
```
- the CleanCoordinates* functions also have warnings/checks that perhaps are more appropriate within the cc_* functions (example)
- if `is.null(countries)` is TRUE, then countries does not need to be set to NULL again:

```r
if (is.null(countries)) {
  countries <- NULL
  if (countrycheck) {
    countrycheck <- FALSE
    warning("countries missing, countrycheck set to FALSE")
  }
}
```
- as far as I can tell, cc_dupl can be used in place of this code
- In WritePyRate, since `fname` and `status` are required, consider moving them earlier in the argument list (positions 2, 3) and not setting a default value (currently NULL)
In cc_val (starting here), there is redundancy in as.numeric/as.character/suppressWarnings that could be reduced. For example, like this:

```r
x[[lon]] <- suppressWarnings(as.numeric(as.character(x[[lon]])))
x[[lat]] <- suppressWarnings(as.numeric(as.character(x[[lat]])))

out <- list(is.na(x[[lon]]),
            is.na(x[[lat]]),
            abs(x[[lon]]) > 180,
            abs(x[[lat]]) > 90)
```
Currently, the map of the different tests distinguishes clean and flagged points by color, and distinguishes tests by shape. Are the different tests important? If not, perhaps they can be grouped into a "flagged" category. If yes, then perhaps color could be used to better distinguish them.
The description of `value` in the Rd files (for example, here) can be clarified by moving ("flags") up in the sentence:

> Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flags”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”

Also, it may help to stick with options for `value` that are both verbs or both adjectives (but this is just a minor suggestion). For example: clean/flag, clean/flagged, cleaned/flagged.
`??CoordinateCleaner` has strange output:

    c("\Sexpr[results=rd,stage=build]tools:::Rd_package_title(\"#1\")", "CoordinateCleaner") Automated Cleaning of Occurrence Records from Biological Collections

I'm not sure where this error comes from.
Many thanks for your review @isteves, great points! 😸
No maintainer specified in DESCRIPTION, but maybe ok? It is ok because it's generated automatically from Authors@R when building the package; it's actually better not to have it.
@azizka You can wait for @Pakillo's review before responding, or respond before that, since @Pakillo's review was going to be a bit late, so as you prefer.
Thanks again @isteves!
@isteves I forgot to ask you to fill the "Estimated hours spent reviewing:" field in the review template, thanks in advance 🙏
@Pakillo :wave: will you soon have time to review the package? 😺
Hej @maelle and @isteves,
thanks for the comprehensive, helpful and very constructive review. I managed to address most of the issues, and argued back or asked @maelle for help on some of them. Please find a point-by-point reply below (my replies in italics). Please let me know if you have any comments/suggestions.
Thanks again!
Examples are currently included in the Tutorials folder & wiki; consider moving them to a vignettes folder so that users can access them locally using browseVignettes()
Done, moved tutorials to vignettes
urbanareas: perhaps specify that file path is not the one shown: instead of load("extra_gazetteers/urbanareas.rda"), load("your/path/urbanareas.rda")
Done, changed as suggested
WritePyRate: got an error when I tried WritePyRate(x = exmpl,fname = "test", status = status): Error in [.data.frame(x, , taxon) : undefined columns selected
Done, fixed the code of the example (replaced "identified_name" by "accepted_name")
A plot call is not present in the example for plot.spatialvalid.
Done, added a plot call to the example.
No maintainer specified in DESCRIPTION, but maybe ok?
No changes, see @maelle's comment.
No contribution guidelines in the README or CONTRIBUTING
Done, added a CONTRIBUTING.md
library(CoordinateCleaner) calls not needed within individual test scripts
Done, removed the library() calls
Test_fossillevel_cleaning.R filename should be all lower case for consistency
Done, changed to lowercase
I suggest using more descriptive names for context()
Done
Using devtools::test(), I got the following warning.
Done, switched from a warning to a message for cc_gbif
according to the guidelines, badges in README.md should be below the package name (unsure how important this is)
Done, moved badges downwards
include "how the package compares to other similar packages and/or how it relates to other packages" in README.md
I am not sure what exactly is needed here. I could add the table comparing CoordinateCleaner with scrubr, from the presubmission query (https://github.com/ropensci/onboarding/issues/199), but this seems excessive. Is this really necessary @maelle?
add citation info to end of in README.md
Done, added a suggested citation, will change to final once the corresponding manuscript is published
"Avoid starting the description with the package name or This package ..." - not sure if this also applies to the README package description, but you may want to change it to match the description in the DESCRIPTION file.
Done, synchronized the first sentences among DESCRIPTION and Readme.md
All the text/documentation looks great on the whole! The devtools::spell_check() caught two typos that should be fixed: Herbariorum institutions.Rd:13 THe CleanCoordinates.Rd:101 (The other "spelling errors" that it caught seem to be fine.)
Done, fixed number 2; number 1 is not a typo but the correct Latin spelling of the name of the institution "Index Herbariorum"
For the most part, the package is both clean and organized. I especially like the cc functions, which each tackle a specific type of error and have consistent styles. Among the other files, there was a mix of style/naming conventions. Is there a reason for that? (asking honestly here, since I may be missing something) I understand that for methods.spatialvalid you must use the . convention to work with classes, but I'm unsure why you switched to capitals for CleanCoordinates* and WritePyRate. These functions (as well as tc* and the tutorials) also include a lot of variables with . rather than _. Changing the variable names so that they use a consistent style would enhance the readability/usability of the code.
Yes, that is a good point. The wrapper functions originally used CamelCase to show that they differed from the individual tests. I agree that this was inconsistent, and switched to a consistent underscore_case naming scheme. Since this might cause some compatibility problems, I increased the version number to 2.x with the latest version.
use ... to pass arguments specified in CleanCoordinates to the cc_ functions called within it
I'd prefer not to do this because (i) it would mean changing the argument names in all the cc_* functions and somewhat break their consistency; for instance the "buffer" argument in cc_cap would need to become cap_buffer and in cc_cen cen_buffer, and (ii) while it is true that the list of arguments is long, it is currently obvious even to inexperienced users of CleanCoordinates that the buffers can be changed and should be thought about.
rather than writing out TRUE/FALSE for each test, perhaps use a single argument (tests) that can take a vector of desired tests (c("sea", "gbif", "equ"))
Done, switched to a character vector for all wrapper functions
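A hypothetical call illustrating the new interface (a tiny made-up data set; the exact test names and defaults here are assumptions, see `?clean_coordinates` for the real ones):

```r
library(CoordinateCleaner)

# a small hypothetical set of occurrence records
exmpl <- data.frame(species          = c("A", "B", "C"),
                    decimallongitude = c(12.58, 0, 30.2),
                    decimallatitude  = c(55.68, 0, 30.2))

# run only the selected tests, passed as a character vector
flags <- clean_coordinates(x = exmpl,
                           tests = c("capitals", "equal", "seas", "zeros"))
```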
set argument defaults to NULL directly in the argument list (capitals.ref = NULL), instead of in an extra step within the function
Done for all wrapper functions
currently, a variable is returned for each test. If FALSE is chosen, then the returned variable contains only NA's; if TRUE is chosen, the specified test is run and the flagged output is saved. This results in code repetition that may be avoided by (a) specifying the NA output as the default in the beginning of the code and (b) using a data frame/list structure to store the results. To clarify:
Done for all wrapper functions
the CleanCoordinates functions also have warnings/checks that perhaps are more appropriate within the cc_ functions (example)
I'd prefer to keep them where they are, since this way the error message can be more informative for the user of the CleanCoordinates wrapper function, i.e. it can state directly how to fix the problem and which tests to drop if they are causing the error. For instance, if no country information is available, just omitting the countries test from clean_coordinates might be a good solution, whereas cc_coun absolutely requires the country information.
as far as I can tell, cc_dupl can be used in place of this code
Done, replaced by cc_dupl
In WritePyRate, since fname and status are required, consider moving them earlier in the argument list (positions 2, 3) and not setting a default value (currently NULL)
Done, moved the arguments up in the list and removed the defaults.
In cc_val (starting here), there is redundancy in as.numeric/as.character/suppressWarnings that could be reduced.
Done, changed code as proposed
Currently, the map of the different tests distinguishes clean and flagged points by color, and distinguishes tests by shape. Are the different tests important? If not, perhaps they can grouped into a "flagged" category. If yes, then perhaps color could be used to better distinguish them.
Done. The marking of the individual tests can be switched on or off with the details option of plot.spatialvalid. I think it can be interesting to visualize which tests failed, for instance to see if there are spatial patterns in data quality, but I agree that it might also be confusing. I switched the default for details to FALSE, so that by default there is now only a colour difference between clean and flagged records.
The description of value in the Rd files (for example, here) can be clarified by moving ("flags") up in the sentence: Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flags”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”
Done, moved the part in the brackets as suggested
Also, it may help to stick with options for value that are both verbs or both adjectives (but this is just a minor suggestion). For example: clean/flag, clean/flagged, cleaned/flagged.
Done, changed to clean/flagged
??CoordinateCleaner has strange output: c("\Sexpr[results=rd,stage=build]tools:::Rd_package_title(\"#1\")", "CoordinateCleaner") Automated Cleaning of Occurrence Records from Biological Collections. I'm not sure where this error comes from.
I have no clue what is going on here. Any ideas @maelle?
An additional question:
The latest build passes without problems, but fails on Travis (see below). I assume the vignettes are too RAM-heavy. I tried to solve this, but couldn't. Do you have any experience with this problem? Thanks!
Warning in system(command) : system call failed: Cannot allocate memory Warning in system(command) : error in running command Error: processing vignette 'Cleaning_PBDB_fossils_with_CoordinateCleaner.Rmd' failed with diagnostics: pandoc document conversion failed with error 127
Thanks @azizka!
The table comparing CoordinateCleaner with scrubr might be excessive, but adding a few lines about how CoordinateCleaner fits into the occurrence data R tooling, and where scrubr might be used in combination, would be good.

Did you generate the package level documentation using usethis::use_package_doc()?
Could you ask the Travis question in Slack to see whether someone had the same issue? I'm afraid I can't help directly.
Alright, I added the following text to the Readme.md (including links): CoordinateCleaner can be particularly useful to ensure geographic data quality when using data from GBIF (e.g. obtained with rgbif) for historical biogeography (e.g. with BioGeoBEARS or phytools), automated conservation assessment (e.g. with speciesgeocodeR or conR) or species distribution modelling (e.g. with dismo or sdm). See scrubr and taxize for complementary taxonomic cleaning.
No, I migrated the documentation to Roxygen2 with Rd2roxygen as you suggested earlier.
Great, I'll ask on slack.
Thanks :-)
:wave: @pakillo could you please get your review in soon? Thank you! 😺
I'm very sorry for the delay -- the end of term workload is being rather crazy. I expect to have it ready this week. Apologies
Thanks for the update @Pakillo!
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation: URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS:

- [ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Estimated hours spent reviewing: 8
Hi all,
Many thanks @azizka for developing CoordinateCleaner, @maelle for the invitation to review (and the patience!), and @isteves for sharing the review and breaking the ice :)
I love CoordinateCleaner. I have already used it for my own research and it really fills a gap in terms of cleaning occurrence data (or any coordinate data, really). It brings many new functionalities, it is well thought out, relatively easy to use, and it works. So thanks a lot.
I agree with all comments by @isteves above. I think they have greatly contributed to making the package even nicer and more consistent. Some remaining issues or comments:
Installation of current version from GitHub takes a 93 MB download, which seems maybe too much? Apparently ~70 MB of it is git history and files (.git folder). I wonder if there is some big file hidden in history, which could perhaps be removed (e.g. using https://rtyley.github.io/bfg-repo-cleaner/). That package size was not a problem for me but could be for some users with slow connections. I think rOpenSci experts can advise more on this, or confirm this is actually OK.
Apart from git history, there are some big files in the current version of the repo. First, two largish PDF files of rendered vignettes in the Tutorials folder, which can probably be removed? I am not sure about rOpenSci conventions here, but maybe you can store just HTML rendered versions of the vignettes (which weigh much less, ~110 KB for the GBIF vignette) in a gh-pages branch so they can be seen nicely. Or, even better, use pkgdown to create a website for your package, including not only rendered vignettes but also rendered examples, etc.
The second set of large files are data files (particularly GIS datasets like landmass.rda, >2 MB, or urbanareas.rda, >10 MB). As most of these datasets actually come from Natural Earth, I think they could perhaps be downloaded on purpose when needed, probably using the rnaturalearth package. Also, some of these datasets are directly available in the rnaturalearthdata and rnaturalearthhires packages (all already part of rOpenSci), so you could maybe just use them so as not to duplicate files.
The Readme is good, just a few typos and minor comments. I have also made a pull request fixing some typos I found along the way.
Typo in badge linking to Travis (fixed in pull-request). Also, Travis is giving build error, but I see you are already on track.
Provide hyperlinks to NEWS, vignettes, and contributing files, so the reader just has to click from the README to see them?
When discussing related packages, I would also mention biogeo (https://github.com/cran/biogeo).
Use README.Rmd? So the reader can see the output of the example code in README without having to run it by themselves. Maybe do not show all output, but some of it.
Current example code gives errors. I think it needs updating after the API changes:

- `dsl <- clean_coordinatesDS(exmpl)` gives an error (changed to `dsl <- clean_dataset(exmpl)` in the pull request)
- `clean_coordinatesFOS` must be changed to `clean_fossils`
- `dc_ddmm` does not exist. Change to `cd_ddmm`?
- `tc_range` does not exist. Has it changed to `cf_range`?
As explained above, consider removing large data files from the package? Use rnaturalearth and associated packages instead (as some functions, e.g. cc_coun, already do). Maybe only buffland_1deg (currently in extra_gazetteers) would need to be included.
Looks like data in capitals.rda (and parts of centroids.rda) are already contained in countryref.rda. If possible, avoid duplication.
institutions: If there is a script used to build this data frame, it might be good to include it (e.g. in a data-raw folder) for the sake of reproducibility and the possibility of making further changes in the future.
urbanareas.rda is >10 MB big, and countryborders >2 MB. As all these seem to come from Natural Earth, consider using rnaturalearth to access them when needed?

institutions_utf8 seems duplicated with institutions.rda in the data folder? (only having 1 row more). Keep only one?

I think the documentation (Rd) files in this folder are not available to the user. Move all the data to the data folder, and make sure all are documented?
Documentation (man folder): the Rd file for CoordinateCleaner-package.Rd needs to be revised. Seems the conversion to roxygen hasn't worked well.
I think the help files for CleanCoordinates, CleanCoordinatesDS and CleanCoordinatesFOS should mark them as Deprecated, see https://ropensci.github.io/dev_guide/evolution.html#functions-deprecate-defunct.
Default longitude and latitude arguments for the fossil functions (cf_outl, cf_range...) are "lng" and "lat", respectively, while the params explanation states the defaults are "decimallongitude" and "decimallatitude" (as in the rest of the package). Use the latter throughout?
Several functions using Spatial* objects assume they already have a geographical (latlon) projection (e.g. cc_cap, L67-68). As a suggestion, given they are Spatial objects it might be better to read their projection (proj4string) directly and ensure it is indeed correct. Otherwise, give a warning and optionally reproject on the fly? (e.g. using spTransform). Also, as sf will become more and more popular, consider enabling sf objects as arguments too, not only sp objects (e.g. doing an internal conversion as(sf, "Spatial")).
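A rough sketch of what such a check could look like (the helper below is hypothetical, not the package's implementation; reprojecting Spatial* objects additionally needs rgdal, and the sf-to-sp coercion method is registered by the sf package):

```r
library(sp)

check_ref <- function(ref) {
  # accept sf objects by coercing them to sp
  if (inherits(ref, "sf")) {
    ref <- as(ref, "Spatial")
  }
  wgs84 <- CRS("+proj=longlat +datum=WGS84")
  if (is.na(proj4string(ref))) {
    warning("no CRS found for the reference, assuming WGS84 lon/lat")
    proj4string(ref) <- wgs84
  } else if (!grepl("longlat", proj4string(ref), fixed = TRUE)) {
    # reproject on the fly instead of silently assuming lon/lat coordinates
    ref <- spTransform(ref, wgs84)
  }
  ref
}
```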
cc_coun: Although there exists cc_sea, I think it would be good if this function flagged every coordinate not within the supposed country, even if those coordinates fall on sea.
cc_sea gives TRUE if coordinates fall on land, and FALSE if coordinates fall on sea. I think this is confusing and may induce errors in future users (expecting TRUE if falling on sea, given the function name). To respect the convention across the whole package about TRUE/FALSE, maybe rename the function as cc_land? Which would give TRUE when a location really falls on land.
cc_urb: download urbanareas from Natural Earth on demand?
plot.spatialvalid seems to use landmass.rda for plotting, but it could also use borders from ggplot2 (probably lighter and faster, and useful in case landmass is finally taken out of the package).
clean_dataset: the example `flags <- clean_dataset(test)` gives a warning: `In FUN(X[[i]], ...) : Geographic spann to small, check 'min_span'`
The tidyverse package is listed in Suggests and actually loaded in the vignettes, but I think only a small subset of tidyverse packages is needed? Maybe better to use only those?

Is the Tutorials folder with rendered PDF vignettes necessary? Could you use lighter HTML vignettes instead of PDF? Use pkgdown?
The PBDB vignette doesn't run for me. It actually seems to be a problem with the paleobioDB package?

```r
dat <- pbdb_occurrences(base_name = "Magnoliopsida", vocab = "pbdb", limit = 5000,
                        show = c("coords", "phylo", "attr", "loc", "time", "rem"))
#> Error in !sapply(data, function(l) is.null(l) | (ncol(l) == 0) | (nrow(l) == :
#>   invalid argument type
```

The vignettes need a further revision to update links, pkg versions, typos, etc.
Thanks a lot for your thorough review @Pakillo! 😀
@azizka now both reviews are in!
@azizka with regards to @Pakillo's tidyverse comment, I saw a post about this exactly on Twitter recently: https://twitter.com/dataandme/status/1009412422476713984?lang=en
The post it links to: https://www.tidyverse.org/articles/2018/06/tidyverse-not-for-packages/
Thanks @isteves ! Always good to be backed by Mara and Hadley ;-)
Regarding my comment about cc_sea:

> cc_sea gives TRUE if coordinates fall on land, and FALSE if coordinates fall on sea. I think this is confusing and may induce errors in future users (expecting TRUE if falling on sea, given the function name). To respect the convention across the whole package about TRUE/FALSE, maybe then rename the function as cc_land? Which would give TRUE when a location really falls on land.

I later realised that all functions in the package follow the same scheme, e.g. cc_urb also gives TRUE if coordinates are OK and FALSE if coordinates fall in an urban area. This scheme might still confuse some quick users at first, because a TRUE result actually means a negative response to the question asked by the function (does it fall on sea? does it fall in an urban area? etc. - TRUE means NO). But as it is consistent across the package, it's probably OK.

In other words, please ignore my comment above about cc_sea!
Thanks
:wave: @azizka do you have an idea of when you'll be able to respond to @Pakillo's review? Thank you!
Hi @Pakillo, thanks for the great review. @maelle I am currently without stable internet, but will work on this next week. Cheers!
Thanks for the update @azizka, enjoy your time off!
Hej,
I have now addressed most comments of the second review. The last thing remaining is the error reported by @Pakillo with paleobioDB
The PBDB vignette PBDB doesn't run for me. It actually seems a problem with the paleobioDB package? dat <- pbdb_occurrences(base_name = "Magnoliopsida", vocab = "pbdb", limit = 5000, show = c("coords", "phylo", "attr", "loc", "time", "rem")) Error in !sapply(data, function(l) is.null(l) | (ncol(l) == 0) | (nrow(l) == : invalid argument type
I could not reproduce this error, but it also seems to be the problem that crashes travis. @Pakillo which operating system and R version did you use?
I'm looking at that vignette. There shouldn't be a chunk with eval=TRUE containing install_github("ropensci/paleobioDB"), since the package is in Suggests.
I get the error too, here it is below with session info.
```r
dat <- paleobioDB::pbdb_occurrences(base_name = "Magnoliopsida", vocab = "pbdb", limit = 5000,
                                    show = c("coords", "phylo", "attr", "loc", "time", "rem"))
#> Error in !sapply(data, function(l) is.null(l) | (ncol(l) == 0) | (nrow(l) == : invalid argument type
rownames(dat) <- NULL
#> Error in rownames(dat) <- NULL: object 'dat' not found
```

Created on 2018-07-30 by the reprex package (v0.2.0).
Hi,
I have updated all my packages and still get this error.
This is my session info:
Session info -----------------------------------------------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
system x86_64, mingw32
ui RStudio (1.1.453)
language (EN)
collate English_United Kingdom.1252
tz Europe/Paris
date 2018-07-31
Packages ---------------------------------------------------------------------------------------------------------
package version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
base 3.5.1 2018-07-02 local
bindr 0.1.1 2018-03-13 CRAN (R 3.5.0)
bindrcpp 0.2.2 2018-03-29 CRAN (R 3.5.0)
bitops 1.0-6 2013-08-17 CRAN (R 3.5.0)
colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0)
compiler 3.5.1 2018-07-02 local
CoordinateCleaner 2.0-1 2018-07-31 Github (azizka/CoordinateCleaner@1d95007)
countrycode 1.00.0 2018-02-11 CRAN (R 3.5.0)
crayon 1.3.4 2017-09-16 CRAN (R 3.5.0)
curl 3.2 2018-03-28 CRAN (R 3.5.0)
datasets 3.5.1 2018-07-02 local
devtools 1.13.6 2018-06-27 CRAN (R 3.5.1)
digest 0.6.15 2018-01-28 CRAN (R 3.5.0)
dplyr 0.7.6 2018-06-29 CRAN (R 3.5.1)
geosphere 1.5-7 2017-11-05 CRAN (R 3.5.0)
ggplot2 3.0.0 2018-07-03 CRAN (R 3.5.1)
git2r 0.23.0 2018-07-17 CRAN (R 3.5.1)
glue 1.3.0 2018-07-17 CRAN (R 3.5.1)
graphics 3.5.1 2018-07-02 local
grDevices 3.5.1 2018-07-02 local
grid 3.5.1 2018-07-02 local
gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
gtools 3.8.1 2018-06-26 CRAN (R 3.5.0)
httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
knitr 1.20 2018-02-20 CRAN (R 3.5.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.5.1)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
maps 3.3.0 2018-04-03 CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
methods 3.5.1 2018-07-02 local
munsell 0.5.0 2018-06-12 CRAN (R 3.5.1)
paleobioDB 0.5.0 2016-09-02 CRAN (R 3.5.1)
pillar 1.3.0 2018-07-14 CRAN (R 3.5.1)
pkgconfig 2.0.1 2017-03-21 CRAN (R 3.5.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
purrr 0.2.5 2018-05-29 CRAN (R 3.5.0)
R6 2.2.2 2017-06-17 CRAN (R 3.5.0)
raster 2.6-7 2017-11-13 CRAN (R 3.5.0)
Rcpp 0.12.18 2018-07-23 CRAN (R 3.5.1)
RCurl 1.95-4.11 2018-07-15 CRAN (R 3.5.1)
rgeos 0.3-28 2018-06-08 CRAN (R 3.5.1)
rjson 0.2.20 2018-06-08 CRAN (R 3.5.0)
rlang 0.2.1 2018-05-30 CRAN (R 3.5.0)
rstudioapi 0.7 2017-09-07 CRAN (R 3.5.0)
scales 0.5.0 2017-08-24 CRAN (R 3.5.0)
sp 1.3-1 2018-06-05 CRAN (R 3.5.0)
stats 3.5.1 2018-07-02 local
tibble 1.4.2 2018-01-22 CRAN (R 3.5.0)
tidyselect 0.2.4 2018-02-26 CRAN (R 3.5.0)
tools 3.5.1 2018-07-02 local
utils * 3.5.1 2018-07-02 local
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
yaml 2.2.0 2018-07-25 CRAN (R 3.5.1)
I think the problem is in the paleobioDB package, particularly when using gtools::smartbind to make a data frame with records. This is my traceback of the error:

    6: smartbind(df, reg)
    5: .make_data_frame(data_list[["records"]])
    4: .parse_raw_data(raw_data)
    3: .get_data_from_uri(uri)
    2: .pbdb_query("occs/list", query = l)
    1: paleobioDB::pbdb_occurrences(base_name = "Magnoliopsida", vocab = "pbdb", limit = 5000,
           show = c("coords", "phylo", "attr", "loc", "time", "rem"))
Maybe if you file an issue at paleobioDB they can help fix it
Cheers
Alright, thanks!
This seems to be a problem of paleobioDB with R 3.5.0 and upwards. I posted an issue at the paleobioDB repository (https://github.com/ropensci/paleobioDB/issues/34).
@azizka if that's fine with you, I'd suggest we wait for the paleobioDB issue to be solved before asking the reviewers for their final stamp of approval.
sure, that's fine with me, thanks @maelle. There are some time constraints on my side for getting the build to work, though. So in case the issue with paleobioDB cannot be fixed within the next two weeks or so, I'll try to find alternative data for the vignette, and for instance include it in the package. I hope that's also OK.
Hej @maelle,
since it will take a while until the issue with paleobioDB is solved, I have for now included an example fossil dataset in the package, which can be used in the vignette. The Travis build is passing now.
The test coverage of the package dropped slightly again due to the code additions from the review. I will work on this, but am already posting my replies to the second review now, so things can move ahead.
cheers,
Alex
Hej @Pakillo and @maelle ,
thanks so much for the constructive review. I have addressed almost all comments. Below is a point-by-point reply:
Installation of current version from GitHub takes a 93 MB download, which seems maybe too much? Apparently ~70 MB of it is git history and files (.git folder). I wonder if there is some big file hidden in history, which could perhaps be removed (e.g. using https://rtyley.github.io/bfg-repo-cleaner/). That package size was not a problem for me but could be for some users with slow connections. I think rOpenSci experts can advise more on this, or confirm this is actually OK.
Done. Used bfg as suggested, reduced repository size to c. 10 MB.
Apart from git history, there are some big files in the current version of the repo. First, two largish PDF files of rendered vignettes in Tutorials folder which can probably be removed? I am not sure about rOpenSci conventions here, but maybe you can store just HTML rendered versions of the vignettes (which weigh much less, ~110 KB for the GBIF vignette) in a gh-pages branch so they can be seen nicely. Or, even better, use pkgdown to create a website for your package, including not only rendered vignettes but also rendered examples, etc.
Done. Removed the Tutorials folder, and included all tutorials as vignettes in the package. Created a pkgdown page for the repository as suggested (https://azizka.github.io/CoordinateCleaner/). Also added the former wiki as articles to the page.
The other second set of large files are data files (particularly GIS datasets like landmass.rda, >2 MB, or urbanareas.rda, >10MB). As most of these datasets actually come from Natural Earth, I think they could perhaps be downloaded on purpose when needed, probably using rnaturalearth package. Also, some of these datasets are directly available in rnaturalearthdata and rnaturalearthhires packages (all already part of rOpenSci), so you could maybe just use them so as not to duplicate files.
Partly done. Removed urbanareas.rda and switched to data from rnaturalearth::ne_download as the default for cc_urb. I kept landmass.rda, since it is used extensively in the package, and no equivalent is directly available from Natural Earth (to my knowledge).
The Readme is good, just a few typos and minor comments. I have also made a pull request fixing some typos I found along the way.
Done, accepted the pull request.
Typo in badge linking to Travis (fixed in pull-request). Also, Travis is giving build error, but I see you are already on track.
Done. Fixed the typo and solved the Travis build error related to paleobioDB; the Travis build is passing now.
Provide hyperlinks to NEWS, vignettes, contributing files, so the reader just have to click from README to see them?
Done. Added the links.
When discussing related packages, I would also mention biogeo (https://github.com/cran/biogeo).
Done. Added the link.
Use README.Rmd? So the reader can see the output of the example code in README without having to run it by themselves. Maybe do not show all output, but some of it.
Not done. I prefer not to show output in the Readme, since I find this reduces readability and makes it more difficult to find relevant information. The vignettes do show detailed output of most functions.
Current example code gives errors. I think it needs updating after the API changes: dsl <- clean_coordinatesDS(exmpl) gives error (Changed to dsl <- clean_dataset(exmpl) in pull request) clean_coordinatesFOS must be changed to clean_fossils dc_ddmm does not exist. Change to cd_ddmm? tc_range does not exist. Has it changed to cf_range?
Done. Fixed error due to API changes following the first review. Sorry about that.
As explained above, consider removing large data files from the package? Use rnaturalearth and associated packages instead (as some functions, e.g. cc_coun, already do). Maybe only buffland_1deg (currently in extra_gazetteers) would need to be included.
Done. See above.
Looks like data in capitals.rda (and parts of centroids.rda) are already contained in countryref.rda. If possible, avoid duplication.
Done. Combined all relevant data in one dataset (“countryref”)
institutions: If there is an script used to build this data frame, it might be good to include it (e.g. in a data-raw folder) for the sake of reproducibility and possibility of making further changes in the future.
Not done. There were scripts involved for the automated geo-referencing and data wrangling, but since most of the data had to be georeferenced manually, there is no consistent pipeline.
extra-gazetteers
urbanareas.rda is >10 MB big, and countryborders >2 MB. As all these seem to come from Natural Earth, consider using rnaturalearth to access them when needed?
Done. Removed the files from the extra gazetteers folder, they are now fetched using rnaturalearth on demand
institutions_utf8 seems duplicated with institutions.rda in data folder? (only having 1 row more). Keep only one?
Done. Removed institutions_utf8
I think the documentation (Rd) files in this folder are not available to the user. Move all the data to the data folder, and make sure all are documented?
Done. Removed institutions_utf8 and urbanareas (see above), moved buffland_1deg to the data folder and added documentation.
Rd file for CoordinateCleaner-package.Rd needs to be revised. Seems the conversion to ROxygen hasn't worked well.
Done. Fixed the package documentation.
I think the help files for CleanCoordinates, CleanCoordinatesDS, CleanCoordinatesFOS, should mark them as Deprecated, see https://ropensci.github.io/dev_guide/evolution.html#functions-deprecate-defunct.
Done. I created a deprecated help file as suggested. I have a question, however: the file can be accessed without a problem via ??CoordinateCleaner, but ?CoordinateCleaner-deprecated throws an error. Any ideas why? @Pakillo, @maelle
Default longitude and latitude arguments for fossils functions (cf_outl, cf_range...) are "lng" and "lat", respectively, while the params explanation states the default are "decimallongitude" and "decimallatitude" (as in the rest of the package). Use the latter throughout?
Done. The functions now consistently use "decimallongitude" and "decimallatitude". The cf_* functions also accept "lat" and "lng" (albeit issuing a warning), since this is the default in the Paleobiology Database.
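A minimal sketch of such a fallback (hypothetical helper, not the package's actual code):

```r
# Pick the longitude column, falling back to the Paleobiology Database
# name "lng" with a warning when the package default is missing.
pick_lon <- function(x, lon = "decimallongitude") {
  if (!lon %in% names(x) && "lng" %in% names(x)) {
    warning("column '", lon, "' not found, using 'lng' instead")
    lon <- "lng"
  }
  lon
}

fossils <- data.frame(lng = c(12.5, -68.2), lat = c(55.7, -16.5))
pick_lon(fossils)
#> Warning: column 'decimallongitude' not found, using 'lng' instead
#> [1] "lng"
```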
Several functions using Spatial* objects assume they already have a geographical (latlon) projection (e.g. cc_cap, L67-68). As a suggestion, given they are Spatial objects it might be better to read their projection (proj4string) directly and ensure it is indeed correct. Otherwise, give warning and optionally make reprojection on the fly? (e.g. using spTransform).
Done. Projection testing and reprojection for custom references implemented as suggested.
Also, as sf will become more and more popular consider enabling them also as arguments, not only sp objects (e.g. doing an internal conversion as(sf, "Spatial")).
Done. Implemented as suggested, enabling sf-style custom references.
cc_coun: Although there exists cc_sea, I think it would be good if this function flagged every coordinate not within the supposed country, even if those coordinates fall on sea.
Done. The function now flags marine records.
cc_sea gives TRUE if coordinates fall on land, and FALSE if coordinates fall on sea. I think this is confusing and may induce errors in future users (expecting TRUE if falling on sea, given the function name). To respect the convention across the whole package about TRUE/FALSE, maybe then rename the function as cc_land? Which would give TRUE when a location really falls on land.
Not done (see comment below). The naming scheme is such that the name of each function is associated with the problem it tests (e.g. cc_equal, cc_gbif), and since it is consistent, I hope it can stay as is.
cc_urb: download urbanareas from Natural Earth on demand?
Done.
plot.spatialvalid seems to use landmass.rda for plotting, but it could also use borders from ggplot2 (probably lighter and faster, and in case landmass is finally taken out of the package).
Done. Switched to ggplot2::borders for plotting, much faster now.
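For reference, a minimal sketch of the ggplot2::borders() approach with made-up points (borders() pulls its outlines from the maps package, which needs to be installed):

```r
library(ggplot2)

# hypothetical occurrence records
pts <- data.frame(decimallongitude = c(12.58, -68.92, 151.21),
                  decimallatitude  = c(55.68,  10.32, -33.87))

ggplot() +
  borders("world", colour = "grey60", fill = "grey90") +
  geom_point(data = pts,
             aes(x = decimallongitude, y = decimallatitude),
             colour = "darkred") +
  coord_fixed()
```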
clean_dataset: the example flags <- clean_dataset(test) gives warning: In FUN(X[[i]], ...) : Geographic spann to small, check 'min_span'
Done. Changed the geographic span of the example dataset.
Dependencies
The tidyverse package is listed in Suggests and effectively loaded in vignettes, but I think only a small subset of tidyverse packages are needed? Maybe better to use only those?
Done. Removed tidyverse from the vignettes and the DESCRIPTION file and replaced it with dplyr + ggplot2.
Is the Tutorials folder with rendered PDF vignettes necessary? Can use lighter HTML vignettes than PDF? Use pkgdown?
Done. Removed the PDF vignettes. Created a webpage using pkgdown (https://azizka.github.io/CoordinateCleaner/).
The PBDB vignette PBDB doesn't run for me. It actually seems a problem with the paleobioDB package? dat <- pbdb_occurrences(base_name = "Magnoliopsida", vocab = "pbdb", limit = 5000, show = c("coords", "phylo", "attr", "loc", "time", "rem")) Error in !sapply(data, function(l) is.null(l) | (ncol(l) == 0) | (nrow(l) == : invalid argument type
Done. Removed the download using paleobioDB.
The vignettes need a further revision to update links, pkg versions, typos, etc.
Done. Corrected spelling, removed references to the package version and reformatted links.
Thanks @azizka! @Pakillo , can you have a look at @azizka's answer?
Regarding the pkgdown website, I'd recommend grouping functions, cf. https://ropensci.github.io/dev_guide/building.html#function-grouping
Could you ask your question about the deprecation in rOpenSci's forum @azizka if @Pakillo can't answer? discuss.ropensci.org 😺
Thanks @azizka
Re: the problem with consulting deprecated functions, I think there is a typo here: https://github.com/azizka/CoordinateCleaner/blob/master/R/CoordinateCleaner-package.R#L84 (should be capital C)
Then, to access the help file for deprecated functions (for any package), I need to wrap the topic in backticks:

```r
?`CoordinateCleaner-deprecated`
```
Does that work for you?
@azizka Any news?
Hej,
yes it does, I fixed it as suggested. Thanks @Pakillo.
@pakillo @isteves are you both now happy with the package? If so can you check "The author has responded to my review and made changes to my satisfaction. I recommend approving this package." in your review? Thanks!
Hi,
Yes, I'm very happy with the package. Many thanks @azizka for the hard work, @isteves for your comments, and @maelle for your excellent editorial work.
A few final comments:
Travis is still giving an error due to undocumented arguments in cc_inst (see the Travis log). Please fix.
There are a number of broken URL links across function help files, vignettes... (e.g. many pointing to the now deleted wiki, or to the former extra-gazetteers folder). Please revise. (I wonder if there is a tool to check for broken links in R packages, @maelle?)
Not required, but I agree it would be very helpful to group functions by theme (e.g. cleaning records, datasets, fossils...) in the pkgdown website as @maelle recommended (https://ropensci.github.io/dev_guide/building.html#function-grouping)
The help file for deprecated functions in the pkgdown website appears missing (https://azizka.github.io/CoordinateCleaner/reference/CoordinateCleaner-deprecated.html). But it seems to be due to the same typo in the package name we saw before (missing capital C), because https://azizka.github.io/CoordinateCleaner/reference/Coordinatecleaner-deprecated.html does work for me. Please ensure that Coordinatecleaner-deprecated.Rd gets changed to CoordinateCleaner-deprecated after roxygenising.
At the Documentation section in the GitHub repo Readme, maybe point to rendered vignettes (i.e. https://azizka.github.io/CoordinateCleaner/articles/, at website built with pkgdown) rather than Rmd sources on github (https://github.com/azizka/CoordinateCleaner/tree/master/vignettes)?
Also at the Readme, when talking about other related packages, maybe point to the vignette comparing their functionality? (i.e. https://azizka.github.io/CoordinateCleaner/articles/comparison_other_software.html). I think this will particularly help people approaching the package for the first time and having to decide which one to use, or how CoordinateCleaner differs from the others.
Hope this helps.
Cheers!
I wonder if there is a tool to check for broken links in R packages
At least CRAN must have one... will ask in Slack.
Hi,
I went through @Pakillo's latest comments (thanks!):
Travis is still giving error due to undocumented arguments in cc_inst (see Travis log). Please fix.
Done. Travis build is succeeding now.
There are a number of broken URL links across function help files, vignettes... (e.g. many pointing to the now deleted wiki, or to the former extra-gazetteers folder). Please revise. (I wonder if there is a tool to check for broken links in R packages @maelle ?)
Done. Replaced the links to the wiki with links to the documentation page and removed the link to the extra gazetteers.
Not required, but I agree it would be very helpful to group functions by theme (e.g. cleaning records, datasets, fossils...) in the pkgdown website as @maelle recommended (https://ropensci.github.io/dev_guide/building.html#function-grouping)
Done. Grouped the functions according to the cleaning target.
The help file for deprecated functions in the pkgdown website appears missing (https://azizka.github.io/CoordinateCleaner/reference/CoordinateCleaner-deprecated.html). But it seems to be due to the same typo in the package name we saw before (missing capital C), because https://azizka.github.io/CoordinateCleaner/reference/Coordinatecleaner-deprecated.html does work for me. Please ensure that Coordinatecleaner-deprecated.Rd gets changed to CoordinateCleaner-deprecated after Roxygenising.
Done. CoordinateCleaner-deprecated is working now.
At the Documentation section in the GitHub repo Readme, maybe point to rendered vignettes (i.e. https://azizka.github.io/CoordinateCleaner/articles/, at website built with pkgdown) rather than Rmd sources on github (https://github.com/azizka/CoordinateCleaner/tree/master/vignettes)?
Done. Changed as suggested
Also at the Readme, when talking about other related packages, maybe point to the vignette comparing their functionality? (i.e. https://azizka.github.io/CoordinateCleaner/articles/comparison_other_software.html). I think this will particularly help people approaching the package for the first time and having to decide which one to use, or how CoordinateCleaner differs from the others.
Done. Added the links.
How will things proceed now?
Cheers,
Alex
We'll wait for @isteves to chime in when she has time, and then we should be close to the end of this process. 😺
Hey all,
I took another brief look through the package. Rather than getting deep into the code, I checked it out the way I would any new package--I looked through the pkgdown docs, tested out the examples, and checked out bits and pieces of the documentation.
I'm a fan of the pkgdown website. I think it's a great addition, and it made it easy to re-orient myself to the package. Specifically, I liked:
(note--mostly just checked the pkgdown version of these...)
- "In the R-environment the scrubr an biogeo offer cleaning approaches complementary to CoordinateCleaner."
  Sidenote: My personal preference would be to avoid starting a sentence with a lower-case package name, but it's up to you: scrubr combines basic geographic cleaning… -> The scrubr package combines…
- https://azizka.github.io/CoordinateCleaner/articles/Background_the_institutions_database.html#data-compilation Image not showing. It's trying to link to this
- "Identifying erroneous coordinates using"
- "We based the choice of tests on common problems observed in biological collection databases (see for example (???))"
- "poison" 💀
- I like the output of clean_coordinates, but I found it strange that some columns from the initial data were dropped. I was expecting behavior more similar to broom::augment.
- The documentation of clean_dataset confused me. Given the description ("Identifies potentially problematic coordinates..."), I was expecting to get flagged coordinates as an output, rather than the output of a test. Something like this may help: "Tests for problems associated with coordinate conversions and rounding...." Perhaps documenting the output with @return could also help.
- cc_sea downloads something each time I run it. Perhaps you can download the relevant files just the first time the user runs the function, with something along the lines of:

```r
path <- file.path(system.file(package = "CoordinateCleaner"), "sea.EXT")

if (!file.exists(path)) {
  download_sea(path)
}

sea <- load_sea(path)
```
- I like your example of the cc_* functions + the pipe. An example where flags are added as extra columns may also be worth including:

```r
exmpl %>%
  as_tibble() %>%
  mutate(val = cc_val(., value = "flagged"),
         sea = cc_sea(., value = "flagged"))
```

- I recently learned about the @inheritParams/@inherit tags, which was such a game-changer! You can define your parameter(s) once and use `@inheritParams fxn_name` for any other function that uses the parameter(s) (and same for sections). Definitely give it a try at some point if you haven't already 😄

I think it's almost good to go!
-Irene
Many thanks @isteves !
Thanks @isteves!
I have now included almost all suggestions; please see a point-by-point reply below. I am very sorry for the typos; the documentation has become so large now that some always slip through.
Thanks especially for the hint with @inherit. I am also continuously working to increase the test coverage. Cheers,
Alex
In the R-environment the scrubr an biogeo offer cleaning approaches complementary to CoordinateCleaner.
Fixed. Changed to ‘and’.
Sidenote: My personal preference would be to avoid starting a sentence with a lower-case package name, but it’s up to you: scrubr combines basic geographic cleaning… -> The scrubr package combines…
Fixed. Changed as suggested.
Image not showing. It's trying to link to this
Fixed. Removed figure and replaced by a link.
Identifying erroneous coordinates using
Fixed. Changed to ‘with’.
We based the choice of tests on common problems observed in biological collection databases (see for example (???))
Fixed. Fixed citation.
poison :skull:
Fixed. Changed to ‘Poisson’. :fish:
Increase point size? I thought I was looking at a blank plot at first...
Not fixed. I completely agree that the points are hard to see, which is annoying. However, this plot shows the analysis matrix as used during the test. So in this case it was a 1000 x 1000 matrix, hence the resulting diagnostic plot has the same resolution and individual cells are plotted very small. I think it is preferable to live with this visualization to keep the link between the test and the output plot, also in the vignette.
I like the output of clean_coordinates, but I found it strange that some columns from the initial data were dropped. I was expecting behavior more similar to broom::augment.
Fixed. Changed as suggested. The “spatialvalid” output of clean_coordinates and clean_fossils now add the test columns to x, similar to broom::augment.
The documentation of clean_dataset confused me. Given the description ("Identifies potentially problematic coordinates..."), I was expecting to get flagged coordinates as an output, rather the output of a test. Something like this may help: "Tests for problems associated with coordinate conversions and rounding...." Perhaps documenting the output with @return could also help
Fixed. Changed the description as suggested, and improved the documentation of the output with @return.
for the cc_* functions, the default behavior is to filter out "bad" records, but the message says "Flagged XX records." For me, "flagged" ~ "marked." Changing the wording to "removed"/"omitted"/"filtered out" would more clearly indicate to me that the function removes records.
Fixed. Changed the wording in the function title to “Identify” and in the description to “Removes or flags”.
cc_sea downloads something each time I run it. Perhaps you can download the relevant files just the first time the user runs the function with something along the lines of:

```r
path <- file.path(system.file(package = "CoordinateCleaner"), "sea.EXT")
if (!file.exists(path)) { download_sea(path) }
sea <- load_sea(path)
```
Fixed. Implemented as suggested for cc_sea and cc_urb
I like your example of the cc_* functions + the pipe. An example where flags are added as extra columns may also be worth including: exmpl %>% as_tibble() %>% mutate(val = cc_val(., value = "flagged"), sea = cc_sea(., value = "flagged"))
Fixed. Added the suggested example to the “Quickstart_Flagging_problematic_coordinates_in_a_nutshell” and Tutorial_Cleaning_GBIF_data_with_CoordinateCleaner vignettes.
I recently learned about the @inheritParams/@inherit tags, which was such a game-changer! You can define your parameter(s) once and use @inheritParams fxn_name for any other function that uses the parameter(s) (and same for sections). Definitely give it a try at some point if you haven't already
Fixed. Used @inheritParams and @inherit throughout the documentation. Very nice, thanks!
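For reference, a small sketch of the pattern with hypothetical functions (not the package's actual ones):

```r
#' Flag invalid coordinates
#'
#' @param x a data.frame with occurrence records.
#' @param lon,lat names of the longitude and latitude columns in x.
#' @return A logical vector, TRUE for records that pass the test.
#' @export
flag_valid <- function(x, lon = "decimallongitude", lat = "decimallatitude") {
  !is.na(x[[lon]]) & !is.na(x[[lat]])
}

#' Flag records near country capitals
#'
#' @inheritParams flag_valid
#' @param buffer radius around each capital, in degrees.
#' @inherit flag_valid return
#' @export
flag_capitals <- function(x, lon = "decimallongitude", lat = "decimallatitude",
                          buffer = 0.1) {
  # x, lon and lat are documented once in flag_valid() and inherited here
  rep(TRUE, nrow(x))  # placeholder body for illustration
}
```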
Hey @azizka, it looks great! @maelle -- anything else we should do?
Summary
What does this package do? (explain in 50 words or less): Identify problematic records in large databases of biological and palaeontological collections, to improve data quality for analyses in biogeography, ecology and conservation.
Paste the full DESCRIPTION file inside a code block below:
URL for the package (the development repository, not a stylized html page): https://github.com/azizka/CoordinateCleaner
Please indicate which category or categories from our package fit policies this package falls under and why (e.g., data retrieval, reproducibility; if you are unsure, we suggest you make a pre-submission inquiry):
geospatial data, because the package deals with improving data quality of occurrence records of biological specimens
reproducible research, because the package replaces potentially badly documented ad-hoc decisions from GUI GIS with clearly defined functions
data munging, because the package processes commonly used geospatial data (geographic coordinates)
Who is the target audience and what are scientific applications of this package? Anybody using geographic coordinates from biological collections on a large scale, thus mostly researchers in biogeography, (macro-)ecology, evolutionary biology, and conservation practitioners
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? Yes, scrubr, see this pre-submission enquiry: https://github.com/ropensci/onboarding/issues/199
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted. https://github.com/ropensci/onboarding/issues/199, @sckott
Requirements
Confirm each of the following by checking the box. This package:
Publication options

paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.

Detail
- [x] Does `R CMD check` (or `devtools::check()`) succeed? Paste and describe any errors or warnings:
- [x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

Exceptions:
- NEWS file: there is one now, but it only started with the latest version.
- Package name: unfortunately, the name contains capital letters (CoordinateCleaner), but it is on CRAN already.
- Function naming: functions for individual tests are snake_case and pipe compatible, wrapper functions around all tests are CamelCase.
- Documentation: not built with roxygen.
If this is a resubmission following rejection, please explain the change in circumstances: Nope
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
SaraVarela, sckott