ropensci / statistical-software-review-book

Guide for development and peer-review of statistical software
https://stats-devguide.ropensci.org

Spatial standards - review by JN #29

Closed: Nowosad closed this issue 3 years ago

Nowosad commented 3 years ago

Hi @mpadge, I read the spatial standards section and added some tiny changes to the text. I also have some comments:

  1. I found some standards to be too strict or limiting. I assume that these standards should be considered as technical standards / best practices, rather than research limitations... I think changing from "should" to "should consider" could be an improvement. For example:

    • "SP2.4a Software should provide an ability to convert objects in any new class systems into representations of pre-existing classes such as those listed above." - I would assume some new class systems could be just used to store non-spatial results. What then?
    • "SP3.1 Spatial software which considers spatial neighbours should enable neighbour contributions to be weighted by distance (or other weighting variable), and not rely on a uniform-weight rectangular cut-off." - I assume some algorithms which consider spatial neighbours could make no sense when weighting by distance (e.g., categorical variables, use of dissimilarity measures, ...).
    • "SP3.3 Spatial regression software should explicitly quantify and distinguish autocovariant or autoregressive processes from those covariant or regressive processes not directly related to spatial structure alone." - I do not know if it is always possible...
    • "SP4.1 Any units included as attributes of input data should also be included within return values." - Let's say that input data has 10 variables (attributes) with different units. The used algorithm does not use any of the attributes (maybe it just uses coordinates) - so why should it include all the units in the output? Return of the units should be done only when the units are important in calculations... I also have a similar impression of "SP2.9 The pre-processing function described above should maintain all metadata attributes of input data.".
    • "SP5.0 Implement default plot methods for any implemented class system." - Even when the class system does not store spatial data anymore (e.g., calculating spatial metrics)?
    • "SP5.3 Offer an ability to generate interactive (generally html-based) visualisations of results." - It is similar to my previous point, but also interactive visualizations could not be possible (or just not suitable) for large spatial datasets (e.g., tens of millions of raster cells).
  2. Two standards seem to be better served by PROJ than routines in R packages:

    • "SP2.3b Documentation should explicitly clarify whether, and under which conditions, geographical coordinates are expected to be longitude-latitude or latitude-longitude." - I think PROJ should take care of it, as the order depends on the projection (?).
    • "SP6.0 Software which implements routines for transforming coordinates of input data should include tests which demonstrate ability to recover the original coordinates." - Isn't that the role of PROJ?
  3. "The latter standard, SP3.6, is commonly met by applying some form of spatial partitioning to data, and using spatially distinct partitions to define test and training data." - I think some references could be added to this standard (e.g., https://geocompr.robinlovelace.net/spatial-cv.html, https://doi.org/10.1109/IGARSS.2012.6352393, https://www.sciencedirect.com/science/article/abs/pii/S0304380019302145, https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13107)

  4. What does "Testing" mean in 5.9?

  5. 5.9: "Return the result of that algorithmic application" - in my opinion, it is obvious that an analytic algorithm will return the result. Maybe this point is not necessary?

  6. "Many developers of spatial software in R, including those responsible for the CRAN Task view on “Analysis of Spatial Data”," - what do you mean in the second part of this sentence? Roger is the only person "responsible" for this task view... How about "Many spatial software in R, including those featured in the CRAN Task view on “Analysis of Spatial Data”"?

  7. I do not understand the following standard: "SP3.2 Spatial software which relies on sampling from input data (even if only of spatial coordinates) should enable sampling procedures to be based on local spatial densities of those input data."

  8. I also think it would be good to inform/ask for feedback from other #rspatial developers, for example, by opening a new issue at https://github.com/r-spatial/discuss/issues.

  9. Code/data examples are not part of the spatial standards, but maybe they should be? Quality examples/documentation/vignettes are vital for users. I also think that, in most cases, example data should be stored in spatial data formats rather than R objects (of some class).

  10. I often see new R packages that extend existing R packages (and thus could work great together), yet they use different standards (different function naming, argument names and order, output structures, etc.). I think it could be useful if developers listed similar (related) packages during the review process and explained why they decided to use different standards (if applicable).

I hope that, at least, some of my comments make sense and could be useful in this project. Best, Jakub

mpadge commented 3 years ago

Thanks so much @Nowosad for that extremely useful feedback, responses to which are documented here. Note that these do not follow your numbers, and that 1-5 below all form part of your point 1.

  1. New paragraph inserted at start of General Standards:

Note that these and all category-specific standards that follow are intended to serve as recommendations for best practices. Note in particular that many standards are written using the word "should" in explicit acknowledgement that adhering to such standards may not always be possible. All standards phrased in these terms are intended to be interpreted as applicable under such conditions as "Where possible", or "Where applicable". Developers are requested to note any standards which they deem not applicable to their software via the srr package, as described in Chapter 3.

Your concern about SP3.3, that it may not always be possible to distinguish autocovariant processes from covariant processes not directly related to spatial structure, is in this context okay, because not all standards are necessarily expected to be applicable to all software.

  2. Standard SP2.4 has been revised in response to your comment that:

"SP2.4a Software should provide an ability to convert objects in any new class systems into representations of pre-existing classes such as those listed above." - I would assume some new class systems could be just used to store non-spatial results. What then?

That standard comes under the general sub-heading of Input Data, and has been reformulated like this:

SP2.4a Software which implements new classes to input spatial data (or the spatial components of more general data) should provide an ability to convert such input objects into alternative spatial classes such as those listed above.
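
To illustrate one way such a conversion could be offered, here is a purely hypothetical sketch (not text from the standards): a new class "mygrid" is assumed to store point coordinates in `x$coords` (columns "x" and "y") and attribute values in `x$values`, and a generic converter returns an equivalent sf object.

```r
# Hypothetical sketch: converting a new spatial class ("mygrid") to sf;
# the class structure and CRS are assumptions for illustration only.
as_sf <- function(x, ...) UseMethod("as_sf")

as_sf.mygrid <- function(x, crs = 4326, ...) {
  df <- data.frame(x$coords, value = x$values)
  sf::st_as_sf(df, coords = c("x", "y"), crs = crs)
}
```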

  3. Standard SP3.1 has been revised in response to your comment that:

"SP3.1 Spatial software which considers spatial neighbours should enable neighbour contributions to be weighted by distance (or other weighting variable), and not rely on a uniform-weight rectangular cut-off." - I assume some algorithms which consider spatial neighbours could make no sense when weighting by distance (e.g., categorical variables, use of dissimilarity measures, ...).

It now says, "should wherever possible enable neighbour contributions to be weighted by distance", with the second statement modified to say software should, "not rely exclusively on a uniform-weight rectangular cut-off." The qualifying statement has been slightly modified to read, "(or other continuous weighting variable)", and is intended to indicate that such weighting is to be interpreted in terms sufficiently general to apply where possible to categorical variables, which are often associated with continuous variables able to be used to weight associations between data. Application to dissimilarity measures is also generally possible in ways intended by this standard.
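
As a purely illustrative sketch of the contrast this standard has in mind (not part of the standard's text; the coordinates and decay parameters are made up):

```r
# Minimal sketch: uniform rectangular cut-off vs continuous distance-decay
# weighting of neighbour contributions, for a hypothetical set of points `xy`.
set.seed(1)
xy <- matrix(runif(20), ncol = 2)
d <- as.matrix(dist(xy))                     # pairwise distances

w_uniform <- (d > 0 & d < 0.25) * 1          # uniform weight within a fixed cut-off
w_decay <- exp(-d / 0.1)                     # continuous exponential distance decay
diag(w_decay) <- 0                           # no self-neighbour contribution
w_decay <- w_decay / rowSums(w_decay)        # row-standardised neighbour weights
```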

  4. Your comment in regard to SP4.1 that,

Return of the units should be done only when the units are important in calculations

has been addressed through revising that standard to now read,

SP4.1 Any aspects of input data which are included in output data (either directly, or in some transformed form) and which contain units should ensure those same units are maintained in return values.

Similarly, SP2.9 has been revised to read,

SP2.9 The pre-processing function described above should maintain those metadata attributes of input data which are relevant or important to core algorithms or return values.
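
As a purely illustrative sketch of what maintaining units in return values can look like (assuming the units package; the attribute and values are made up):

```r
# Minimal sketch: an input attribute that carries units, and which is used in
# the calculation, retains those units in the return value.
library(units)
elev <- set_units(c(120, 340, 560), "m")   # input attribute with units
mean(elev)                                 # result still carries units of "m"
```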

  5. Your visualisation concerns are very insightful and helpful, and led to revising that general section (SP5) with a preliminary caveat that,

Spatial Software which returns objects in a custom class structure explicitly designed to represent or include spatial data should:

along with a disclaimer immediately after those standards that,

The preceding three standards will generally not apply to software which returns objects in a custom class structure yet which is not inherently spatial.

Your second concern, that very large raster datasets may not be able to be visualised, has not been directly addressed. While that concern remains valid, the standards are hopefully flagged as only ever conditionally applicable, so developers can readily judge that particular one to be non-applicable; having it there in that form may nevertheless motivate attempts to develop new routines anyway.

  6. Your comments (your point 2) about the role of PROJ are helpful, in response to which SP2.3b has been removed, because yes, PROJ should take care of that. You also comment on SP6.0 that PROJ should also ensure that, "Software which implements routines for transforming coordinates of input data should include tests which demonstrate ability to recover the original coordinates". This standard admits the possibility that developers use (or implement) any arbitrary routines for coordinate transformation, with no necessary reliance on PROJ alone. The standard has therefore been left in the same form, but with an explanatory note after it that,

This standard is applicable to any software which implements any routines for coordinate transformations, even if those routines are implemented via PROJ. Conversely, software which has no routines for coordinate transformations need not adhere to SP6.0, even if that software relies on PROJ for other purposes.
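
A hedged sketch of what such a round-trip test could look like, assuming sf (which itself wraps PROJ) and testthat; the coordinates, EPSG codes, and tolerance are illustrative only:

```r
# Sketch of a round-trip coordinate transformation test; values are illustrative.
library(sf)
library(testthat)

test_that("coordinate transformation recovers original coordinates", {
  pts <- st_sfc(st_point(c(21.0, 52.2)), crs = 4326)    # longitude-latitude input
  back <- st_transform(st_transform(pts, 3857), 4326)   # forward, then inverse
  expect_equal(st_coordinates(pts), st_coordinates(back), tolerance = 1e-6)
})
```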

  7. (Your point 3) Suggested references added to SP3.6 - thank you very much for these!!

  8. (Your point 5) The statement that spatial software should, "Return the result of that algorithmic application" is necessary in our opinion because not all applications of algorithms return results, either because the primary function is called for some side-effect (such as visualisation), or because results are stored elsewhere rather than returned. This general statement is merely an encouragement to return something rather than nothing, yet the corresponding standards (SP4) are all only conditionally applicable anyway, so the point can readily be ignored, or at most considered an encouragement only.

  9. Your point 6, suggesting modifying the reference to the CRAN Task view on Analysis of Spatial Data, was helpful; that statement has been changed to, "including many of those featured on ...". Thanks!

  10. Your statement that you didn't understand SP3.2 has been addressed by adding an explanatory example following that standard:

An example of software which would not adhere to SP3.2 would be where input data were a simple matrix of spatial coordinates, and sampling were implemented using the sample() function to randomly select elements of those input data (like sample(nrow(xy), n)). In such a context, adhering to the standard would require including an additional prob vector where each point was weighted by the local density of surrounding points. Doing so would lead to higher probabilities of samples being taken from central clusters of higher densities than from outlying extreme points. Note that the standard merely suggests that software should enable such density-based samples to be taken, not that it must, or even necessarily should by default.
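
For concreteness, a minimal sketch of the density-weighted sampling described above (the coordinates and the crude density estimate are purely illustrative):

```r
# Sketch of density-weighted sampling: `xy` is a hypothetical matrix of point
# coordinates, and local density is approximated crudely as the inverse of each
# point's mean distance to all other points.
set.seed(1)
xy <- matrix(rnorm(200), ncol = 2)
d <- as.matrix(dist(xy))
local_density <- 1 / rowMeans(d)                 # larger for points in dense clusters
prob <- local_density / sum(local_density)
idx <- sample(nrow(xy), size = 20, prob = prob)  # density-weighted sample of rows
```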

  11. Your comment that, "Code/data examples are not part of the spatial standards, but maybe they should be?" is hopefully addressed in the General Standards, which include a number of explicit standards pertaining to documentation and examples. But your comment that example data "should be stored in spatial data formats rather than R objects (of some class)" is really useful, as reflected in the addition of the following new standard:

SP2.3 Software which accepts spatial input data in any standard format established in other R packages (such as any of the formats able to be read by GDAL, and therefore by the sf package) should include example and test code which load those data in original spatial formats.
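
A hypothetical sketch of what such example or test code might look like, assuming a file example.gpkg shipped in inst/extdata of a package mypkg (both names are made up):

```r
# Sketch: example/test data shipped in an original spatial format and loaded
# via sf (i.e. via GDAL), rather than stored as an R object of some class.
fname <- system.file("extdata", "example.gpkg", package = "mypkg")
dat <- sf::st_read(fname, quiet = TRUE)
```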

  12. Finally, your comment on the usefulness of developers listing "similar (related) packages during the review process" is hopefully addressed by G1.1.

We are deeply indebted to you for taking the time to consider these standards in such depth, and for devising such constructive and useful suggestions. This alone is a great contribution to the project as a whole! Please feel free to offer any further suggestions, edits, or comments at any stage in the future. Thank you!

Nowosad commented 3 years ago

@mpadge I am very happy that you find my comments useful.