r-spatial / discuss

a discussion repository: raise issues, or contribute!

Feedback wanted on rOpenSci standards for spatial statistical software #48

Open mpadge opened 3 years ago

mpadge commented 3 years ago

rOpenSci is expanding its scope of peer-reviewed software to include specifically statistical software. One of the categories we will consider in-scope is spatial software, for which we have developed a suite of standards which we would like general feedback on. The standards can be viewed in the main project document. Please note the following:

  1. These statistical standards extend from our set of General Standards;
  2. The project explicitly addresses statistical software, which is a relatively small domain within the larger ecosystem of spatial software in R; we are particularly interested in standards for the core algorithms of spatial statistical software.

This issue is accordingly intended to seek answers to the following primary question:

Please comment here, in rOpenSci's discussion forum, directly via issues in the project repository (like this one from @jsta), or - the best way of all!! - by helping directly with the standards via a pull request to that repo (like this one from @Nowosad). Please also help by suggesting any other improvements you might see to any aspect of these standards, beyond the specific question of algorithms. Hoping for some constructive input from the r-spatial community!

tim-salabim commented 3 years ago

@mpadge Regarding spatial cross validation, you could add the following paper:

https://www.sciencedirect.com/science/article/abs/pii/S0304380019303230

edzer commented 3 years ago

@mpadge great, a lot of work went into that! Reviewing it is also a lot of work; here's a first shot at it:

Again: great work, and happy to discuss!

mpadge commented 3 years ago

First and foremost, thanks for declaring that,

for terminology regarding coordinate systems, reference systems etc I think this set of standards should not reinvent the wheel but follow the OGC definitions in WKT2; we're trying to do the same in our upcoming book (Ch 2)

That is indeed a really good point, and the whole nomenclature of coordinate systems has been revised to be as consistent with these WKT2 definitions as possible. That has the primary effect of generally replacing the former "rectilinear" with "Cartesian". These standards nevertheless reflect a general purview of spatial software beyond "geodetic" systems (WKT2 terminology), such that curvilinear systems may include, for example, celestial coordinate systems, and may even extend to systems of unknown dimensionality. In contrast, WKT2 defines no "umbrella" term for such systems, and merely declares that elliptical coordinate systems are geodetic, while spherical systems are commonly associated with geodetic systems. In lieu of an external reference, these standards employ the term "curvilinear", but please feel free to suggest alternatives.
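For anyone wanting to see that WKT2 vocabulary concretely, something like the following minimal sketch (assuming sf built against PROJ >= 6, where CRSs print in WKT2 form) shows the distinction between a geodetic CRS with an ellipsoidal coordinate system and a projected CRS with a Cartesian one:

```r
library(sf)

# Geodetic CRS: the WKT2 output is a GEOGCRS with an ellipsoidal CS
cat(st_crs("EPSG:4326")$wkt, "\n\n")

# Projected CRS (UTM zone 32N): the WKT2 output describes a Cartesian CS
cat(st_crs("EPSG:32632")$wkt, "\n")
```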

The statement is that,

a spherical geometry is a two- (or maybe three-)dimensional curvilinear space.

It can still be 3-D though (x,y,z), and is notably always so in celestial coordinate systems which are still spherical, so I presume that statement is okay?
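Just to illustrate that point (a base-R sketch on a unit sphere with angles in degrees; nothing package-specific assumed): any point given in spherical (longitude, latitude) coordinates has an equivalent 3-D Cartesian (x, y, z) form.

```r
# Spherical (lon, lat) coordinates always have a 3-D Cartesian equivalent.
# Unit sphere assumed; angles in degrees.
sph_to_cart <- function(lon, lat) {
    lon <- lon * pi / 180
    lat <- lat * pi / 180
    cbind(x = cos(lat) * cos(lon),
          y = cos(lat) * sin(lon),
          z = sin(lat))
}
sph_to_cart(lon = c(0, 90, 0), lat = c(0, 0, 90))
# rows map (approximately) to (1, 0, 0), (0, 1, 0), and (0, 0, 1)
```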

Terminology updated as noted above, so identity of rectilinear with Cartesian is now explicit.

Point amended accordingly; thanks!

Happy to accept that, and so have replaced 2.0b (see below) with a new version stating this expectation.

SP2.0b Class systems should ensure that functions error appropriately, rather than merely warning, in response to data from inappropriate spatial domains.
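As a concrete - and entirely hypothetical - sketch of what compliance with that revised SP2.0b might look like; the function name and bounds check here are illustrative only, not taken from any actual package:

```r
# Hypothetical input check illustrating SP2.0b: error via stop(), rather
# than merely warn, when data fall outside the spatial domain a function
# is designed for -- here, geodetic longitude/latitude.
check_geodetic_bounds <- function(lon, lat) {
    if (any(abs(lon) > 180) || any(abs(lat) > 90)) {
        stop("coordinates lie outside geodetic (longitude/latitude) bounds; ",
             "these may be projected coordinates passed to a function ",
             "expecting unprojected data", call. = FALSE)
    }
    invisible(TRUE)
}

check_geodetic_bounds(lon = c(4.9, 174.8), lat = c(52.4, -36.9)) # passes
# check_geodetic_bounds(lon = 594929, lat = 5793939)             # errors
```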

Good point - reference to time explicitly included now, thanks!

Also a good point, and so SP2.0 has been revised to now state that all spatial software should use class systems, and SP2.0b has been effectively entirely removed, replaced with the new version given above.

Thanks - those examples have been removed now regardless, because the revision of SP2.0 described above states that software should error rather than warn.

Good point which made me realise that simply removing that point makes that paragraph clearer and more focussed. Thanks! (And as you well know, my osmdata package was to a large extent an "attempt to reconstruct aspects of these generic libraries," so yeah, doing so should be generally admitted to be a good thing indeed.)

Thanks, amended accordingly.

Yeah, that was unclear terminology introduced by me in response to @Nowosad's comment that spatial packages should demonstrate loading data in "spatial data formats rather than R objects". Revised to now say that packages should include:

test code which loads those data in spatial formats, rather than R-specific binary formats such as .Rds.

Feel free to suggest improvements!
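To sketch what that might look like in practice (a hedged example only: it assumes sf, the testthat framework, and a hypothetical GeoPackage shipped in a hypothetical package's inst/extdata):

```r
library(testthat)
library(sf)

test_that("functions accept data loaded from spatial formats", {
    # "example.gpkg" and "mypackage" are placeholders for illustration
    f <- system.file("extdata", "example.gpkg", package = "mypackage")
    skip_if(f == "", "example data not installed")
    x <- st_read(f, quiet = TRUE)      # load from a spatial format, not .Rds
    expect_s3_class(x, "sf")
    # ... then call the package's own functions on `x` ...
})
```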

Amended accordingly, thanks.

Sorry for the lack of clarity - that was indeed what I was trying to convey, with the updated version now stating:

SP2.4a Software should not permit coordinate reference systems to be defined on the basis of so-called "PROJ4-strings" alone.

Any further suggestions for improvement?
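One way to see what gets lost (a sketch assuming sf with PROJ >= 6; exact output will vary by version): the WKT2 derived from a bare PROJ4-string typically lacks the CRS name and authority identifier carried by the full EPSG/WKT2 definition.

```r
library(sf)

cat(st_crs("EPSG:4326")$wkt)                  # named CRS, with ID["EPSG", 4326]
cat(st_crs("+proj=longlat +datum=WGS84")$wkt) # typically named "unknown", no ID
```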

Yes, that is true. That was an extension of the equivalent time-series standard which is itself really only enabled because of the mighty tsbox package. But conversion of spatial classes is far more intricate than time series, so happy to remove that standard.

Thanks, that has been removed, also because revised general standards now include sufficient expectation of that kind of ability to accept class-based columns in general tabular inputs anyway, so should be covered there.

Thanks for the reference - that standard definitely "takes a model-based perspective," which that paper clarifies very well. I nevertheless think that some form of that standard should be retained, importantly noting that it only states that software "should enable sampling procedures to be based on ... densities". The paragraph immediately following that standard declares that,

the standard merely suggests that software should enable such density-based samples to be taken, not that it must, or even necessarily should by default.

An important link is again with the time-series standards, for which covariance estimation in particular has seen great recent advances through routines for what is effectively density-based sampling of covariance ellipsoids, leading to the equivalent covariance standard being elevated into the General class. The sole motivation of both that one (G3.1) and SP3.2 is to (strongly) encourage developers to at least think about these issues. Alternative sampling procedures do not have to be implemented by default, and in the case of covariance, compliance with G3.1 merely requires that the function generating covariance estimates be minimally exposed as a parameter, so that calculation methods can be substituted as desired by users. The standards attempt to avoid recommending any particular methodological approach, and only take forms like these where we have perceived that the "default" behaviour (stats::cov() for covariance; design-based sampling in the current context) can be extended by offering alternative modes. Happy to discuss further!
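For concreteness, the kind of pattern G3.1 has in mind is simply this (an illustrative sketch only; fit_spatial_model() is a made-up name, not from any actual package, and MASS::cov.rob() is just one example of an alternative estimator):

```r
# Expose the covariance estimator as a parameter, defaulting to stats::cov,
# so that users can substitute alternative estimators without any change to
# the rest of the function.
fit_spatial_model <- function(x, cov_fun = stats::cov, ...) {
    stopifnot(is.function(cov_fun))
    V <- cov_fun(x, ...)
    # ... the remainder of the fit would use V ...
    V
}

m <- matrix(rnorm(200), ncol = 2)
fit_spatial_model(m)                                              # default
fit_spatial_model(m, cov_fun = function(x) MASS::cov.rob(x)$cov)  # robust alternative
```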

Fair enough, that was quite proscriptive as formulated, and has been revised to,

SP3.4 Where possible, spatial clustering software should avoid using standard non-spatial clustering algorithms in which spatial proximity is merely represented by an additional weighting factor in favour of explicitly spatial algorithms.

This is still one of the more proscriptive of all the standards, but, like all the others, it was primarily informed by empirical examination of good and bad practices in actual software. It merely aims to discourage bad practices as much as possible, and to encourage developers to at least think of alternatives. Do you think it should just be removed?
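To make explicit the kind of practice being discouraged (a hedged, self-contained sketch; the weighting factor is deliberately arbitrary, which is part of the problem):

```r
set.seed(1)
coords <- cbind(runif(100), runif(100))
vals <- rnorm(100)

# Discouraged pattern: spatial proximity enters a non-spatial algorithm only
# through an arbitrary weighting of the coordinate columns.
w <- 2 # arbitrary spatial weighting factor
km <- kmeans(cbind(scale(coords) * w, scale(vals)), centers = 4)

# Explicitly spatial alternatives (for example, contiguity-constrained
# clustering such as spdep::skater()) encode spatial structure directly,
# without any such arbitrary weighting.
```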

Additional explanatory note added immediately afterwards with link to the definition of "broadcasting".

Having read the reference you included above, I see what you mean. From a design-based perspective, sampling test and training data from a common region may indeed be perfectly okay. I nevertheless think there is a role for a standard which encourages developers to think explicitly about the (spatial) relationships between test and training data sets, and to avoid potential pitfalls arising through exclusively generating the two as random samples from the same region.

How about something like the following?

SP3.6 Spatial machine learning software should ensure that test and training data are generated using sampling procedures appropriate to the domain or intended use of that software.

SP3.6a The effects of generating test and training data using inappropriate sampling procedures should be documented and/or tested.

We note that there is no comparable General Standard for Machine Learning Software, but that such a distinction is particularly important for spatial machine learning software, because it is frequently inappropriate to distinguish test and training data by taking samples from the same spatial region. One common method employed to generate distinct test and training data is spatial partitioning. There may nevertheless be cases in which such sampling from a common spatial region is appropriate, for example for software intended to analyse or model temporally-structured spatial data, for which a more appropriate distinction might be temporal rather than spatial. Adherence to this standard merely requires that the distinction between test and training data be explicitly considered and documented (and possibly tested as well).

I have updated to that form for now in the hope that that allays some of your concerns, but would be open to any further feedback.
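For what it's worth, the distinction the standard asks developers to consider can be sketched in a few lines of base R (illustrative only; the 4 x 4 grid and hold-out sizes are arbitrary):

```r
set.seed(1)
n <- 500
xy <- cbind(x = runif(n), y = runif(n))

# (a) random split: test points remain interspersed among training points
test_random <- sample(n, size = n / 5)

# (b) spatial block split: hold out entire cells of a coarse grid
block <- interaction(cut(xy[, "x"], 4), cut(xy[, "y"], 4))
held_out <- sample(levels(block), size = 3)
test_spatial <- which(block %in% held_out)

# The two test sets have very different spatial relationships to their
# training data; the standard only asks that this choice be documented
# (and possibly tested), not that either option be mandated.
```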

Updated accordingly; thanks.

Good examples, and so the standard has been weakened to now state that software should,

SP5.2 Ensure that axis labels include appropriate units.
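A minimal illustration of that (base R only; writing the units directly into the labels is the simplest route, though packages such as units can also carry them with the data):

```r
# Units written directly into the axis labels; coordinate values are made up.
easting  <- runif(10, 500000, 510000)
northing <- runif(10, 5200000, 5210000)
plot(easting, northing,
     xlab = "Easting [m]", ylab = "Northing [m]",
     main = "Hypothetical projected (UTM) coordinates")
```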

That's a good point, but also - if it is to be included - an entirely general one that would belong in the General Standards; even there, it would be hard to formulate in a way that wasn't partly intended as a discouragement. That being so, I'll leave it for now, but feel free to PR anytime if you do see a need for such a statement.


Thank you very much @edzer for this very insightful and helpful feedback! To help even further, could you now provide at least some brief responses to the direct questions I raised above? Thanks! And anybody else who happens to see this: please also feel free to have a look over the standards and offer feedback.

edzer commented 3 years ago

SP2.4a Software should not permit coordinate reference systems to be defined on the basis of so-called "PROJ4-strings" alone.

Any further suggestions for improvement?

I would replace "defined on the basis of" by "represented merely by" (and drop "alone"), and add: "but use at least WKT2".

In the 3.6 area I think you try to write standards about how to do research, not how to write software. I don't think that will be helpful when writing or reviewing software. Software can't tell what kind of sampling design underlies a dataset, if any.

mpadge commented 3 years ago

Thanks again, SP2.4 revised as you suggested. I appreciate your comments on SP3.6, and agree that this is more like standardising how to do research than how to write software. I suspect in response that we should remove that standard, but will first seek some broader feedback on that general issue of research versus software design, which raises an important general point. Thanks!