r-spatial / discuss

a discussion repository: raise issues, or contribute!

Feedback wanted on rOpenSci standards for spatial statistical software #48

Open mpadge opened 3 years ago

mpadge commented 3 years ago

rOpenSci is expanding its scope of peer-reviewed software to include specifically statistical software. One of the categories we will consider in-scope is spatial software, for which we have developed a suite of standards which we would like general feedback on. The standards can be viewed in the main project document. Please note the following:

  1. These statistical standards extend from our set of General Standards;
  2. The project explicitly addresses statistical software, which is a relatively small domain within the larger ecosystem of spatial software in R; we are particularly interested in standards for the core algorithms of spatial statistical software.

This issue is accordingly intended to seek answers to the following primary question:

Please comment here, in rOpenSci's discussion forum, directly via issues in the project repository (like this one from @jsta), or - the best way of all!! - by helping directly with the standards via a pull request to that repo (like this one from @Nowosad). Please also help by suggesting any other improvements you might see to any aspect of these standards, beyond the specific question of algorithms. Hoping for some constructive input from the r-spatial community!

tim-salabim commented 3 years ago

@mpadge Regarding spatial cross validation, you could add the following paper:

https://www.sciencedirect.com/science/article/abs/pii/S0304380019303230

edzer commented 3 years ago

@mpadge great, a lot of work went into that! Reviewing it is also a lot of work; here's a first shot at it:

Again: great work, and happy to discuss!

mpadge commented 3 years ago

First and foremost, thanks for declaring that,

for terminology regarding coordinate systems, reference systems etc I think this set of standards should not reinvent the wheel but follow the OGC definitions in WKT2; we're trying to do the same in our upcoming book (Ch 2)

That is indeed a really good point, and the whole nomenclature of coordinate systems has been revised to be as consistent with these WKT2 definitions as possible. That has the primary effect of generally replacing the former "rectilinear" with "Cartesian". These standards nevertheless reflect a general purview of spatial software beyond "geodetic" systems (WKT2 terminology), such that curvilinear systems may include, for example, celestial coordinate systems, and may even extend to systems of unknown dimensionality. In contrast, WKT2 defines no "umbrella" term for such systems, and merely declares that elliptical coordinate systems are geodetic, while spherical systems are commonly associated with geodetic systems. In lieu of an external reference, these standards employ the term "curvilinear", but please feel free to suggest alternatives.
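For anyone wanting to see that WKT2 vocabulary concretely, something like the following minimal sketch (assuming sf built against PROJ >= 6, where CRSs print in WKT2 form) shows the distinction between a geodetic CRS with an ellipsoidal coordinate system and a projected CRS with a Cartesian one:

```r
library(sf)

# Geodetic CRS: the WKT2 output is a GEOGCRS with an ellipsoidal CS
cat(st_crs("EPSG:4326")$wkt, "\n\n")

# Projected CRS (UTM zone 32N): the WKT2 output describes a Cartesian CS
cat(st_crs("EPSG:32632")$wkt, "\n")
```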

The statement is that,

a spherical geometry is a two- (or maybe three-)dimensional curvilinear space.

It can still be 3-D though (x,y,z), and is notably always so in celestial coordinate systems which are still spherical, so I presume that statement is okay?
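Just to illustrate that point (a base-R sketch on a unit sphere with angles in degrees; nothing package-specific assumed): any point given in spherical (longitude, latitude) coordinates has an equivalent 3-D Cartesian (x, y, z) form.

```r
# Spherical (lon, lat) coordinates always have a 3-D Cartesian equivalent.
# Unit sphere assumed; angles in degrees.
sph_to_cart <- function(lon, lat) {
    lon <- lon * pi / 180
    lat <- lat * pi / 180
    cbind(x = cos(lat) * cos(lon),
          y = cos(lat) * sin(lon),
          z = sin(lat))
}
sph_to_cart(lon = c(0, 90, 0), lat = c(0, 0, 90))
# rows map (approximately) to (1, 0, 0), (0, 1, 0), and (0, 0, 1)
```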

Terminology updated as noted above, so identity of rectilinear with Cartesian is now explicit.

Point amended accordingly; thanks!

Happy to accept that, and so have replaced 2.0b (see below) with a new version stating this expectation.

SP2.0b Class systems should ensure that functions error appropriately, rather than merely warning, in response to data from inappropriate spatial domains.
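As a concrete - and entirely hypothetical - sketch of what compliance with that revised SP2.0b might look like; the function name and bounds check here are illustrative only, not taken from any actual package:

```r
# Hypothetical input check illustrating SP2.0b: error via stop(), rather
# than merely warn, when data fall outside the spatial domain a function
# is designed for -- here, geodetic longitude/latitude.
check_geodetic_bounds <- function(lon, lat) {
    if (any(abs(lon) > 180) || any(abs(lat) > 90)) {
        stop("coordinates lie outside geodetic (longitude/latitude) bounds; ",
             "these may be projected coordinates passed to a function ",
             "expecting unprojected data", call. = FALSE)
    }
    invisible(TRUE)
}

check_geodetic_bounds(lon = c(4.9, 174.8), lat = c(52.4, -36.9)) # passes
# check_geodetic_bounds(lon = 594929, lat = 5793939)             # errors
```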

Good point - reference to time explicitly included now, thanks!

Also a good point, and so SP2.0 has been revised to now state that all spatial software should use class systems, and SP2.0b has been effectively entirely removed, replaced with the new version given above.

Thanks - those examples have been removed now regardless, because the revision of SP2.0 described above states that software should error rather than warn.

Good point which made me realise that simply removing that point makes that paragraph clearer and more focussed. Thanks! (And as you well know, my osmdata package was to a large extent an "attempt to reconstruct aspects of these generic libraries," so yeah, doing so should be generally admitted to be a good thing indeed.)

Thanks, amended accordingly.

Yeah, that was unclear terminology introduced by me in response to @Nowosad's comment that spatial packages should demonstrate loading data in "spatial data formats rather than R objects". Revised to now say that packages should include:

test code which loads those data in spatial formats, rather than R-specific binary formats such as .Rds.

Feel free to suggest improvements!
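To sketch what that might look like in practice (a hedged example only: it assumes sf, the testthat framework, and a hypothetical GeoPackage shipped in a hypothetical package's inst/extdata):

```r
library(testthat)
library(sf)

test_that("functions accept data loaded from spatial formats", {
    # "example.gpkg" and "mypackage" are placeholders for illustration
    f <- system.file("extdata", "example.gpkg", package = "mypackage")
    skip_if(f == "", "example data not installed")
    x <- st_read(f, quiet = TRUE)      # load from a spatial format, not .Rds
    expect_s3_class(x, "sf")
    # ... then call the package's own functions on `x` ...
})
```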

Amended accordingly, thanks.

Sorry for the lack of clarity - that was indeed what I was trying to convey, with the updated version now stating:

SP2.4a Software should not permit coordinate reference systems to be defined on the basis of so-called "PROJ4-strings" alone.

Any further suggestions for improvement?
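One way to see what gets lost (a sketch assuming sf with PROJ >= 6; exact output will vary by version): the WKT2 derived from a bare PROJ4-string typically lacks the CRS name and authority identifier carried by the full EPSG/WKT2 definition.

```r
library(sf)

cat(st_crs("EPSG:4326")$wkt)                  # named CRS, with ID["EPSG", 4326]
cat(st_crs("+proj=longlat +datum=WGS84")$wkt) # typically named "unknown", no ID
```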

Yes, that is true. That was an extension of the equivalent time-series standard which is itself really only enabled because of the mighty tsbox package. But conversion of spatial classes is far more intricate than time series, so happy to remove that standard.

Thanks, that has been removed, also because revised general standards now include sufficient expectation of that kind of ability to accept class-based columns in general tabular inputs anyway, so should be covered there.

Thanks for the reference - that standard definitely "takes a model-based perspective," which that paper clarifies very well. I nevertheless think that some form of that standard should be retained, importantly noting that it only states that software "should enable sampling procedures to be based on ... densities". The paragraph immediately following that standard declares that,

the standard merely suggests that software should enable such density-based samples to be taken, not that it must, or even necessarily should by default.

An important link is again with the time-series standards, for which covariance estimation in particular has seen great recent advances through routines for what is effectively density-based sampling of covariance ellipsoids, leading to the equivalent covariance standard being elevated into the General class. The sole motivation of both that one (G3.1) and SP3.2 is to (strongly) encourage developers to at least think about these issues. Alternative sampling procedures do not have to be implemented by default, and in the case of covariance, compliance with G3.1 merely requires that the function generating covariance estimates be minimally exposed as a parameter, so that calculation methods can be substituted as desired by users. The standards attempt to avoid recommending any particular methodological approach, and only take forms like these where we have perceived that the "default" behaviour (stats::cov() for covariance; design-based sampling in the current context) can be extended by offering alternative modes. Happy to discuss further!
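For concreteness, the kind of pattern G3.1 has in mind is simply this (an illustrative sketch only; fit_spatial_model() is a made-up name, not from any actual package, and MASS::cov.rob() is just one example of an alternative estimator):

```r
# Expose the covariance estimator as a parameter, defaulting to stats::cov,
# so that users can substitute alternative estimators without any change to
# the rest of the function.
fit_spatial_model <- function(x, cov_fun = stats::cov, ...) {
    stopifnot(is.function(cov_fun))
    V <- cov_fun(x, ...)
    # ... the remainder of the fit would use V ...
    V
}

m <- matrix(rnorm(200), ncol = 2)
fit_spatial_model(m)                                              # default
fit_spatial_model(m, cov_fun = function(x) MASS::cov.rob(x)$cov)  # robust alternative
```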

Fair enough, that was quite proscriptive as formulated, and has been revised to,

SP3.4 Where possible, spatial clustering software should avoid using standard non-spatial clustering algorithms in which spatial proximity is merely represented by an additional weighting factor in favour of explicitly spatial algorithms.

This is still one of the more proscriptive of all the standards, but, like all the others, it was primarily informed by empirical examination of good and bad practices in actual software. It merely aims to discourage bad practices as much as possible, and to encourage developers to at least think of alternatives. Do you think it should just be removed?
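To make explicit the kind of practice being discouraged (a hedged, self-contained sketch; the weighting factor is deliberately arbitrary, which is part of the problem):

```r
set.seed(1)
coords <- cbind(runif(100), runif(100))
vals <- rnorm(100)

# Discouraged pattern: spatial proximity enters a non-spatial algorithm only
# through an arbitrary weighting of the coordinate columns.
w <- 2 # arbitrary spatial weighting factor
km <- kmeans(cbind(scale(coords) * w, scale(vals)), centers = 4)

# Explicitly spatial alternatives (for example, contiguity-constrained
# clustering such as spdep::skater()) encode spatial structure directly,
# without any such arbitrary weighting.
```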

Additional explanatory note added immediately afterwards with link to the definition of "broadcasting".

Having read the reference you included above, I see what you mean. From a design-based perspective, sampling test and training data from a common region may indeed be perfectly okay. I nevertheless think there is a role for a standard which encourages developers to think explicitly about the (spatial) relationships between test and training data sets, and to avoid potential pitfalls arising through exclusively generating the two as random samples from the same region.

How about something like the following?

SP3.6 Spatial machine learning software should ensure that test and training data are generated using sampling procedures appropriate to the domain or intended use of that software.

SP3.6a The effects of generating test and training data using inappropriate sampling procedures should be documented and/or tested.

We note that there is no comparable General Standard for Machine Learning Software, but that such a distinction is particularly important for spatial machine learning software, because it is frequently inappropriate to distinguish test and training data by taking samples from the same spatial region. One common method employed to generate distinct test and training data is spatial partitioning. There may nevertheless be cases in which such sampling from a common spatial region is appropriate, for example for software intended to analyse or model temporally-structured spatial data, for which a more appropriate distinction might be temporal rather than spatial. Adherence to this standard merely requires that the distinction between test and training data be explicitly considered and documented (and possibly tested as well).

I have updated to that form for now in the hope that that allays some of your concerns, but would be open to any further feedback.
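For what it's worth, the distinction the standard asks developers to consider can be sketched in a few lines of base R (illustrative only; the 4 x 4 grid and hold-out sizes are arbitrary):

```r
set.seed(1)
n <- 500
xy <- cbind(x = runif(n), y = runif(n))

# (a) random split: test points remain interspersed among training points
test_random <- sample(n, size = n / 5)

# (b) spatial block split: hold out entire cells of a coarse grid
block <- interaction(cut(xy[, "x"], 4), cut(xy[, "y"], 4))
held_out <- sample(levels(block), size = 3)
test_spatial <- which(block %in% held_out)

# The two test sets have very different spatial relationships to their
# training data; the standard only asks that this choice be documented
# (and possibly tested), not that either option be mandated.
```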

Updated accordingly; thanks.

Good examples, and so the standard has been weakened to now state that software should,

SP5.2 Ensure that axis labels include appropriate units.
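A minimal illustration of that (base R only; writing the units directly into the labels is the simplest route, though packages such as units can also carry them with the data):

```r
# Units written directly into the axis labels; coordinate values are made up.
easting  <- runif(10, 500000, 510000)
northing <- runif(10, 5200000, 5210000)
plot(easting, northing,
     xlab = "Easting [m]", ylab = "Northing [m]",
     main = "Hypothetical projected (UTM) coordinates")
```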

That's a good point, but also - if it is to be included - an entirely general one that would belong in the General Standards; even there, it would be hard to formulate in a way that wasn't partly intended as a discouragement. That being so, I'll leave it for now, but feel free to PR anytime if you do see a need for such a statement.


Thank you very much @edzer for this very insightful and helpful feedback! To help even further, could you now provide at least some brief responses to the direct questions I raised above? Thanks! And anybody else who happens to see this: please also feel free to have a look over the standards and offer feedback.

edzer commented 3 years ago

SP2.4a Software should not permit coordinate reference systems to be defined on the basis of so-called "PROJ4-strings" alone.

Any further suggestions for improvement?

I would replace "defined on the basis of" by "represented merely by" (and drop "alone"), and add: "but use at least WKT2".

In the 3.6 area I think you try to write standards about how to do research, not how to write software. I don't think that will be helpful when writing or reviewing software. Software can't tell what kind of sampling design underlies a dataset, if any.

mpadge commented 3 years ago

Thanks again, SP2.4 revised as you suggested. I appreciate your comments on SP3.6, and agree that this is more like standardising how to do research than how to write software. I suspect in response that we should remove that standard, but will first seek some broader feedback on that general issue of research versus software design, which raises an important general point. Thanks!