r-spatial / sf

Simple Features for R
https://r-spatial.github.io/sf/
Other
1.34k stars 297 forks source link

Testing `st_read` #168

Closed barryrowlingson closed 7 years ago

barryrowlingson commented 7 years ago

Someone on StackOverflow was looking for shapefiles to test something (not R) and was pointed to a set of test shapefiles and other spatial data in the GDAL repository:

https://trac.osgeo.org/gdal/browser/trunk/autotest/ogr/data

So I downloaded that repository and tried st_read on them all. Here's my simple test code, run it from the folder with the shapefiles in:

 shps = data.frame(path=list.files(".","*.shp"), checked=FALSE, read_ok=NA, stringsAsFactors=FALSE)
 testem  = function(shps){for(i in 1:nrow(shps)){shps$checked[i]=TRUE;try({st_read(shps$path[i]);shps$read_ok[i]=TRUE})};return(shps)}
 shps = testem(shps)

Note this is just a test for reading, it doesn't test if it reads correctly.

The ones that fail are:

> shps$path[is.na(shps$read_ok)]
[1] "buggymultiline.shp"  "buggymultipoint.shp" "buggymultipoly.shp" 
[4] "buggymultipoly2.shp" "buggypoint.shp"      "emptymultiline.shp" 
[7] "emptymultipoint.shp" "emptymultipoly.shp" 

The "buggy" ones are (I think) invalid shapefiles and produce a "GDAL Error 1 Corrupted shape file" on st_read and via ogrinfo on the command line, so I think that's correct behaviour.

st_read copes with the "empty.shp" shapefile, but not the other empty ones, returning:

> st_read("emptymultipoly.shp")
Reading layer `emptymultipoly' from data source `/nobackup/rowlings/Downloads/gdal/autotest/ogr/data/emptymultipoly.shp' using driver `ESRI Shapefile'
Error in CPL_read_ogr(dsn, layer, as.character(options), quiet, iGeomField -  : 
  NULL pointer in to_multi_what: invalid input?

but my ogrinfo seems happy, running with no errors:

$ ogrinfo -al emptymultipoly.shp
INFO: Open of `emptymultipoly.shp'
      using driver `ESRI Shapefile' successful.

Layer name: emptymultipoly
Geometry: Polygon
Feature Count: 1
Extent: (100.000000, 0.000000) - (103.000000, 3.000000)
Layer SRS WKT:
(unknown)
FID: Integer64 (11.0)
NAME: String (60.0)
OGRFeature(emptymultipoly):0
  FID (Integer64) = 2
  NAME (String) = MultiPoly 1
$

So that might be a bug.

The wider question is whether it is worth implementing a fuller test suite against these and all the other spatial data files in the ogr test set? There's kml, gml, sqlite....

I'm not sure there's an argument for implementing all the tests in the GDAL OGR test suite, since you will probably be testing your GDAL more than your sf, but at least a set of shapefiles covering all geometries, dimensions (2D/3D), measurements (M and no-M), emptiness and pathologies might be useful.

edzer commented 7 years ago

I see

$ ogrinfo -al emptymultipoly.shp 
FAILURE:
Unable to open datasource `emptymultipoly.shp' with the following drivers.

and

> s = st_read("emptymultipoly.shp", "emptymultipoly")
Cannot open data source /tmp/emptymultipoly.shp
Error in CPL_read_ogr(dsn, layer, as.character(options), quiet, iGeomField -  : 
  Open failed.

which is, at least, consistent with eachother.

> library(sf)
Linking to GEOS 3.5.0, GDAL 2.1.0, proj.4 4.9.2

and sf from github. @mdsumner didn't you try something similar?

barryrowlingson commented 7 years ago

My GDAL that reads emptymultipoly.shp via ogrinfo okay is:

$ ogrinfo --version
GDAL 2.1.0, released 2016/04/25

and the sf version is:

> require(sf)
Loading required package: sf
Linking to GEOS 3.5.0, GDAL 2.1.0, proj.4 4.8.0
> packageVersion("sf")
[1] ‘0.2.9’

With R 3.3.0 on Linux Mint 17 Qiana - Ubuntu Trusty, kernel 3.13.0-24.

Curiouser and curiouser...

etiennebr commented 7 years ago

On empty geometries, from gdal autotests:

# Test empty multipoint, multiline, multipolygon.
# From GDAL 1.6.0, the expected behaviour is to return a feature with a NULL geometry
mdsumner commented 7 years ago

I previously set this up but didn't go far with it:

https://github.com/mdsumner/sf.autotest

It was triggered here: https://github.com/edzer/sfr/issues/105

sf.autotest could be useful for getting early indications of possible problems, but also could be used to make a nice summary of what is in the autotest suite.

It would be fun to scan all CRAN packages too to make a map of the "spatialized ecosystem" out there.

I had also started on code to download all CRAN packages, inspired by crandatapkgs and search for Spatial classes but got stuck at a few points. I'd love to build a big online report to browse the available data sets in CRAN, and have basic reports with maps and tables via leaflet and ogr/gdalinfo summary outputs . . . this is obviously where R is going in the "do everything well" department.

barryrowlingson commented 7 years ago

@edzer are you sure you are in the right working directory? Those errors look like "file not found" errors...

mdsumner commented 7 years ago

This is what I see

b1e3465b

library(sf)
#Linking to GEOS 3.5.1, GDAL 2.1.2, proj.4 4.9.3
 st_read(f)
Reading layer `emptymultipoly' from data source `/perm_storage/home/mdsumner/gdal_coverage/autotest/ogr/data/emptymultipoly.shp' using driver `ESRI Shapefile'

 Error in CPL_read_ogr(dsn, layer, as.character(options), quiet, iGeomField -  : 
  NULL pointer in to_multi_what: invalid input? 
system(sprintf("ogrinfo -al %s", f))
INFO: Open of `/perm_storage/home/mdsumner/gdal_coverage/autotest/ogr/data/emptymultipoly.shp'
      using driver `ESRI Shapefile' successful.

Layer name: emptymultipoly
Geometry: Polygon
Feature Count: 1
Extent: (100.000000, 0.000000) - (103.000000, 3.000000)
Layer SRS WKT:
(unknown)
FID: Integer64 (11.0)
NAME: String (60.0)
OGRFeature(emptymultipoly):0
  FID (Integer64) = 2
  NAME (String) = MultiPoly 1
devtools::session_info()
Session info ------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.2 (2016-10-31)
 system   x86_64, linux-gnu           
 ui       RStudio (1.0.136)           
 language (EN)                        
 collate  en_AU.UTF-8                 
 tz       Australia/Hobart            
 date     2017-01-10                  

Packages ----------------------------------------------------------------------------
 package  * version     date       source                          
 DBI        0.5-1       2016-09-10 cran (@0.5-1)                   
 devtools   1.12.0.9000 2016-12-09 Github (hadley/devtools@1ce84b0)
 digest     0.6.11      2017-01-03 cran (@0.6.11)                  
 memoise    1.0.0.9001  2016-12-29 Github (hadley/memoise@884d565) 
 pkgbuild   0.0.0.9000  2016-11-18 Github (r-pkgs/pkgbuild@65eace0)
 pkgload    0.0.0.9000  2016-11-11 Github (r-pkgs/pkgload@def2b10) 
 Rcpp       0.12.8.4    2017-01-08 Github (RcppCore/Rcpp@8ac38d7)  
 sf       * 0.2-9       2017-01-10 local                           
 udunits2   0.13        2016-11-17 cran (@0.13)                    
 units      0.4-1       2016-12-09 cran (@0.4-1)                   
 withr      1.0.2       2016-06-20 CRAN (R 3.3.0)       
edzer commented 7 years ago

@barryrowlingson

> file.exists("emptymultipoly.shp", "emptymultipoly.shx", "emptymultipoly.dbf")
[1] TRUE TRUE TRUE
barryrowlingson commented 7 years ago

@edzer Edzer, sorry I accused you of making such a trivial error! But weird... I assume you can read the non-pathological shapefiles in the test folder?

Yesterday I created a docker container based on rocker/ropensci and built gdal 2.2.0dev out of github, it could read the empty*.shp files no problem with ogrinfo. Then I compiled sf in that container and got the same behaviour as I initially described. If my docker-fu was stronger I'd probably be able to push the container back to docker hub for everyone to try, but it was fairly easy to do.

Edzer's errors remain a mystery. Its not a GDAL without shapefile drivers is it?

edzer commented 7 years ago

OK, I get this now too - downloaded the wrong files (pretty convoluted pages, at osgeo).

Things look much worse (segfault) if you

st_read("emptymultipoly.shp", promote_to_multi=FALSE)

I now (https://github.com/edzer/sfr/commit/78f7e8d6d773420eccd20b0e887d0c9bd32a55e0) catch the NULL pointer earlier, but have no idea how to deal with it. On a null pointer p, you can't ask whether it is p->IsEmpty().

mdsumner commented 7 years ago

git clone on the GDAL autotest repo is straight forward, I will try to resurrect sf.autotest, constant regression testing would be great and Even has shown how to cover the variety of GDAL builds.

This is probably a r-hub worthy task, doing it properly is out of my league

edzer commented 7 years ago
> readOGR("emptymultipoly.shp", "emptymultipoly")
Error in readOGR("emptymultipoly.shp", "emptymultipoly") : 
  no features found
In addition: Warning message:
In ogrInfo(dsn = dsn, layer = layer, encoding = encoding, use_iconv = use_iconv,  :
  ogrInfo: all features NULL
rsbivand commented 7 years ago

See SEXP ogrFIDs from line 276 in rgdal/src/ogrsource.cpp. It steps through all the features if it does not trust poLayer->GetFeatureCount(). It looks as though the test doesn't tackle > 2G numbers of features.

edzer commented 7 years ago

In sf, I'm trying to stick with the simple feature standard. Except for points, all geometries can be empty, because they start with a number home many things are coming, and this number can be zero. sf supports those, e.g. by

> st_polygon(dim = "XY")
POLYGON()

and gdal lets you test whether a geometry is empty. What we see now is that a geometry, when read in gdal, returns not a (testable) empty geometry, but a NULL pointer. Although one could argue that this must be an empty geometry, I'm inclined to put that burden to gdal and just reject it in sf, as we do now. After all, there's no way to ask a NULL pointer what its dimensions are, or its geometry type.

etiennebr commented 7 years ago

Would it be possible to still return the non-spatial attributes and mock an empty geometry column? I have no experience with GDAL bindings, but I see they test feat and feat.GetGeometryRef():

if feat is None:
  return 'fail'
if feat.GetGeometryRef() is not None:
  return 'fail'

Which (I'm guessing) means other attributes are available?

mdsumner commented 7 years ago

I'd like to see this too, it's been discussed before.

@edzer clarified that sf is about simple features rather than being about GDAL, so to me that means the GDAL part belongs in another package. I think this is out of sf scope, though ideally a common GDAL package would be good. Currently sf is carrying the burden for GDAL, where really it should be outside - but we can't expect Edzer to do that as well :) I've actually been thinking of putting this up on the RConsortium wishlist.

edzer commented 7 years ago

@etiennebr I didn't run this (did you?) but guess that feat is None, so the second statement is not reached. You can always try to read the .dbf with attributes using foreign::read.dbf. @mdsumner you surprise me here, since you were one of the supporting authors of the sfr proposal, and I'm curious what your arguments are to make a convincing case. Do you also want to make a geos package, and a proj package?

mdsumner commented 7 years ago

I don't see it as contrary to the goals of the sfr proposal to want to have things modularized and more general. (Though, I realize it wouldn't look good to go to that wishlist, I wasn't thinking clearly there. )

A wonderful outcome of this project is seeing how concise the GDAL bindings are in modern form, and so replicating that for more general use of GDAL doesn't seem terribly difficult, for me it's just a matter of other priorities to get to it. (And it's a really long list that keeps growing, I wish I could focus only on sf much more!).

I just see that as out of scope of sfr given its goals, but it "would be nice" if sf imported a more general external package for GDAL itself. There are many extras that GDAL has, generic table read and SQL-injection for many providers, network models, and the utilities are now available in the library etc. Does that clarify?

Presumably we can use the LinkingTo capabilities anyway, and perhaps it's not a good idea to separate them.

Personally for GEOS, no I can't see wanting to use that outside of sf. PROJ.4 yes, but proj4 is already perfectly fine.

etiennebr commented 7 years ago

@edzer, I didn't run it either, but I was supposing the test was passing; feat was not None (valid pointer), so feat.GetGeometryreturned None. I haven't looked further, so I don't know where exactly the GDAL error got raised in sf. Just thought I'd mention that GetFeature() probably returns a valid pointer if it's useful. read.dbf could work on shapefiles, but I'm guessing other formats could provide empty geometries.

edzer commented 7 years ago

I repeat: if GDAL returns a NULL pointer, we have no geometry. If it returns an empty geometry, it returns a geometry, which is not NULL, and can be queried for a type. I don't see any point in continuing this discussion, but if you do, please do so after a working PR that resolves this so I can see what you mean (or you can see what I mean).

rsbivand commented 7 years ago

Please see the retrieve_ABS_null= argument in maptools::.Map2PolyDF() in maptools/R/SpatialPolys-methods.R. The Australian Bureau of Statistics stuffed non-locatable data into shapefile dbfs, hence the work-around there.

edzer commented 7 years ago
> readShapePoly("emptymultipoly.shp", retrieve_ABS_null=TRUE, delete_null_obj=TRUE)
  FID        NAME
1   2 MultiPoly 1
> foreign::read.dbf("emptymultipoly.dbf")
  FID        NAME
1   2 MultiPoly 1