[Bug] Writing to shapefile loses data on "ambiguous" column names

lbusett commented 7 years ago

Hi,

with reference to https://github.com/r-spatial/sf/issues/464, I found a slighlty different "glitch", probably due to the ESRI driver.

In practice, st_write suffers loss of data in case of "ambigous" column names (e.g., columns with the same name, one uppercase and one lowercase) (tested on Windows - don't know about linux).

REPREX:

library(sp)
library(sf)
#> Warning: package 'sf' was built under R version 3.4.2
#> Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
data(meuse, package = "sp")
meuse_sf = sf::st_as_sf(meuse, coords = c("x", "y"), crs = 28992, 
                        agr = "constant")[1:3,1:3]
names(meuse_sf)[3] <- "COPPER"
outfile <- tempfile(fileext = ".shp")
sf::st_write(meuse_sf, outfile)

#> Writing layer `file2d7032e95391' to data source `C:\Users\LB_LAP~1\AppData\Local\Temp\RtmpMRqEp7\file2d7032e95391.shp' using driver `ESRI Shapefile'
#> Warning in CPL_write_ogr(obj, dsn, layer, driver,
#> as.character(dataset_options), : GDAL Message 6: Normalized/laundered field
#> name: 'COPPER' to 'COPPER_1'
#> features:       3
#> fields:         3
#> geometry type:  Point

reread <- sf::st_read(outfile, quiet = TRUE)
head(reread)
#> Simple feature collection with 3 features and 3 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 181025 ymin: 333537 xmax: 181165 ymax: 333611
#> epsg (SRID):    NA
#> proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
#>   cadmium copper COPPER_1              geometry
#> 1    11.7    299       NA POINT (181072 333611)
#> 2     8.6    277       NA POINT (181025 333558)
#> 3     6.5    199       NA POINT (181165 333537)

This can be easily solved by disambiguating somehow the column names beforehand:

if(length(dupl <- which(duplicated(tolower(names(meuse_sf)))))) {
    names(meuse_sf)[dupl] <- paste0(names(meuse_sf)[dupl], "_1")
}
sf::st_write(meuse_sf, outfile,delete_layer = TRUE)
reread <- sf::st_read(outfile, quiet = TRUE)
head(reread)
#> Simple feature collection with 3 features and 3 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 181025 ymin: 333537 xmax: 181165 ymax: 333611
#> epsg (SRID):    NA
#> proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
#>   cadmium copper COPPER_1              geometry
#> 1    11.7     85      299 POINT (181072 333611)
#> 2     8.6     81      277 POINT (181025 333558)
#> 3     6.5     68      199 POINT (181165 333537)

tim-salabim commented 7 years ago

(Why) Do you need to stick with shp? There are surely better alternatives. http://switchfromshapefile.org, especially gpkg. And write speed is not an issue anymore (neither in sf nor rgdal) https://github.com/r-spatial/sf/issues/470

lbusett commented 7 years ago

No, I don't need to stick with shp. But since (maybe unfortunately) shp is still one of the most used vector formats I thought it was worth mentioning (even if this is somehow a "corner case") .

edzer commented 7 years ago

There are many more limitations in shapefiles. I think that if you want to work with shapefiles, you'd better know about these limitations; I don't think it is a good idea to write R software that tries to warn you for all the consequences of using a format for which there are much better alternatives today.

lbusett commented 7 years ago

That's clearly your call. I just think worth mentioning that it's not that I "want" to work with shapefiles. The fact is however that for most "non programmer" people I work with shp is the de-facto standard with which they are used to work. Therefore, 1) I continuously have to deal with shapefiles created by other people, and 2) when I have to share something, I usually tend to save end-results of an analysis as a shapefile (That's how I noticed the issue, by the way). I know it's not ideal, but that is the current situation (at least for me).

rsbivand commented 7 years ago

The "ESRI Shapefile" driver does the same thing in rgdal, which sees the same output with the field name appended with "_1" and the values NA. This is the first "Creation Issues" bullet point:

"Attribute names can only be up to 10 characters long. Starting with version 1.7, the OGR Shapefile driver tries to generate unique field names. Successive duplicate field names, including those created by truncation to 10 characters, will be truncated to 8 characters and appended with a serial number from 1 to 99.

For example:

a → a, a → a_1, A → A_2;
abcdefghijk → abcdefghij, abcdefghijkl → abcdefgh_1"

in the driver description. The driver is behaving as designed, and adding even more workarounds around the documented driver only introduces more confusion.

I will look at why NAs are written, though - it appears that poFeature->SetField() takes a field name that is not case sensitive for the ESRI Shapefile driver, so overwrites the "copper" values with the "COPPER" values, and then does not assign any values to the "COPPER" field.

Time to move to GPKG, as also ESRI would advise ...

lbusett commented 7 years ago

Ok, duly noted. Just thought it could be useful to know about the (possible) problem. Let's say in the future I'll try to advocate switching to GPKG at least in my workplace, then!

rsbivand commented 7 years ago

rgdal 1.2-16, svn revision: 694 on R-Forge now writes per feature using field count, not field name (around lines 858 in src/OGR_write.cpp). I would appreciate checks - I'll submit to win-builder to generate a Windows binary.

rsbivand commented 7 years ago

Winbuilder Windows binary of rgdal_1.2-16 on the landing page here. What we need is feedback on whether this works for you, and whether anyone else sees regressions.

rsbivand commented 7 years ago

sf_write is using SetField(nm[j], ... where nm is a string vector. @edzer, maybe using j instead of nm[j] would prevent the caseless overwriting of copper by COPPER? I agree that special-casing shapefiles is a waste of time going forward, but this issue with this driver actually loses data. Anyone feel like asking on gdal-dev?

edzer commented 7 years ago

That sounds like a good suggestion.

lbusett commented 7 years ago

Hi,

I can confirm that on rgdal_1.2-16 the problem seems solved:

> meuse_sp
class       : SpatialPointsDataFrame 
features    : 3 
extent      : 181025, 181165, 333537, 333611  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.2369,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +units=m +no_defs 
variables   : 3
names       : cadmium, copper, lead 
min values  :     6.5,     68,  199 
max values  :    11.7,     85,  299 

> names(meuse_sp)[3] = "COPPER"
> writeOGR(meuse_sp, dsn = tempdir(), basename(tempfile), driver = "ESRI Shapefile")
> readOGR(dsn = tempdir(), basename(tempfile))

OGR data source with driver: ESRI Shapefile 
Source: "C:\Users\LB_LAP~1\AppData\Local\Temp\RtmpeSZiD8", layer: "file3e881e887780.shp"
with 3 features
It has 3 fields
class       : SpatialPointsDataFrame 
features    : 3 
extent      : 181025, 181165, 333537, 333611  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs 
variables   : 3
names       : cadmium, copper, COPPER_1 
min values  :     6.5,     68,      199 
max values  :    11.7,     85,      299

edzer commented 7 years ago

I can confirm that https://github.com/r-spatial/sf/commit/15be970aa134e72f6a3f4973c480c1e75c8392db also fixes this for st_write. Thanks!

lbusett commented 7 years ago

Thanks !

rsbivand commented 6 years ago

In rgdal, the fix appears to break the GPX driver. Re-checking with fix conditioned on "ESRI Shapefile" only - I think it is too complicated to find other odd drivers. Reprex (rgdal broke ASDAR 2ed chapter 4):

library(sf)
download.file("http://www.asdar-book.org/bundles2ed/die_bundle.zip", "die_bundle.zip")
unzip("die_bundle.zip")
Fires <- st_read("fires_120104.shp")
names(Fires)
Fires0 <- Fires[-unname(which(st_coordinates(Fires)[,2] < 0)),]
Fires$dt <- as.Date(as.character(Fires$FireDate), format="%d-%m-%Y")
Fires0 <- Fires[-unname(which(st_coordinates(Fires)[,2] < 0)),]
Fires1 <- Fires0[order(Fires0$dt),]
names(Fires1)[1] <- "name"
GR_Fires <- Fires1[Fires1$Country == "GR",]
st_write(GR_Fires, dsn="EFFIS.gpx", layer="waypoints", driver="GPX", dataset_options="GPX_USE_EXTENSIONS=YES")
GR <- st_read("EFFIS.gpx", "waypoints")
GR[1,c(5,24:28)]

The trouble seems to be created by GPX_USE_EXTENSIONS=YES, which puts the data columns in GR_Fires beyond the standard (but empty) GPX columns. So the fix works for shapefiles, but may not work as we guessed if drivers do their own things internally with numbering of fields in relation to field names - we cannot rely on them being in the same order.

rsbivand commented 6 years ago

I've tested the driver name against "ESRI Shapefile" in R, and passed that logical through a SEXP attribute to the top level writer, then as an int into the rgdal equivalent of SetFields(). Clunky, but restricts the use of field order to that driver, and reverting the regression to other drivers which can then rely on the field names rather than their order. Committed to R-Forge.

edzer commented 6 years ago

Thanks - same thing in sf now, checking driver only in C++ code.

r-spatial / sf

[Bug] Writing to shapefile loses data on "ambiguous" column names #538