Closed lbusett closed 7 years ago
(Why) Do you need to stick with shp? There are surely better alternatives. http://switchfromshapefile.org, especially gpkg. And write speed is not an issue anymore (neither in sf nor rgdal) https://github.com/r-spatial/sf/issues/470
No, I don't need to stick with shp. But since (maybe unfortunately) shp is still one of the most used vector formats I thought it was worth mentioning (even if this is somehow a "corner case") .
There are many more limitations in shapefiles. I think that if you want to work with shapefiles, you'd better know about these limitations; I don't think it is a good idea to write R software that tries to warn you for all the consequences of using a format for which there are much better alternatives today.
That's clearly your call. I just think worth mentioning that it's not that I "want" to work with shapefiles. The fact is however that for most "non programmer" people I work with shp is the de-facto standard with which they are used to work. Therefore, 1) I continuously have to deal with shapefiles created by other people, and 2) when I have to share something, I usually tend to save end-results of an analysis as a shapefile (That's how I noticed the issue, by the way). I know it's not ideal, but that is the current situation (at least for me).
The "ESRI Shapefile" driver does the same thing in rgdal, which sees the same output with the field name appended with "_1" and the values NA. This is the first "Creation Issues" bullet point:
"Attribute names can only be up to 10 characters long. Starting with version 1.7, the OGR Shapefile driver tries to generate unique field names. Successive duplicate field names, including those created by truncation to 10 characters, will be truncated to 8 characters and appended with a serial number from 1 to 99.
For example:
a → a, a → a_1, A → A_2;
abcdefghijk → abcdefghij, abcdefghijkl → abcdefgh_1"
in the driver description. The driver is behaving as designed, and adding even more workarounds around the documented driver only introduces more confusion.
I will look at why NAs are written, though - it appears that poFeature->SetField() takes a field name that is not case sensitive for the ESRI Shapefile driver, so overwrites the "copper" values with the "COPPER" values, and then does not assign any values to the "COPPER" field.
Time to move to GPKG, as also ESRI would advise ...
Ok, duly noted. Just thought it could be useful to know about the (possible) problem. Let's say in the future I'll try to advocate switching to GPKG at least in my workplace, then!
rgdal 1.2-16, svn revision: 694 on R-Forge now writes per feature using field count, not field name (around lines 858 in src/OGR_write.cpp). I would appreciate checks - I'll submit to win-builder to generate a Windows binary.
Winbuilder Windows binary of rgdal_1.2-16 on the landing page here. What we need is feedback on whether this works for you, and whether anyone else sees regressions.
sf_write is using SetField(nm[j], ... where nm is a string vector. @edzer, maybe using j instead of nm[j] would prevent the caseless overwriting of copper by COPPER? I agree that special-casing shapefiles is a waste of time going forward, but this issue with this driver actually loses data. Anyone feel like asking on gdal-dev?
That sounds like a good suggestion.
Hi,
I can confirm that on rgdal_1.2-16 the problem seems solved:
> meuse_sp
class : SpatialPointsDataFrame
features : 3
extent : 181025, 181165, 333537, 333611 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.2369,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +units=m +no_defs
variables : 3
names : cadmium, copper, lead
min values : 6.5, 68, 199
max values : 11.7, 85, 299
> names(meuse_sp)[3] = "COPPER"
> writeOGR(meuse_sp, dsn = tempdir(), basename(tempfile), driver = "ESRI Shapefile")
> readOGR(dsn = tempdir(), basename(tempfile))
OGR data source with driver: ESRI Shapefile
Source: "C:\Users\LB_LAP~1\AppData\Local\Temp\RtmpeSZiD8", layer: "file3e881e887780.shp"
with 3 features
It has 3 fields
class : SpatialPointsDataFrame
features : 3
extent : 181025, 181165, 333537, 333611 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
variables : 3
names : cadmium, copper, COPPER_1
min values : 6.5, 68, 199
max values : 11.7, 85, 299
I can confirm that https://github.com/r-spatial/sf/commit/15be970aa134e72f6a3f4973c480c1e75c8392db also fixes this for st_write
. Thanks!
Thanks !
In rgdal, the fix appears to break the GPX driver. Re-checking with fix conditioned on "ESRI Shapefile" only - I think it is too complicated to find other odd drivers. Reprex (rgdal broke ASDAR 2ed chapter 4):
library(sf)
download.file("http://www.asdar-book.org/bundles2ed/die_bundle.zip", "die_bundle.zip")
unzip("die_bundle.zip")
Fires <- st_read("fires_120104.shp")
names(Fires)
Fires0 <- Fires[-unname(which(st_coordinates(Fires)[,2] < 0)),]
Fires$dt <- as.Date(as.character(Fires$FireDate), format="%d-%m-%Y")
Fires0 <- Fires[-unname(which(st_coordinates(Fires)[,2] < 0)),]
Fires1 <- Fires0[order(Fires0$dt),]
names(Fires1)[1] <- "name"
GR_Fires <- Fires1[Fires1$Country == "GR",]
st_write(GR_Fires, dsn="EFFIS.gpx", layer="waypoints", driver="GPX", dataset_options="GPX_USE_EXTENSIONS=YES")
GR <- st_read("EFFIS.gpx", "waypoints")
GR[1,c(5,24:28)]
The trouble seems to be created by GPX_USE_EXTENSIONS=YES, which puts the data columns in GR_Fires beyond the standard (but empty) GPX columns. So the fix works for shapefiles, but may not work as we guessed if drivers do their own things internally with numbering of fields in relation to field names - we cannot rely on them being in the same order.
I've tested the driver name against "ESRI Shapefile" in R, and passed that logical through a SEXP attribute to the top level writer, then as an int into the rgdal equivalent of SetFields(). Clunky, but restricts the use of field order to that driver, and reverting the regression to other drivers which can then rely on the field names rather than their order. Committed to R-Forge.
Thanks - same thing in sf now, checking driver only in C++ code.
Hi,
with reference to https://github.com/r-spatial/sf/issues/464, I found a slighlty different "glitch", probably due to the ESRI driver.
In practice,
st_write
suffers loss of data in case of "ambigous" column names (e.g., columns with the same name, one uppercase and one lowercase) (tested on Windows - don't know about linux).REPREX:
This can be easily solved by disambiguating somehow the column names beforehand: