r-spatial / sf

Simple Features for R
https://r-spatial.github.io/sf/

Encoding of layer names, attribute fields, text attributes #5

Closed rsbivand closed 7 years ago

rsbivand commented 8 years ago

The links to OGC documents point to the geometries, but don't cover the question of the encoding representation of file names, layer names, attribute field names or text attributes. There are some hooks in OGR/CPL for this, which should be aligned with the cross-platform mechanisms in R. GeoJSON seems to want Unicode, GPKG maybe UTF-8, etc.

edzer commented 7 years ago

This vignette has examples of how things are done by rgdal. There are also GDAL RFCs:

#ifdef CPL_RECODE_ICONV, defined in GDAL headers, indicates whether GDAL was compiled with iconv support.

rgdal/inst/etc/point.shp is an example, when read as UTF-8:

> (x <- st_read("point.shp", quiet = TRUE))
Simple feature collection with 1 feature and 3 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 17.7386 ymin: 49.612 xmax: 17.7386 ymax: 49.612
epsg (SRID):    32633
proj4string:    +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs
                NAZEV       X      Y              geometry
1 St°Ýte× nad Ludinou 17.7386 49.612 POINT(17.7386 49.612)
eivindhammers commented 7 years ago

What is the status of this issue? I'm having trouble reading layers in a GDB with string fields encoded in UTF-8 (e.g. "Côte d’Ivoire"). Using rgdal::readOGR([...], encoding = "UTF-8", use_iconv = TRUE) works well, but naively using the same options in st_read() throws

Error in st_sf(x, ..., agr = agr) : 
  no simple features geometry column present
edzer commented 7 years ago

Please provide a reproducible example, and I'll take a look; by email is also fine.

eivindhammers commented 7 years ago

Thanks. My rgdal OpenFileGDB driver doesn't support writing, so the MWE below is not a self-contained script. The example GDB can be downloaded here.

Note that options = "ENCODING=UTF-8" works when st_read-ing a .shp, but not GDB.

library(sf)
setwd(path_to_gdb)
civ <- st_read(dsn = "CIV.gdb.zip")
civ2 <- st_read(dsn = "CIV.gdb.zip", encoding = "UTF-8", use_iconv = TRUE)
civ3 <- st_read(dsn = "CIV.gdb.zip", options = "ENCODING=UTF-8")
print(civ$COUNTRY[1])
print(civ3$COUNTRY[1])
edzer commented 7 years ago

The second command gives an error for me; encoding and use_iconv are not parameters for st_read, or for st_as_sf, to which ... is passed on; this is all documented. I see

> print(civ$COUNTRY[1])
[1] Côte d'Ivoire
Levels: Côte d'Ivoire
> print(civ3$COUNTRY[1])
[1] Côte d'Ivoire
Levels: Côte d'Ivoire

which looks good to me, but this is not the case for you? What does your sessionInfo() give?

eivindhammers commented 7 years ago

I was aware that passing encoding and use_iconv to st_read would not work, so not sure why I included it in the MWE.

I got

> print(civ3$COUNTRY[1])
[1] Côte d'Ivoire
Levels: Côte d'Ivoire

Can't reproduce it on a Mac, so forgive me for what turns out to be something of an RTFM problem. sessionInfo() on my Windows PC returns Norwegian (Bokmål)_Norway.1252 for all locale settings. Supplying stringsAsFactors = FALSE to st_read and declaring Encoding(civ$COUNTRY) gives the desired output.

> print(civ$COUNTRY[1])
[1] "Côte d'Ivoire"

I guess my initial question boils down to this: is an equivalent of an encoding argument in st_read possible (and/or desirable) for GDAL drivers that, unlike the shapefile driver, don't support an ENCODING option?

edzer commented 7 years ago

I guess we should try CPLRecode on all strings, then, when an encoding needs to be set; alternatively, there is iconv on the R side.

rsbivand commented 7 years ago

On balance it is hard to know what to advise: when reading, the source encoding is often not known, and the locale (and default encoding) of the target (reading) platform is often unknown to the user. GDAL/OGR did try to handle this by converting to UTF-8, but this is also far from guaranteed. There is an rgdal vignette about encoding as things stood in, I think, 2008.
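
Since the reading platform's locale is often unknown to the user, it may help to inspect it before guessing at conversions. A minimal sketch using base R only; the helper name describe_native_encoding is made up for illustration, and the codepage entry is assumed to be reported only on Windows:

```r
# Summarise the encoding R treats as "native" in the current session.
describe_native_encoding <- function() {
  info <- l10n_info()
  codepage <- info[["codepage"]]  # NULL on platforms that do not report one
  list(
    ctype     = Sys.getlocale("LC_CTYPE"),
    is_utf8   = info[["UTF-8"]],
    is_latin1 = info[["Latin-1"]],
    codepage  = if (is.null(codepage)) NA_integer_ else codepage
  )
}

str(describe_native_encoding())
```

On a session like the Windows ones reported in this thread, is_utf8 would presumably be FALSE, which is when unmarked UTF-8 strings start to print incorrectly.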

dpprdan commented 7 years ago

What is the status concerning this? Or rather, what is the best practice for reading shapefiles with windows-1252/CP-1252 encoding? This is not it:

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
url_shp <- "https://www.suche-postleitzahl.org/download_v1/webmercator/mittel/plz-5stellig/shapefile/points/plz-5stellig-centroid.shp.zip"
download.file(url_shp, "plz-5stellig-centroid.zip")

unzip("plz-5stellig-centroid.zip")
plz_sf <-
  read_sf(
    "plz-5stellig-centroid.shp",
    options = "ENCODING=windows-1252",
    stringsAsFactors = FALSE
  )

ENCODING=windows-1252 may not be supported for Shapefiles?

plz_sf$note[42] # plz_sf[plz_sf$plz == "01609",]
#> [1] "01609 Gröditz, Wülknitz, Röderaue"

Encoding() also does not seem to work

Encoding(plz_sf$note) <- "windows-1252"
plz_sf$note[42]
#> [1] "01609 Gröditz, Wülknitz, Röderaue"

Session info

``` r
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value
#>  version  R version 3.4.0 (2017-04-21)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.1252
#>  tz       Europe/Berlin
#>  date     2017-05-11
#> Packages -----------------------------------------------------------------
#>  package   * version date       source
#>  backports   1.0.5   2017-01-18 CRAN (R 3.3.2)
#>  base      * 3.4.0   2017-04-21 local
#>  compiler    3.4.0   2017-04-21 local
#>  datasets  * 3.4.0   2017-04-21 local
#>  devtools    1.13.0  2017-05-08 CRAN (R 3.4.0)
#>  digest      0.6.12  2017-01-27 CRAN (R 3.3.2)
#>  evaluate    0.10    2016-10-11 CRAN (R 3.3.1)
#>  graphics  * 3.4.0   2017-04-21 local
#>  grDevices * 3.4.0   2017-04-21 local
#>  htmltools   0.3.6   2017-04-28 CRAN (R 3.4.0)
#>  knitr       1.15.1  2016-11-22 CRAN (R 3.3.2)
#>  magrittr    1.5     2014-11-22 CRAN (R 3.3.0)
#>  memoise     1.1.0   2017-04-21 CRAN (R 3.3.3)
#>  methods   * 3.4.0   2017-04-21 local
#>  Rcpp        0.12.10 2017-03-19 CRAN (R 3.3.3)
#>  rmarkdown   1.5     2017-04-26 CRAN (R 3.3.3)
#>  rprojroot   1.2     2017-01-16 CRAN (R 3.3.2)
#>  stats     * 3.4.0   2017-04-21 local
#>  stringi     1.1.5   2017-04-07 CRAN (R 3.3.3)
#>  stringr     1.2.0   2017-02-18 CRAN (R 3.3.3)
#>  tools       3.4.0   2017-04-21 local
#>  utils     * 3.4.0   2017-04-21 local
#>  withr       1.0.2   2016-06-20 CRAN (R 3.3.1)
#>  yaml        2.1.14  2016-11-12 CRAN (R 3.3.2)
```

edzer commented 7 years ago

On my computer, running ubuntu, both

> unique(read_sf("plz-5stellig-centroid.shp",stringsAsFactors = FALSE)$note)

and

unique(read_sf("plz-5stellig-centroid.shp",options = "ENCODING=UTF-8",stringsAsFactors = FALSE)$note)

gave comprehensible (readable, German-looking) output, e.g.

[8176] "99986 Vogtei, Kammerforst u.a."                                                      
[8177] "99988 Südeichsfeld"                                                                  
[8178] "99991 Großengottern, Heroldishausen"                                                 
[8179] "99994 Schlotheim"                                                                    
[8180] "99996 Menteroda, Obermehler"                                                         
[8181] "99998 Körner, Weinbergen"   

could you pls try that and report back? Pls also give your sessionInfo().

dpprdan commented 7 years ago

Sorry, my bad: the file is UTF-8, not CP1252/windows-1252. This does not change the outcome, though.

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
url_shp <- "https://www.suche-postleitzahl.org/download_v1/webmercator/mittel/plz-5stellig/shapefile/points/plz-5stellig-centroid.shp.zip"
download.file(url_shp, "plz-5stellig-centroid.zip")
unzip("plz-5stellig-centroid.zip")
plz_sf <-
  read_sf(
    "plz-5stellig-centroid.shp",
    options = "ENCODING=UTF-8",
    stringsAsFactors = FALSE
  )

unique(plz_sf)$note[8176:8181]
#> [1] "99986 Vogtei, Kammerforst u.a."       "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                    
#> [5] "99996 Menteroda, Obermehler"          "99998 Körner, Weinbergen"  

Encoding() actually does work with UTF-8. (Note that the output is the actual one from my console, since there also seems to be an encoding issue with the reprex package with which I made this MWE).

Encoding(plz_sf$note) <- "UTF-8"
plz_sf$note[8176:8181] 
#> [1] "99986 Vogtei, Kammerforst u.a."      "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                   
#> [5] "99996 Menteroda, Obermehler"         "99998 Körner, Weinbergen"  

But that is a bit tedious for multiple columns (or rather I wouldn't know how to do this with sapply or purrr, but that is not an sf problem, of course).
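
For the multi-column case, a plain loop over the column names may be enough. The sketch below is only an illustration on an ordinary data frame; the helper name declare_utf8 is invented here, and it is an assumption (not checked in this thread) that the same loop behaves identically on an sf object:

```r
# Declare (not convert) UTF-8 on every character column of a data frame.
declare_utf8 <- function(df) {
  for (nm in names(df)) {
    if (is.character(df[[nm]])) {
      Encoding(df[[nm]]) <- "UTF-8"  # sets the mark only; bytes stay untouched
    }
  }
  df
}

# Toy input: the UTF-8 bytes of "Süd" with no encoding mark.
raw_sued <- rawToChar(as.raw(c(0x53, 0xc3, 0xbc, 0x64)))
df <- data.frame(note = raw_sued, plz = 99988L, stringsAsFactors = FALSE)
Encoding(declare_utf8(df)$note)
```

Non-character columns are left alone, and pure ASCII strings keep reporting "unknown" because they need no mark.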

sessionInfo()
#> R version 3.4.0 (2017-04-21)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 14393)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Germany.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] sf_0.4-2
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.10    digest_0.6.12   rprojroot_1.2   grid_3.4.0     
#>  [5] DBI_0.6-1       backports_1.0.5 magrittr_1.5    evaluate_0.10  
#>  [9] units_0.4-4     stringi_1.1.5   rmarkdown_1.5   tools_3.4.0    
#> [13] udunits2_0.13   stringr_1.2.0   yaml_2.1.14     compiler_3.4.0 
#> [17] htmltools_0.3.6 knitr_1.15.1

Also note the devtools::session_info() in my previous post.

dpprdan commented 7 years ago

I have done some more tests but cannot really say whether they are helpful or not. So maybe this just shows my quite limited knowledge of character encodings in R, but who knows... Anyway, here we go

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
url_shp <- "https://www.suche-postleitzahl.org/download_v1/webmercator/mittel/plz-5stellig/shapefile/points/plz-5stellig-centroid.shp.zip"
download.file(url_shp, "plz-5stellig-centroid.zip")
unzip("plz-5stellig-centroid.zip")
plz_sf <-
  read_sf(
    "plz-5stellig-centroid.shp",
    options = "ENCODING=UTF-8",
    stringsAsFactors = FALSE
  )

(x <- tail(plz_sf$note))
#> [1] "99986 Vogtei, Kammerforst u.a."       "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                    
#> [5] "99996 Menteroda, Obermehler"          "99998 Körner, Weinbergen" 

Encoding is not declared

Encoding(x)
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

So declaring the encoding solves this?!

x_enc <- x
Encoding(x_enc) <- "UTF-8"
Encoding(x_enc)
#> [1] "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"
x_enc
#> [1] "99986 Vogtei, Kammerforst u.a."      "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                   
#> [5] "99996 Menteroda, Obermehler"         "99998 Körner, Weinbergen"

But why doesn't enc2utf8 work then?

x_enc <- enc2utf8(x)
Encoding(x_enc)
#> [1] "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"
x_enc
#> [1] "99986 Vogtei, Kammerforst u.a."       "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                    
#> [5] "99996 Menteroda, Obermehler"          "99998 Körner, Weinbergen" 
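
One possible explanation, offered as an assumption rather than a diagnosis of this session: enc2utf8() converts from whatever encoding a string is marked with, and an "unknown" mark means the native encoding (CP1252 here). Bytes that are already UTF-8 would then be re-encoded a second time. The sketch below reproduces that effect on any platform by naming latin1 explicitly as the wrong source encoding:

```r
# The UTF-8 bytes of "Süd", with no encoding mark (as a reader might return them).
x <- rawToChar(as.raw(c(0x53, 0xc3, 0xbc, 0x64)))

# Declaring: the bytes are unchanged, only the mark is set.
declared <- x
Encoding(declared) <- "UTF-8"
charToRaw(declared)

# Converting from the wrong source encoding: each UTF-8 byte is treated as
# one latin1 character and encoded again, so the string grows.
converted <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(converted)
```

The declared string still has four bytes, while the converted one has six, which is the double-encoded form that prints as garbage.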

Let's try iconv, then. First I thought I'd have to encode to UTF-8, but that's not it.

x_enc <- iconv(x, from = "", to = "UTF-8")
Encoding(x_enc)
#> [1] "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"
x_enc
#> [1] "99986 Vogtei, Kammerforst u.a."       "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                    
#> [5] "99996 Menteroda, Obermehler"          "99998 Körner, Weinbergen" 

Converting from UTF-8 works. Huh? Ok, I'm lost.

x_enc <- iconv(x, from = "UTF-8", to = "")
Encoding(x_enc)
#> [1] "unknown" "latin1"  "latin1"  "unknown" "unknown" "latin1"
x_enc
#> [1] "99986 Vogtei, Kammerforst u.a."      "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                   
#> [5] "99996 Menteroda, Obermehler"         "99998 Körner, Weinbergen"

But I want UTF-8. So I have to do this?

x_enc <- iconv(x, from = "UTF-8", to = "") %>% enc2utf8(.)
Encoding(x_enc)
#> [1] "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"
x_enc
#> [1] "99986 Vogtei, Kammerforst u.a."      "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                   
#> [5] "99996 Menteroda, Obermehler"         "99998 Körner, Weinbergen"

readr::parse_character also works

x_enc <- readr::parse_character(x)
Encoding(x_enc)
#> [1] "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"
x_enc
#> [1] "99986 Vogtei, Kammerforst u.a."      "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                   
#> [5] "99996 Menteroda, Obermehler"         "99998 Körner, Weinbergen"

So to automate this one could do

plz_sf %>% dplyr::mutate_if(is.character, readr::parse_character) %>% tail()
#> Simple feature collection with 6 features and 4 fields
#> geometry type:  MULTIPOINT
#> dimension:      XY
#> bbox:           xmin: 1143297 ymin: 6646566 xmax: 1187078 ymax: 6673026
#> epsg (SRID):    3857
#> proj4string:    +proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext  +no_defs
#>      einwohner                                note   plz      qkm
#> 8176      6097      99986 Vogtei, Kammerforst u.a. 99986 86.61225
#> 8177      4866                  99988 Südeichsfeld 99988 40.30266
#> 8178      3528 99991 Großengottern, Heroldishausen 99991 40.92520
#> 8179      4243                    99994 Schlotheim 99994 28.64306
#> 8180      2933         99996 Menteroda, Obermehler 99996 43.10865
#> 8181      4942            99998 Körner, Weinbergen 99998 74.90953
#>                            geometry
#> 8176 MULTIPOINT(1158961.15480396...
#> 8177 MULTIPOINT(1143296.58920505...
#> 8178 MULTIPOINT(1178166.27891215...
#> 8179 MULTIPOINT(1187077.74352969...
#> 8180 MULTIPOINT(1178820.42957709...
#> 8181 MULTIPOINT(1176439.64196295...

until you find a better solution for sf?

edzer commented 7 years ago

My gut feeling is that the solution to this will be associated with sf, but in abandoning shapefiles. Maybe we should start issuing warnings to users that there are better alternatives.

@rsbivand do you have any better idea?

rsbivand commented 7 years ago

See also:

library(rgdal)
vignette("OGR_shape_encoding")
dpprdan commented 7 years ago

@rsbivand: Yeah, I've repeatedly tried to make sense of the vignette and what it means to my setting, but I still haven't figured it out.

@edzer: What do you mean by "abandoning shapefiles" and "better alternatives"? Shapefiles are often the only format available, even from official sources, e.g. the Bundesamt für Kartografie und Geodäsie.

Actually my gut feeling is that this is an R platform issue rather than an OGR issue (assuming that OGR does not behave differently on different platforms?), since you don't seem to have these problems with the same Shapefile on Ubuntu, @edzer?

edzer commented 7 years ago

It is then time to tell the officers behind those official sources that they should switch to better and open alternatives.

@rsbivand If OGR indeed always returns UTF-8 strings, then on UTF-8 platforms there is no problem, and setting

Encoding(x) = "UTF-8"

on all character strings, variable and layer names should give non-UTF-8 platforms the possibility to adapt to local encoding, as (I believe) illustrated above, right? Or am I misreading the vignette?

rsbivand commented 7 years ago

The vignette was written in late 2008, and I haven't revisited it. I was running a course on which there were people with needs, so we found work-arounds that are also in the readOGR() code branching on shapefiles. The ESRI workaround originally was the cpg file to give the codepage on Windows systems. The distinction is between Windows GDAL built with or without iconv - without iconv, it tried some odd fixes, with iconv IIRC was predictable.
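
As an illustration of the cpg workaround mentioned above, the sidecar file can be inspected from R before reading. This is a hedged sketch: read_cpg is a made-up helper, it assumes the sidecar sits next to the .shp with a lowercase .cpg extension, and it assumes the first line holds the codepage name:

```r
# Return the first line of the .cpg sidecar next to a shapefile, or NA if absent.
read_cpg <- function(shp_path) {
  cpg_path <- paste0(tools::file_path_sans_ext(shp_path), ".cpg")
  if (!file.exists(cpg_path)) return(NA_character_)
  line <- readLines(cpg_path, n = 1L, warn = FALSE)
  if (length(line) == 0L) NA_character_ else trimws(line)  # empty file: no information
}

# Toy check in a temporary directory.
d <- tempfile()
dir.create(d)
shp <- file.path(d, "layer.shp")
file.create(shp)
writeLines("UTF-8", file.path(d, "layer.cpg"))
read_cpg(shp)
```

An NA result only means that no sidecar was found; it says nothing about the actual encoding of the attribute table.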

I agree that GPKG is a much more robust container (so does ESRI) ...

edzer commented 7 years ago

see https://github.com/edzer/sfr/commit/2c5254632ef6669b1792ac9e9420f89783c5b832

edzer commented 7 years ago

@dpprdan It would be great if you could test this patch. Do you build from source, or shall we try to use win-builder?

edzer commented 7 years ago

Win-builder .zip file found here -- please check whether st_read now gives you place names in the right encoding in the example above without specifying anything.

dpprdan commented 7 years ago

Will check now (and no, I don't build from source myself, nor would I know how to). But first: this is not a Shapefile issue:

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
url_gj <- "https://www.suche-postleitzahl.org/download_v1/webmercator/mittel/plz-5stellig/geojson/points/plz-5stellig-centroid.geojson.zip"
download.file(url_gj, "plz-5stellig-centroid.geojson.zip")
unzip("plz-5stellig-centroid.geojson.zip")
plz_gj_sf <-
  st_read(
    "plz-5stellig-centroid.geojson",
    stringsAsFactors = FALSE
  )
#> Reading layer `OGRGeoJSON' from data source `C:\Users\daniel\AppData\Local\Temp\Rtmp4Owepj\plz-5stellig-centroid.geojson' using driver `GeoJSON'
#> converted into: POINT
#> Simple feature collection with 8181 features and 4 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 666584 ymin: 6002128 xmax: 1667886 ymax: 7365790
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
tail(plz_gj_sf$note)
#> [1] "99986 Vogtei, Kammerforst u.a."       "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                    
#> [5] "99996 Menteroda, Obermehler"          "99998 Körner, Weinbergen" 

Session info

``` r
devtools::session_info("sf")
#> Session info -------------------------------------------------------------
#>  setting  value
#>  version  R version 3.4.0 (2017-04-21)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.1252
#>  tz       Europe/Berlin
#>  date     2017-05-12
#> Packages -----------------------------------------------------------------
#>  package   * version date       source
#>  DBI         0.6-1   2017-04-01 CRAN (R 3.3.3)
#>  graphics  * 3.4.0   2017-04-21 local
#>  grDevices * 3.4.0   2017-04-21 local
#>  grid        3.4.0   2017-04-21 local
#>  magrittr    1.5     2014-11-22 CRAN (R 3.3.0)
#>  methods   * 3.4.0   2017-04-21 local
#>  Rcpp        0.12.10 2017-03-19 CRAN (R 3.3.3)
#>  sf          0.4-2   2017-05-05 CRAN (R 3.4.0)
#>  stats     * 3.4.0   2017-04-21 local
#>  tools       3.4.0   2017-04-21 local
#>  udunits2    0.13    2016-11-17 CRAN (R 3.3.2)
#>  units       0.4-4   2017-04-20 CRAN (R 3.3.3)
#>  utils     * 3.4.0   2017-04-21 local
```

dpprdan commented 7 years ago

Looking good!

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
url_shp <- "https://www.suche-postleitzahl.org/download_v1/webmercator/mittel/plz-5stellig/shapefile/points/plz-5stellig-centroid.shp.zip"
download.file(url_shp, "plz-5stellig-centroid.zip")
unzip("plz-5stellig-centroid.zip")
plz_sf <-
  read_sf(
    "plz-5stellig-centroid.shp",
    options = "ENCODING=UTF-8",
    stringsAsFactors = FALSE
  )
tail(plz_sf$note)
#> [1] "99986 Vogtei, Kammerforst u.a."      "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                   
#> [5] "99996 Menteroda, Obermehler"         "99998 Körner, Weinbergen"  

url_gj <- "https://www.suche-postleitzahl.org/download_v1/webmercator/mittel/plz-5stellig/geojson/points/plz-5stellig-centroid.geojson.zip"
download.file(url_gj, "plz-5stellig-centroid.geojson.zip")
unzip("plz-5stellig-centroid.geojson.zip")
plz_gj_sf <-
  st_read(
    "plz-5stellig-centroid.geojson",
    # options = "ENCODING=UTF-8", # option ENCODING not supported by GDAL GeoJSON driver
    stringsAsFactors = FALSE
  )
#> Reading layer `OGRGeoJSON' from data source `C:\Users\daniel\AppData\Local\Temp\RtmpY3aUvD\plz-5stellig-centroid.geojson' using driver `GeoJSON'
#> converted into: POINT
#> Simple feature collection with 8181 features and 4 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 666584 ymin: 6002128 xmax: 1667886 ymax: 7365790
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
tail(plz_gj_sf$note)
#> [1] "99986 Vogtei, Kammerforst u.a."      "99988 Südeichsfeld"                 
#> [3] "99991 Großengottern, Heroldishausen" "99994 Schlotheim"                   
#> [5] "99996 Menteroda, Obermehler"         "99998 Körner, Weinbergen"  

Session info

``` r
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value
#>  version  R version 3.4.0 (2017-04-21)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.1252
#>  tz       Europe/Berlin
#>  date     2017-05-12
#> Packages -----------------------------------------------------------------
#>  package   * version date       source
#>  backports   1.0.5   2017-01-18 CRAN (R 3.3.2)
#>  base      * 3.4.0   2017-04-21 local
#>  compiler    3.4.0   2017-04-21 local
#>  datasets  * 3.4.0   2017-04-21 local
#>  DBI         0.6-1   2017-04-01 CRAN (R 3.3.3)
#>  devtools    1.13.0  2017-05-08 CRAN (R 3.4.0)
#>  digest      0.6.12  2017-01-27 CRAN (R 3.3.2)
#>  evaluate    0.10    2016-10-11 CRAN (R 3.3.1)
#>  graphics  * 3.4.0   2017-04-21 local
#>  grDevices * 3.4.0   2017-04-21 local
#>  grid        3.4.0   2017-04-21 local
#>  htmltools   0.3.6   2017-04-28 CRAN (R 3.4.0)
#>  knitr       1.15.1  2016-11-22 CRAN (R 3.3.2)
#>  magrittr    1.5     2014-11-22 CRAN (R 3.3.0)
#>  memoise     1.1.0   2017-04-21 CRAN (R 3.3.3)
#>  methods   * 3.4.0   2017-04-21 local
#>  Rcpp        0.12.10 2017-03-19 CRAN (R 3.3.3)
#>  rmarkdown   1.5     2017-04-26 CRAN (R 3.3.3)
#>  rprojroot   1.2     2017-01-16 CRAN (R 3.3.2)
#>  sf        * 0.4-3   2017-05-12 local
#>  stats     * 3.4.0   2017-04-21 local
#>  stringi     1.1.5   2017-04-07 CRAN (R 3.3.3)
#>  stringr     1.2.0   2017-02-18 CRAN (R 3.3.3)
#>  tools       3.4.0   2017-04-21 local
#>  udunits2    0.13    2016-11-17 CRAN (R 3.3.2)
#>  units       0.4-4   2017-04-20 CRAN (R 3.3.3)
#>  utils     * 3.4.0   2017-04-21 local
#>  withr       1.0.2   2016-06-20 CRAN (R 3.3.1)
#>  yaml        2.1.14  2016-11-12 CRAN (R 3.3.2)
```

Will you be releasing this shortly? And do I have to do anything to get the CRAN version once it's out?

edzer commented 7 years ago

Great, thanks. Yes, will release soon.

edzer commented 7 years ago

Released, see here.

Closing now, please reopen if new encoding issues come up!

fzenoni commented 7 years ago

It seems I have encountered an encoding issue with a specific GeoJSON. The unzipped file is about 220 MB, sorry about that.

library(sf)

url_gj <- 'http://data-mobility.irisnet.be/resources/parkingonroad-2016-01-01.json.zip'
download.file(url_gj, "parkingroad.json.zip")
unzip('parkingroad.json.zip')

pk_gj_sf <- st_read('parkingonroad.geojson', stringsAsFactors = FALSE)
head(pk_gj_sf$type_fr)
#> [1] "Reserv\xe9"                     "Ni reserv\xe9 ni reglement\xe9" "Ni reserv\xe9 ni reglement\xe9"
#> [4] "Ni reserv\xe9 ni reglement\xe9" "Ni reserv\xe9 ni reglement\xe9" "Ni reserv\xe9 ni reglement\xe9"

Session info

```
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] sf_0.5-4

loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5   DBI_0.7        tools_3.4.1    units_0.4-5    yaml_2.1.14    Rcpp_0.12.12
[8] udunits2_0.13  grid_3.4.1
```

rsbivand commented 7 years ago

Unless the file includes information specifying the encoding of field contents, you may have to sort them out yourself. This looks like Windows CP1252 or latin1 or similar, so use iconv() to convert the values to your own platform, which is (Linux and OSX) UTF-8. Why anyone would still generate a portable file with non-portable encodings is hard to grasp.

On a Linux UTF-8 system:

> head(iconv(pk_gj_sf$type_fr, from="CP1252"))
[1] "Reservé"                  "Ni reservé ni reglementé"
[3] "Ni reservé ni reglementé" "Ni reservé ni reglementé"
[5] "Ni reservé ni reglementé" "Ni reservé ni reglementé"

Note that rgdal::readOGR() has encoding= and use_iconv= arguments:

> pk_gj_sf1 <- readOGR('parkingonroad.geojson', stringsAsFactors = FALSE, encoding="CP1252", use_iconv=TRUE)
OGR data source with driver: GeoJSON 
Source: "parkingonroad.geojson", layer: "parkingonroad"
with 265070 features
It has 29 fields
> head(pk_gj_sf1$type_fr)
[1] "Reservé"                  "Ni reservé ni reglementé"
[3] "Ni reservé ni reglementé" "Ni reservé ni reglementé"
[5] "Ni reservé ni reglementé" "Ni reservé ni reglementé"
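
When a column may mix clean and badly encoded values, one cautious variant of the iconv() call above is to convert only the elements that are not already valid UTF-8. This is a sketch, not part of sf or rgdal; recode_if_invalid is a made-up name and the from value remains a guess about the source, not something the code detects:

```r
# Convert only the elements that are not already valid UTF-8.
recode_if_invalid <- function(x, from = "CP1252") {
  bad <- !is.na(x) & !validUTF8(x)
  x[bad] <- iconv(x[bad], from = from, to = "UTF-8")
  x
}

# "Ni" is plain ASCII; the second element holds the CP1252 bytes of "Ré".
x <- c("Ni", rawToChar(as.raw(c(0x52, 0xe9))))
recode_if_invalid(x)
```

Strings that are already valid UTF-8 pass through untouched, so running the helper twice does no harm.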
dpprdan commented 7 years ago

@fzenoni: As @rsbivand already pointed out, the problem is with the encoding of the source file, not with sf. Actually, JSON files should be encoded in a Unicode encoding (UTF-8, UTF-16 or UTF-32) according to the JSON specs (which apply to GeoJSON by extension). I guess sf/GDAL expects UTF-8 as the default?! So you could reach out to the issuer of the data to make them aware of this (and possibly send them this primer on character encodings :wink:).

Since you are stuck with what they currently provide for now, you could either use rgdal::readOGR() as suggested by @rsbivand and convert that to an sf object with st_as_sf().

Or you could convert the GeoJSON file to UTF-8 outside R first (see e.g. here) and then import it like you did with st_read(). This results in:

head(pk_gj_utf8_sf$type_fr)
#> [1] "Reservé"                  "Ni reservé ni reglementé" "Ni reservé ni reglementé"
#> [4] "Ni reservé ni reglementé" "Ni reservé ni reglementé" "Ni reservé ni reglementé"
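
The conversion to UTF-8 could presumably also be done from within R before calling st_read(). A minimal sketch in base R, assuming the source really is CP1252; recode_file and the output file name in the usage note are invented for illustration:

```r
# Re-encode a text file to UTF-8. The whole file is held in memory, which may
# be slow for inputs in the hundreds of megabytes.
recode_file <- function(src, dst, from = "CP1252") {
  txt <- readLines(src, warn = FALSE)           # bytes are kept as read
  txt <- iconv(txt, from = from, to = "UTF-8")  # NA marks lines that failed to convert
  if (anyNA(txt)) stop("some lines could not be converted from ", from)
  con <- file(dst, open = "wb")                 # binary mode: no newline translation
  on.exit(close(con))
  writeLines(txt, con, useBytes = TRUE)
  invisible(dst)
}
```

A call such as recode_file("parkingonroad.geojson", "parkingonroad-utf8.geojson") would then produce a copy that st_read() can read without further recoding.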
fzenoni commented 7 years ago

@rsbivand and @dpprdan, I have no words to express my gratitude for your very prompt and complete answer. Being definitely a beginner in this field, I don't have that many strings to my bow, so this problem quickly turned into a cul-de-sac. Thanks a lot for the documentation as well, I was really eager to clarify some concepts, and I will certainly contact the issuer of the data.

dpprdan commented 7 years ago

Just as a heads-up: there is a similar problem when reading strings from a PostgreSQL database with st_read_db() on Windows (CP1252), e.g. "Münster" becomes "MÃ¼nster". This could be fixed here if st_read_db() also called set_utf8 (like st_read does). However, this should rather be fixed upstream, see https://github.com/tomoakin/RPostgreSQL/issues/52 and https://github.com/rstats-db/DBI/issues/116.
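
Until an upstream fix lands, strings that have already been garbled this way can sometimes be repaired after the fact. The following is a workaround sketch only, assuming exactly one round of "UTF-8 bytes decoded as CP1252"; fix_mojibake is a made-up helper and not part of sf, DBI or RPostgreSQL:

```r
# Undo one round of "UTF-8 bytes decoded as CP1252".
# Elements that do not round-trip to valid UTF-8 are returned unchanged.
fix_mojibake <- function(x) {
  y <- iconv(enc2utf8(x), from = "UTF-8", to = "CP1252")  # recover the original bytes
  Encoding(y) <- "UTF-8"                                   # declare what those bytes are
  ok <- !is.na(y) & validUTF8(y)
  x[ok] <- y[ok]
  x
}

# The input is the garbled form of "Münster", written with escapes so that the
# example does not depend on the encoding of this page.
fix_mojibake("M\u00c3\u00bcnster")
```

Correctly encoded strings fail the validity check after the round trip and are left alone, but a genuine string that happens to look like mojibake would be changed, so this is best applied only to columns known to be affected.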

tblazina commented 5 years ago

@dpprdan Do you know of any resolution for this with regard to PostgreSQL connections? When I use st_read with an RPostgres::dbConnect() object on a Windows machine, all the encodings get messed up; for example, "Fläche" becomes "FlÃ¤che".

dpprdan commented 5 years ago

@tblazina If I assume correctly that "Fläche" is a column name, then this is the bug tracked in https://github.com/r-dbi/RPostgres/issues/172

tblazina commented 5 years ago

@dpprdan That is correct, the column name is "Fläche", but it's also all values within the data frame.

dpprdan commented 5 years ago

@tblazina

but it's also all values within the data frame.

I can't reproduce that (see below). Do you have current versions of {RPostgres} (1.1.1) and {sf} (0.7-2) installed? If you do, can you make a reproducible example?

library('RPostgres')
library('sf')
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

pg_con <- 
  dbConnect(
    RPostgres::Postgres(),
    user = Sys.getenv("pg_localhost_user"),
    password = Sys.getenv("pg_localhost_user"), 
    dbname = "quis_local"
  )

nc <- st_read(
  system.file("gpkg/nc.gpkg", package="sf"), 
  query = 'SELECT "NAME" AS name, "geom" AS geometry FROM "nc.gpkg" LIMIT 3',
  quiet = TRUE
)
nc$name <- c("fläche", "flöche", "flüche")
names(nc)[names(nc)=="name"] <- "fl\u00e4che" # colname as UTF-8, else error
nc
#> Simple feature collection with 3 features and 1 field
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -81.74107 ymin: 36.23388 xmax: -80.43531 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#>   fläche                       geometry
#> 1 fläche MULTIPOLYGON (((-81.47276 3...
#> 2 flöche MULTIPOLYGON (((-81.23989 3...
#> 3 flüche MULTIPOLYGON (((-80.45634 3...
dbWriteTable(conn = pg_con, name = "nc", value = nc, temporary = TRUE)
#> Note: method with signature 'DBIObject#sf' chosen for function 'dbDataType',
#>  target signature 'PqConnection#sf'.
#>  "PqConnection#ANY" would also be valid
(dbx <- st_read(dsn = pg_con, layer = "nc"))
#> Simple feature collection with 3 features and 1 field
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -81.74107 ymin: 36.23388 xmax: -80.43531 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#>   flÃ.che                       geometry
#> 1  fläche MULTIPOLYGON (((-81.47276 3...
#> 2  flöche MULTIPOLYGON (((-81.23989 3...
#> 3  flüche MULTIPOLYGON (((-80.45634 3...

dbDisconnect(pg_con)
Session info

``` r
devtools::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value
#>  version  R version 3.5.2 (2018-12-20)
#>  os       Windows 10 x64
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.1252
#>  ctype    German_Germany.1252
#>  tz       Europe/Berlin
#>  date     2019-02-14
#>
#> - Packages --------------------------------------------------------------
#>  package     * version    date       lib source
#>  assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.1)
#>  backports     1.1.3      2018-12-14 [1] CRAN (R 3.5.1)
#>  bit           1.1-14     2018-05-29 [1] CRAN (R 3.5.0)
#>  bit64         0.9-7      2017-05-08 [1] CRAN (R 3.5.0)
#>  blob          1.1.1      2018-03-25 [1] CRAN (R 3.5.1)
#>  callr         3.1.1      2018-12-21 [1] CRAN (R 3.5.2)
#>  class         7.3-14     2015-08-30 [2] CRAN (R 3.5.2)
#>  classInt      0.3-1      2018-12-18 [1] CRAN (R 3.5.2)
#>  cli           1.0.1      2018-09-25 [1] CRAN (R 3.5.1)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.1)
#>  DBI           1.0.0      2018-05-02 [1] CRAN (R 3.5.1)
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.1)
#>  devtools      2.0.1      2018-10-26 [1] CRAN (R 3.5.1)
#>  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.1)
#>  e1071         1.7-0.1    2019-01-21 [1] CRAN (R 3.5.2)
#>  evaluate      0.13       2019-02-12 [1] CRAN (R 3.5.2)
#>  fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.1)
#>  glue          1.3.0      2018-07-17 [1] CRAN (R 3.5.1)
#>  highr         0.7        2018-06-09 [1] CRAN (R 3.5.1)
#>  hms           0.4.2.9001 2019-01-23 [1] Github (tidyverse/hms@cb175bb)
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.1)
#>  knitr         1.21       2018-12-10 [1] CRAN (R 3.5.1)
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.1)
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.1)
#>  pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.1)
#>  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.1)
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.1)
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.1)
#>  processx      3.2.1      2018-12-05 [1] CRAN (R 3.5.1)
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.2)
#>  R6            2.3.0      2018-10-04 [1] CRAN (R 3.5.1)
#>  Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.5.1)
#>  remotes       2.0.2      2018-10-30 [1] CRAN (R 3.5.1)
#>  rlang         0.3.1      2019-01-08 [1] CRAN (R 3.5.2)
#>  rmarkdown     1.11       2018-12-08 [1] CRAN (R 3.5.1)
#>  RPostgres   * 1.1.1      2018-05-06 [1] CRAN (R 3.5.1)
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.1)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.1)
#>  sf          * 0.7-2      2018-12-20 [1] CRAN (R 3.5.2)
#>  stringi       1.2.4      2018-07-20 [1] CRAN (R 3.5.1)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.5.2)
#>  testthat      2.0.1      2018-10-13 [1] CRAN (R 3.5.1)
#>  units         0.6-2      2018-12-05 [1] CRAN (R 3.5.1)
#>  usethis       1.4.0      2018-08-14 [1] CRAN (R 3.5.1)
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.1)
#>  xfun          0.4        2018-10-23 [1] CRAN (R 3.5.1)
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.1)
#>
#> [1] C:/Users/daniel/Documents/.R/win-library
#> [2] C:/Program Files/R/R-3.5.2/library
```
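To narrow down whether only the column names or also the values are affected, one option is to look at the encodings R has declared for each of them. The helper below is a hypothetical sketch in base R (the name `encoding_report` is not part of sf, DBI or RPostgres); it only reports what `Encoding()` returns and does not change anything:

```r
# Hypothetical helper: report the declared encoding of the column names and
# of every character column in a data frame
encoding_report <- function(df) {
  cols <- unclass(df)  # plain named list of columns
  is_chr <- vapply(cols, is.character, logical(1))
  list(
    names  = Encoding(names(cols)),
    values = lapply(cols[is_chr], function(col) unique(Encoding(col)))
  )
}

# Small illustration with one non-ASCII column name and one non-ASCII value
df <- data.frame(x = 1:2)
df[["fl\u00e4che"]] <- c("fl\u00e4che", "plain")
report <- encoding_report(df)
report$names   # declared encoding per column name
report$values  # declared encodings found in each character column
```

Pure ASCII strings are always reported as "unknown", so only entries containing non-ASCII characters are informative here.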
Habitat-Projects commented 3 years ago

Note: I was also having trouble with accented characters when connecting to a geodatabase through dbConnect and PostgreSQL. For example:

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host = "localhost", user="NAME", port="###", password="PASSWORD", dbname="NAME")
object <- st_read(con)

The problem was resolved when I switched to RPostgres::Postgres() for the driver, as suggested above:

con <- dbConnect(RPostgres::Postgres(), host = "localhost", user="NAME", port="###", password="PASSWORD", dbname="NAME")
object <- st_read(con)
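Where switching drivers is not possible, a post-hoc repair may be worth trying. The sketch below assumes (this is not confirmed in this thread) that the driver returns correct UTF-8 bytes which R then treats as native-encoded; in that case declaring the strings as UTF-8 is enough. `mark_utf8` is a hypothetical helper, not a function from sf or any database package, and it is written for a plain data frame:

```r
# Hypothetical helper: declare every character column (and the column names)
# of a data frame to be UTF-8, without changing any bytes
mark_utf8 <- function(df) {
  for (j in seq_along(df)) {
    if (is.character(df[[j]])) Encoding(df[[j]]) <- "UTF-8"
  }
  nms <- names(df)
  Encoding(nms) <- "UTF-8"
  names(df) <- nms
  df
}

# Simulate a value that arrived as UTF-8 bytes without an encoding mark:
# these are the bytes of "Fläche"
unmarked <- rawToChar(as.raw(c(0x46, 0x6c, 0xc3, 0xa4, 0x63, 0x68, 0x65)))
df <- data.frame(id = 1L)
df$label <- unmarked

fixed <- mark_utf8(df)
Encoding(fixed$label)  # declared encoding after the repair
```

If the strings were already converted to a different encoding on the way in, re-marking them will not help and the bytes themselves would need converting with `iconv()` instead.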