st_read() breaks with column names that have weird encodings

theiostream commented 4 years ago

I tried to import the following shapefile using st_read: http://dados.prefeitura.sp.gov.br/dataset/b7242ef1-3add-4ce9-8e74-af7f9288762a/resource/84055ad9-49d6-46dc-af08-573ae1012d48/download/layerfavelas2015.zip.

It broke with the following error:

Error in make.names(vnames, unique = TRUE) : string multibyte inválida 2
Erros durante o embrulho: string multibyte inválida em '<b2>'

(Sorry for the non-English error message, but it's clearly just saying "invalid multibyte string").
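For readers who want to see the failure in isolation, here is a minimal base-R sketch. It assumes the offending field name is "AREA_m" followed by the single byte 0xB2 (as the '<b2>' in the message suggests) and that the intended character is the Latin-1 superscript two; both are assumptions, not something read from the file itself.

```r
# Hypothetical reconstruction of the field name: "AREA_m" plus the byte 0xB2.
raw_name <- rawToChar(as.raw(c(0x41, 0x52, 0x45, 0x41, 0x5f, 0x6d, 0xb2)))

# A lone 0xB2 byte is not valid UTF-8, which is the kind of input that
# string functions reject as an invalid multibyte string in a UTF-8 locale.
is_valid <- validUTF8(raw_name)

# Declaring the source encoding (assumed to be Latin-1 here) gives valid UTF-8.
fixed <- iconv(raw_name, from = "latin1", to = "UTF-8")
is_valid_fixed <- validUTF8(fixed)
```

If the assumption about the source encoding is wrong, the conversion still succeeds but yields a different character, so a clean result is not proof of the right code page.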

I assumed this had something to do with encoding, so I opened the shapefile in QGIS, renamed the column whose name contained the weird character sequence, and st_read then imported it fine.

I wonder if st_read() should have a locale option, though? Or maybe one already exists and I didn't realize?

edzer commented 4 years ago

I'm a bit at a loss whether this actually should work: rgdal::readOGR breaks on the same error, as does foreign::read.dbf("LAYER_FAVELAS_2015/DEINFO_FAVELAS_2015.dbf"). How do you set your locale? Can you read the dbf with read.dbf?

rsbivand commented 4 years ago

The DBF reads into LibreOffice Calc when ISO-8859-1 is chosen as the encoding. Editing the square-metre superscript out of the column name resolves the problem. I'll look to see where something might be done in rgdal::ogrInfo(). Further, the cpg file says System, which is wrong. With the original DBF:

> ogrInfo("DEINFO_FAVELAS_2015.shp")
Source: "/home/rsb/tmp/bigshape/LAYER_FAVELAS_2015/DEINFO_FAVELAS_2015.shp", layer: "DEINFO_FAVELAS_2015"
Driver: ESRI Shapefile; number of rows: 1677 
Feature type: wkbPolygon with 2 dimensions
Extent: (315618.7 7358833) - (360628 7411543)
CRS: +proj=utm +zone=23 +south +ellps=aust_SA +towgs84=-66.87,4.37,-38.52,0,0,0,0 +units=m +no_defs 
LDID: 0 
Number of fields: 7 
        name type length  typeName
1         ID    4      5    String
2 AREA_m\xb2   12     10 Integer64
3       NOME    4    100    String
4  DATULTATZ    9     10      Date
5   ENDERECO    4    200    String
6    NOMESEC    4    150    String
7    PROPRTR    4    100    String
Warning message:
In OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS = dumpSRS,  :
  Discarded datum South_American_Datum_1969 in CRS definition: +proj=utm +zone=23 +south +ellps=aust_SA +towgs84=-66.87,4.37,-38.52,0,0,0,0 +units=m +no_defs

rsbivand commented 4 years ago

I refer to https://cran.r-project.org/web/packages/rgdal/vignettes/OGR_shape_encoding.pdf, and find on my system with iconv that:

library(rgdal)
getCPLConfigOption("SHAPE_ENCODING")
setCPLConfigOption("SHAPE_ENCODING", "CP1250")
o <- ogrInfo("DEINFO_FAVELAS_2015.shp")
o
oo <- readOGR("DEINFO_FAVELAS_2015.shp")
setCPLConfigOption("SHAPE_ENCODING", NULL)

works. I'm unsure whether the CP is correct, but at least it doesn't break.

edzer commented 4 years ago

Thanks! After

Sys.setenv("SHAPE_ENCODING"= "CP1250")

I also get

> library(sf)
Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
> read_sf("/tmp/LAYER_FAVELAS_2015/")
Simple feature collection with 1677 features and 7 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 315618.7 ymin: 7358833 xmax: 360628 ymax: 7411543
projected CRS:  SAD69 / UTM zone 23S
# A tibble: 1,677 x 8
   ID    AREA_m. NOME  DATULTATZ  ENDERECO NOMESEC PROPRTR
   <chr>   <dbl> <chr> <date>     <chr>    <chr>   <chr>  
 1 1       23687 Parq… 2011-07-04 Rua Dom… Pirapo… Munici…
 2 2         404 Vila… 2011-06-16 Rua Con… NA      Munici…
 3 3         486 Pedr… 2012-04-18 Avenida… NA      Munici…
 4 4         453 Tols… 2011-01-19 Avenida… NA      NA     
 5 5       20843 Vila… 2011-08-11 R Júlio… Justin… Munici…
 6 6        3416 Sant… 2011-08-11 Rua Dua… NA      NA     
 7 7        1455 Esme… 2012-02-10 Avenida… NA      Munici…
 8 8        1658 Vila… 2010-11-23 Av. Pro… Vila D… Munici…
 9 9        4155 Viel… 2011-06-16 Rua Ope… Viela … NA     
10 10       6129 Mauro 2011-01-19 Avenida… Whitak… Munici…
# … with 1,667 more rows, and 1 more variable: geometry <MULTIPOLYGON [m]>
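Since the environment variable affects every later shapefile read in the session, one option is to set it only for the duration of a single call. The sketch below is base R only; with_shape_encoding() is a hypothetical helper name, not part of sf, and it assumes (as the output above suggests) that the variable is picked up at read time.

```r
# Hypothetical helper: set SHAPE_ENCODING, evaluate an expression, then
# restore whatever value (or absence of value) was there before.
with_shape_encoding <- function(encoding, expr) {
  old <- Sys.getenv("SHAPE_ENCODING", unset = NA)
  Sys.setenv(SHAPE_ENCODING = encoding)
  on.exit(
    if (is.na(old)) Sys.unsetenv("SHAPE_ENCODING") else Sys.setenv(SHAPE_ENCODING = old),
    add = TRUE
  )
  force(expr)  # expr is evaluated lazily, so it runs with the variable set
}

# Usage (not run here; needs sf and the data):
# favelas <- with_shape_encoding("CP1250", sf::read_sf("/tmp/LAYER_FAVELAS_2015/"))
```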

theiostream commented 4 years ago

Would it make sense for st_read() to take an encoding option, then? Taking values like those of SHAPE_ENCODING, that is.

rsbivand commented 4 years ago

No, because the example shows how to use an environment variable. In rgdal, the problem was resolved by using GDAL's internal CPL variables 12 years ago, but then shapefiles needed to be read and written often. ESRI shapefiles are end-of-life, also for ESRI. So existing files should be read once, and then written preferably as GeoPackage.
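The read-once-then-convert advice could be sketched as follows. shp_to_gpkg() is a hypothetical wrapper, the file names in the usage line are placeholders, and the encoding still has to be supplied by someone who knows the provenance of the data.

```r
# Hypothetical wrapper: read a shapefile once under a known encoding and
# write it out as GeoPackage, which stores text as UTF-8.
shp_to_gpkg <- function(shp, gpkg, encoding) {
  stopifnot(requireNamespace("sf", quietly = TRUE), file.exists(shp))
  Sys.setenv(SHAPE_ENCODING = encoding)
  # Note: this clears the variable afterwards rather than restoring a
  # previous value.
  on.exit(Sys.unsetenv("SHAPE_ENCODING"), add = TRUE)
  x <- sf::read_sf(shp)
  sf::write_sf(x, gpkg)  # driver is guessed from the .gpkg extension
  invisible(x)
}

# Usage (not run here):
# shp_to_gpkg("DEINFO_FAVELAS_2015.shp", "favelas_2015.gpkg", "CP1250")
```

After the conversion, later reads of the .gpkg file should not need any encoding setting.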

EmsAlan commented 2 years ago

This is not practical at all. I have the same problem, and the thing is that I don't know the encoding used on the other computer. Furthermore, there will always be old files that people may have to access in the future; just because you say everyone should switch to GeoPackage doesn't mean there are no old files left to handle. Can't you just fix this and offer something like Python's parameters, char_decode_errors='replace', encoding='utf-8'? Then at least people using your package wouldn't have to fix your encoding problems or, as your reply implies, be left with no way to read the data anymore.

RJGrayEcology commented 1 year ago

I have the same problem. Is there any fix for this??

edzer commented 1 year ago

> I have the same problem. Is there any fix for this??

Did you try the fix suggested above?

rsbivand commented 1 year ago

And please try to find out the originating encoding: the likelihood of introducing errors by guessing in software is much greater than for users, who should know something of the provenance of their data.

This is really only a problem for data originating in ESRI Shapefiles and on older Windows systems. In https://en.wikipedia.org/wiki/Windows_code_page (see also https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers), you'll see some candidates. Given your engagement in various parts of the world (https://www.rjgrayecology.com/about.html#/), you'll need to work backwards to the possible source code pages.

One possibility is to examine any *.cpg file provided in the multi-file ESRI Shapefile bundle you have received. Another is to try to read the *.dbf file into a GUI spreadsheet; for me, LibreOffice Calc starts with a list of possible input encodings. Reading point.zip as Western Europe (ISO-8859-1, aka Latin1) gives a result which is wrong; reading it as Eastern Europe (Windows-1250/WinLatin 2) gives one which is correct. (This is the example from the defunct rgdal vignette.)
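Checking the *.cpg sidecar can be done from R as well. The following is a small base-R sketch; read_cpg() is a hypothetical helper, and it assumes the sidecar sits next to the .shp with the same base name and a lower-case .cpg extension.

```r
# Hypothetical helper: return the declared code page from the .cpg sidecar,
# or NA if there is no such file or it is empty.
read_cpg <- function(shp) {
  cpg <- sub("\\.shp$", ".cpg", shp, ignore.case = TRUE)
  if (!file.exists(cpg)) return(NA_character_)
  x <- readLines(cpg, n = 1L, warn = FALSE)
  if (length(x)) trimws(x) else NA_character_
}

# Usage (not run here):
# read_cpg("point.shp")
```

As noted earlier in the thread, the declared value can itself be wrong (for example "System"), so treat it as a hint rather than a fact.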

Once you think that you have a candidate, use the suggestions above to try:

sf::gdal_utils("ogrinfo", "point.shp", options="-al")

giving for me (rendered on a UTF-8 platform):

  NAZEV (String) = St°Ýte× nad Ludinou

and with Sys.setenv("SHAPE_ENCODING"="CP1250"):

  NAZEV (String) = Střítež nad Ludinou
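The same comparison can be tried on a handful of bytes without GDAL. In this base-R sketch the bytes are what CP1250 stores for the first word of the name above; the list of candidate code pages is only an illustration, and the names accepted by iconv() can vary by platform.

```r
# Bytes of the first word as CP1250 would store them.
bytes <- as.raw(c(0x53, 0x74, 0xf8, 0xed, 0x74, 0x65, 0x9e))
stored <- rawToChar(bytes)

# Decode the same bytes under several candidate source encodings.
candidates <- c("latin1", "CP1252", "CP1250")
decoded <- vapply(
  candidates,
  function(enc) iconv(stored, from = enc, to = "UTF-8"),
  character(1)
)
print(decoded)
```

Every candidate decodes without an error, which is why the result has to be judged by someone who can recognise the expected text.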

The expectation of ESRI and Microsoft was (20+ years ago) that data would stay within the market in which the software product was sold. That is why Microsoft now uses UTF-8 (or derivatives) like macOS, Linux, iOS, Android, etc., and R on Windows from 4.2 uses Windows Universal C Runtime (UCRT) rather than CP and version-specific CRTs.

RJGrayEcology commented 1 year ago

I was able to mitigate my issue (Vietnamese characters) by adding options = "ENCODING=WINDOWS-1252" to the st_read() function

rsbivand commented 1 year ago

Maybe re-try with 1258: https://en.wikipedia.org/wiki/Windows-1258 ? You'd need to check whether the specific differences between CPs don't impact your interpretation of the data.
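To see why the choice between the two code pages matters, here is a base-R sketch that decodes three example bytes under each. The bytes are picked for illustration and are not taken from the data in question; it also assumes the local iconv knows the name "CP1258".

```r
# Three bytes that map to different characters in CP1252 and CP1258.
b <- rawToChar(as.raw(c(0xe3, 0xf5, 0xfd)))

as_1252 <- iconv(b, from = "CP1252", to = "UTF-8")  # Western European letters
as_1258 <- iconv(b, from = "CP1258", to = "UTF-8")  # Vietnamese letters

# Unicode code points of each interpretation, for comparison.
cp_1252 <- utf8ToInt(as_1252)
cp_1258 <- utf8ToInt(as_1258)

# With sf, the corresponding read might look like this (not run here; the
# file name is a placeholder and the accepted encoding names depend on the
# GDAL build):
# x <- sf::st_read("data.shp", options = "ENCODING=CP1258")
```

Both decodings succeed, so reading under CP1252 can look fine while silently substituting the wrong letters.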

r-spatial / sf

st_read() breaks with column names that have weird encodings #1427