I'm a bit at a loss as to whether this actually should work: rgdal::readOGR breaks on the same error, as does foreign::read.dbf("LAYER_FAVELAS_2015/DEINFO_FAVELAS_2015.dbf"). How do you set your locale? Can you read the dbf with read.dbf?
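A diagnostic sketch, assuming the paths from the question, for checking the locale and trying the DBF directly (on this file, read.dbf may fail with the same invalid multibyte string error):
Sys.getlocale()                      # report the current locale settings
library(foreign)
d <- read.dbf("LAYER_FAVELAS_2015/DEINFO_FAVELAS_2015.dbf", as.is = TRUE)  # keep strings as character
str(d)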
The DBF read into LibreOffice Calc is read with ISO-8859-1. Editing out the square metres character resolves the problem. I'll look to see where something might be done in rgdal::ogrInfo(). Further, the cpg file says System, which is wrong. With the original DBF:
> ogrInfo("DEINFO_FAVELAS_2015.shp")
Source: "/home/rsb/tmp/bigshape/LAYER_FAVELAS_2015/DEINFO_FAVELAS_2015.shp", layer: "DEINFO_FAVELAS_2015"
Driver: ESRI Shapefile; number of rows: 1677
Feature type: wkbPolygon with 2 dimensions
Extent: (315618.7 7358833) - (360628 7411543)
CRS: +proj=utm +zone=23 +south +ellps=aust_SA +towgs84=-66.87,4.37,-38.52,0,0,0,0 +units=m +no_defs
LDID: 0
Number of fields: 7
name type length typeName
1 ID 4 5 String
2 AREA_m\xb2 12 10 Integer64
3 NOME 4 100 String
4 DATULTATZ 9 10 Date
5 ENDERECO 4 200 String
6 NOMESEC 4 150 String
7 PROPRTR 4 100 String
Warning message:
In OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS = dumpSRS, :
Discarded datum South_American_Datum_1969 in CRS definition: +proj=utm +zone=23 +south +ellps=aust_SA +towgs84=-66.87,4.37,-38.52,0,0,0,0 +units=m +no_defs
I refer to https://cran.r-project.org/web/packages/rgdal/vignettes/OGR_shape_encoding.pdf, and, checking the encodings available via iconv on my system, find that:
library(rgdal)
getCPLConfigOption("SHAPE_ENCODING")            # check the current setting (typically unset)
setCPLConfigOption("SHAPE_ENCODING", "CP1250")  # tell GDAL's Shapefile driver the source code page
o <- ogrInfo("DEINFO_FAVELAS_2015.shp")
o
oo <- readOGR("DEINFO_FAVELAS_2015.shp")
setCPLConfigOption("SHAPE_ENCODING", NULL)      # reset afterwards
works. I'm unsure whether that code page is correct, but at least it doesn't break.
Thanks! After Sys.setenv("SHAPE_ENCODING" = "CP1250") I also get
> library(sf)
Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
> read_sf("/tmp/LAYER_FAVELAS_2015/")
Simple feature collection with 1677 features and 7 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 315618.7 ymin: 7358833 xmax: 360628 ymax: 7411543
projected CRS: SAD69 / UTM zone 23S
# A tibble: 1,677 x 8
ID AREA_m. NOME DATULTATZ ENDERECO NOMESEC PROPRTR
<chr> <dbl> <chr> <date> <chr> <chr> <chr>
1 1 23687 Parq… 2011-07-04 Rua Dom… Pirapo… Munici…
2 2 404 Vila… 2011-06-16 Rua Con… NA Munici…
3 3 486 Pedr… 2012-04-18 Avenida… NA Munici…
4 4 453 Tols… 2011-01-19 Avenida… NA NA
5 5 20843 Vila… 2011-08-11 R Júlio… Justin… Munici…
6 6 3416 Sant… 2011-08-11 Rua Dua… NA NA
7 7 1455 Esme… 2012-02-10 Avenida… NA Munici…
8 8 1658 Vila… 2010-11-23 Av. Pro… Vila D… Munici…
9 9 4155 Viel… 2011-06-16 Rua Ope… Viela … NA
10 10 6129 Mauro 2011-01-19 Avenida… Whitak… Munici…
# … with 1,667 more rows, and 1 more variable: geometry <MULTIPOLYGON [m]>
Would it make sense for st_read() to take an encoding option, then? Taking values like those of SHAPE_ENCODING, that is.
No, because the example shows how to use an environment variable. In rgdal, the problem was resolved twelve years ago by using GDAL's internal CPL variables, but at that time shapefiles still needed to be read and written often. ESRI Shapefiles are end-of-life, even for ESRI, so existing files should be read once and then written out, preferably as GeoPackage.
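A minimal sketch of that read-once-then-convert workflow with sf, assuming the CP1250 guess and the file names used above:
library(sf)
Sys.setenv(SHAPE_ENCODING = "CP1250")    # tell GDAL's Shapefile driver the source code page
x <- read_sf("LAYER_FAVELAS_2015/DEINFO_FAVELAS_2015.shp")
Sys.unsetenv("SHAPE_ENCODING")
write_sf(x, "DEINFO_FAVELAS_2015.gpkg")  # GeoPackage stores text as UTF-8; work from this file afterwards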
This is not practical at all. I have the same problem, and the thing is that I don't know the encoding used on the other computer. Furthermore, there are always old files that one may have to access in the future; just because you say everyone should switch to another format, preferably GeoPackage, doesn't mean the old files go away. Can't you just fix this and offer something like Python's parameters, e.g. char_decode_errors='replace', encoding='utf-8'? At least then people using your package wouldn't have to fix encoding problems themselves, or, as in your reply, be told there's no way to read the data anymore.
I have the same problem. Is there any fix for this??
Did you try the fix suggested above?
And please try to find out the originating encoding; the likelihood of introducing errors by guessing in software is much greater than for users, who should know something about the provenance of their data.
This is really only a problem for data originating as ESRI Shapefiles on older Windows systems. In https://en.wikipedia.org/wiki/Windows_code_page (see also https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers), you'll see some candidates. Given your engagement in various parts of the world, you'll need to work backwards to the possible source code pages (https://www.rjgrayecology.com/about.html#/). One possibility is to examine any *.cpg file provided in the multi-file ESRI Shapefile bundle you have received (see the sketch below). Another is to try to read the *.dbf file into a GUI spreadsheet; for me, LibreOffice Calc starts with a list of possible input encodings:
Reading point.zip as Western Europe (ISO-8859-1), aka Latin-1, gives a result which is wrong. Reading it as Eastern Europe (Windows-1250/WinLatin 2) gives a result which is correct. (This is the example from the defunct rgdal vignette.)
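For the *.cpg suggestion, the sidecar file (when present) can be read directly from R; a minimal sketch, with the file name assumed to match the DBF discussed earlier in this thread:
# A .cpg sidecar, if supplied, is a one-line file naming the DBF's code page;
# for the file above it only said "System", which does not help
readLines("LAYER_FAVELAS_2015/DEINFO_FAVELAS_2015.cpg")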
Once you think that you have a candidate, use the suggestions above to try:
sf::gdal_utils("ogrinfo", "point.shp", options="-al")
giving for me (rendered on a UTF-8 platform):
NAZEV (String) = St°Ýte× nad Ludinou
and with Sys.setenv("SHAPE_ENCODING" = "CP1250"):
NAZEV (String) = Střítež nad Ludinou
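If more than one code page seems plausible, a small comparison loop helps; this is a sketch using the point.shp example and its NAZEV field, with the candidate list an assumption:
# Sketch: render one text field under several candidate encodings and compare by eye
candidates <- c("CP1250", "CP1252", "ISO-8859-2")
for (enc in candidates) {
  Sys.setenv(SHAPE_ENCODING = enc)   # picked up by GDAL's Shapefile driver at each open
  x <- sf::read_sf("point.shp")
  cat(enc, ":", x$NAZEV[1], "\n")
}
Sys.unsetenv("SHAPE_ENCODING")       # clean up afterwards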
The expectation of ESRI and Microsoft (20+ years ago) was that data would stay within the market in which the software product was sold. That is why Microsoft now uses UTF-8 (or derivatives), like macOS, Linux, iOS, Android, etc., and why R on Windows from 4.2 uses the Windows Universal C Runtime (UCRT) rather than code-page- and version-specific CRTs.
I was able to mitigate my issue (Vietnamese characters) by adding options = "ENCODING=WINDOWS-1252" to the st_read() call.
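For example (a sketch; the path is a placeholder, and see the note below that Windows-1258 may suit Vietnamese better):
library(sf)
# ENCODING is an open option of GDAL's ESRI Shapefile driver and overrides its detection
x <- st_read("my_layer.shp", options = "ENCODING=WINDOWS-1252")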
Maybe re-try with 1258: https://en.wikipedia.org/wiki/Windows-1258? You'd need to check that the specific differences between the code pages don't affect your interpretation of the data.
I tried to import the following shapefile using st_read: http://dados.prefeitura.sp.gov.br/dataset/b7242ef1-3add-4ce9-8e74-af7f9288762a/resource/84055ad9-49d6-46dc-af08-573ae1012d48/download/layerfavelas2015.zip. It broke with an error (sorry for the non-English error message, but it's clearly just saying "invalid multibyte string"). I assumed this had something to do with encoding, so I opened the shapefile in QGIS, changed the column name with the weird character sequence, and st_read imported it fine. I wonder if st_read() should have a locale option, though? Or maybe one already exists and I didn't realize?