hotosm / osm-export-tool-python

command line tool + Python library for exporting OSM in various file formats.
BSD 3-Clause "New" or "Revised" License

UTF-8 invalid bytes in clipping GeoJSON #14

Closed vdveen closed 5 years ago

vdveen commented 5 years ago

It seems that whenever a GeoJSON file contains bytes that are not valid UTF-8, the tool fails to run. This does not seem to be the case with the online tool; the same GeoJSON file works fine there. The tool also works when I remove these characters by hand from the GeoJSON; this is the _stripped file in the output below.

Error:

overmorgen_anne:~/osm_export$ osm-export-tool osm_inputs/AL-latest.osm.pbf outputs.shp -f shp -m forest.yaml --clip nuts_id_AL0.geojson

Traceback (most recent call last):
  File "/home/overmorgen_anne/.local/bin/osm-export-tool", line 6, in <module>
    main()
  File "/home/overmorgen_anne/.local/lib/python3.6/site-packages/osm_export_tool/cmd.py", line 35, in main
    clipping_geom = load_geometry(f.read())
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcb in position 291: invalid continuation byte

overmorgen_anne:~/osm_export$ osm-export-tool osm_inputs/AL-latest.osm.pbf outputs.shp -f shp -m forest.yaml --clip nuts_id_AL0_stripped.geojson

Warning: using first feature of --clip FeatureCollection.
Completed in 16.480865716934204 seconds.

Top part of the GeoJSON; the invalid byte follows the P in both name_latn and nuts_name:

{
"type": "FeatureCollection",
"name": "nuts_id_AL0",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [ 
{ "type": "Feature", "properties": { "id": 58, "objectid": 58.0, "levl_code": 1, "nuts_id": "AL0", "cntr_code": "AL", 
"name_latn": "SHQIP�RIA", "nuts_name": "SHQIP�RIA", "name_asci": "SHQIPERIA"

GeoJSON used.

GeoJSON without invalid characters used.

bdon commented 5 years ago

I'm not able to reproduce the issue with the first GeoJSON you uploaded - "nuts_id_AL0.txt". Is it possible that when saving it as a .txt your editor converted it to valid UTF-8?

vdveen commented 5 years ago

That is entirely possible, yes. Here's a zip with the geojson used, unaltered. nuts_id_AL0.zip

bdon commented 5 years ago

I still can't reproduce it with this file. Can you try validating the JSON that won't work on a site like https://onlineutf8tools.com/validate-utf8 to ensure that it is valid utf-8?
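As an offline alternative to a web validator, a check along these lines can be run locally. This is only a minimal sketch: the helper name first_invalid_utf8 is hypothetical and not part of osm-export-tool. It reads the file as raw bytes, so an editor or the clipboard cannot silently re-encode the content before it is checked.

```python
import sys


def first_invalid_utf8(data: bytes):
    """Return (offset, byte) of the first invalid UTF-8 byte, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read in binary mode so no decoding happens before the check.
    with open(sys.argv[1], "rb") as f:
        result = first_invalid_utf8(f.read())
    if result is None:
        print("valid UTF-8")
    else:
        offset, byte = result
        print(f"invalid byte 0x{byte:02x} at position {offset}")
```

Saved under a name of your choice, it takes the path of the file to check as its only argument.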

vdveen commented 5 years ago

Weird... I am able to reproduce it with the above uploaded file, see _ul.geojson below.

overmorgen_anne@Delft:~/osm_export$ osm-export-tool osm_inputs/AL-latest.osm.pbf outputs/bug.shp -f shp -m forest.yaml --clip nuts_geb_json/nuts_id_AL0_ul.geojson
Traceback (most recent call last):
  File "/home/overmorgen_anne/.local/bin/osm-export-tool", line 6, in <module>
    main()
  File "/home/overmorgen_anne/.local/lib/python3.6/site-packages/osm_export_tool/cmd.py", line 35, in main
    clipping_geom = load_geometry(f.read())
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcb in position 291: invalid continuation byte

Copying the text from the GeoJSON into your tool produces no invalid characters. Here's what it looks like in VS Code:

[Screenshot: 20191030 -Window]

Copying the selected text into the validator gives no issues: [Screenshot: 20191030 -Window 2]

But the example invalid code also doesn't work properly, so maybe that tool isn't entirely working as intended: [Screenshot: 20191030 -Window 3]

Below is what I copied. The first <?> in name_latn matches exactly with position 291 mentioned in the error log above, by the way.

{
"type": "FeatureCollection",
"name": "nuts_id_AL0",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "id": 58, "objectid": 58.0, "levl_code": 1, "nuts_id": "AL0", "cntr_code": "AL", "name_latn": "SHQIP�RIA", "nuts_name": "SHQIP�RIA", "name_asci": "SHQIPERIA", "name_html": "SHQIP&#x00CB;RIA", "name_engl": "Albania", "name_fren": "Albanie", "iso3_code": "ALB", "svrg_un": "UN Member State", "capt": 

bdon commented 5 years ago

@vdveen can you try opening the invalid file with a simple python script like this:

import sys

with open(sys.argv[1], 'r') as f:  # no encoding= given, so the locale default is used
    print(f.read())

and see if that raises the exception? If so, it may indicate an issue with your Python installation or system locale.

vdveen commented 5 years ago

Calling that code (saved as py_opener.py) with the regular python command (Python 2) is fine, but with Python 3 it gives the same error:

overmorgen_anne@Delft:~/osm_export$ python3 py_opener.py nuts_geb_json/nuts_id_AL0.geojson
Traceback (most recent call last):
  File "py_opener.py", line 4, in <module>
    print(f.read())
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcb in position 291: invalid continuation byte

I updated my Python to 3.7, but this did not change anything. What I did find out is that the error-throwing GeoJSON file is encoded as ISO-8859-1 and not UTF-8. When I open the file in VS Code, save it with UTF-8 encoding, and re-run the above, the error goes away. When I turn it back to ISO-8859-1 the error reappears, as the open() command assumes the locale-defined encoding unless otherwise specified:

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

Perhaps you have a different locale defined?
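To make the mechanism concrete, the sketch below reproduces the error without any file on disk. It assumes, per the finding above, that the file is ISO-8859-1, where the single byte 0xCB encodes Ë. The sample bytes are a shortened stand-in for the real GeoJSON, so the reported position differs from 291.

```python
# "SHQIP\u00cbRIA" (U+00CB is E with diaeresis) as stored in an ISO-8859-1
# file: the character occupies the single byte 0xCB.
raw = "SHQIP\u00cbRIA".encode("iso-8859-1")

# Decoding with the matching codec works.
as_latin1 = raw.decode("iso-8859-1")

# UTF-8 treats 0xCB as the lead byte of a two-byte sequence. The next byte,
# "R" (0x52), is not a continuation byte, hence the error in the traceback.
try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as err:
    reason = err.reason

print(raw)
print(reason)
```

At the file level, the equivalent of the successful decode is passing encoding="iso-8859-1" to open() instead of relying on the locale default.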

This does point to a solution though; if I write a script to force all file encoding to be UTF-8, or if I change my preferred encoding, it should work. Unless you can/want to do something about encoding in the script, this issue can probably be closed, right?
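A re-encoding script of the kind described could look like the sketch below. It assumes the source encoding is already known to be ISO-8859-1; the function name transcode_to_utf8 and the file names are hypothetical, and the demonstration runs on a temporary file rather than the real GeoJSON.

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Rewrite src_path as UTF-8 at dst_path, decoding it with src_encoding."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "clip_latin1.geojson")
    dst = os.path.join(tmp, "clip_utf8.geojson")
    with open(src, "wb") as f:
        f.write(b'{"name_latn": "SHQIP\xcbRIA"}')
    transcode_to_utf8(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()

print(converted)
```

The converted file is what would then be passed to --clip.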

bdon commented 5 years ago

I think it's reasonable to demand that the user only inputs UTF-8 files. This is the assumption for Python version 3 - Python 2 will be officially EOL in 2 months. Otherwise we will have to put in special code to check for any number of different encodings.
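For illustration only, the kind of special-case code this would mean might look like the sketch below. It is not something osm-export-tool does, and decode_with_fallback is a hypothetical name. ISO-8859-1 maps every possible byte to a character, so it never fails and has to come last; a match on it is a guess rather than a detection, which is part of why such code is unattractive to maintain.

```python
def decode_with_fallback(data, encodings=("utf-8", "iso-8859-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback(b"SHQIP\xcbRIA")
print(used)
```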

It makes sense, though, that no error is thrown in Python 2: its open() returns raw byte strings and never decodes them, so an invalid UTF-8 byte goes unnoticed.