Closed vdveen closed 5 years ago
I'm not able to reproduce the issue with the first GeoJSON you uploaded - "nuts_id_AL0.txt". Is it possible that when saving it as a .txt your editor converted it to valid UTF-8?
That is entirely possible, yes. Here's a zip with the geojson used, unaltered. nuts_id_AL0.zip
I still can't reproduce it with this file. Can you try validating the JSON that won't work on a site like https://onlineutf8tools.com/validate-utf8 to ensure that it is valid utf-8?
Weird... I am able to reproduce it with the above uploaded file, see _ul.geojson below.
overmorgen_anne@Delft:~/osm_export$ osm-export-tool osm_inputs/AL-latest.osm.pbf outputs/bug.shp -f shp -m forest.yaml --clip nuts_geb_json/nuts_id_AL0_ul.geojson
Traceback (most recent call last):
  File "/home/overmorgen_anne/.local/bin/osm-export-tool", line 6, in <module>
    main()
  File "/home/overmorgen_anne/.local/lib/python3.6/site-packages/osm_export_tool/cmd.py", line 35, in main
    clipping_geom = load_geometry(f.read())
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcb in position 291: invalid continuation byte
Copying the text from the GeoJSON into your tool produces no invalid characters. Here's what it looks like in VS Code:
Copying the selected text into the validator gives no issues:
But the example invalid code also doesn't work properly, so maybe that tool isn't entirely working as intended:
Below is what I copied. The first <?> in name_latn matches exactly with position 291 mentioned in the error log above, by the way.
{
"type": "FeatureCollection",
"name": "nuts_id_AL0",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "id": 58, "objectid": 58.0, "levl_code": 1, "nuts_id": "AL0", "cntr_code": "AL", "name_latn": "SHQIP�RIA", "nuts_name": "SHQIP�RIA", "name_asci": "SHQIPERIA", "name_html": "SHQIPËRIA", "name_engl": "Albania", "name_fren": "Albanie", "iso3_code": "ALB", "svrg_un": "UN Member State", "capt":
@vdveen can you try opening the invalid file with a simple python script like this:
import sys

with open(sys.argv[1], 'r') as f:
    print(f.read())
and see if that raises the exception? If so, it may indicate an issue with your python installation or system locale
Calling that code (saved as py_opener.py) with regular Python (Python 2) works fine, but with Python 3 it gives the same error:
overmorgen_anne@Delft:~/osm_export$ python3 py_opener.py nuts_geb_json/nuts_id_AL0.geojson
Traceback (most recent call last):
  File "py_opener.py", line 4, in <module>
    print(f.read())
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcb in position 291: invalid continuation byte
I updated my Python to 3.7, but this did not change anything. What I did find out was that the error-throwing GeoJSON file is encoded as ISO-8859-1 and not UTF-8. When I open the file in VS Code, save it with UTF-8 encoding, and re-run the above, the error goes away. When I turn it back to ISO-8859-1 the error reappears, as the open() call assumes the locale-defined encoding unless otherwise specified:
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
Perhaps you have a different locale defined?
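For reference, a minimal sketch of passing the encoding to open() explicitly instead of relying on the locale default. The sample bytes and the temporary file are stand-ins for the real GeoJSON (assumptions for illustration only), and this is not how osm-export-tool itself reads the clip file.

```python
import locale
import os
import tempfile

# Stand-in for the problem file: "SHQIPËRIA" written as ISO-8859-1,
# so the Ë is stored as the single byte 0xCB.
raw = '{"name_latn": "SHQIP\u00cbRIA"}'.encode("iso-8859-1")

with tempfile.NamedTemporaryFile(suffix=".geojson", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# This is the encoding open() falls back to when none is given.
print("locale default:", locale.getpreferredencoding(False))

# Reading as UTF-8 fails on the 0xCB byte.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("utf-8 read failed:", exc)

# Naming the real encoding makes the same file readable.
with open(path, "r", encoding="iso-8859-1") as f:
    text = f.read()

os.remove(path)
print(text)
```

Passing encoding= makes the behaviour independent of the machine's locale, which is probably why the same file can behave differently on two systems.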
This does point to a solution though; if I write a script to force all file encoding to be UTF-8, or if I change my preferred encoding, it should work. Unless you can/want to do something about encoding in the script, this issue can probably be closed, right?
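A rough sketch of what such a conversion script could look like. The function name reencode_to_utf8 and the default source encoding are hypothetical; it assumes you already know the file is ISO-8859-1, since that encoding will decode any byte sequence without complaint and so cannot be detected by trial and error.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Read src using source_encoding and write the same text to dst as UTF-8.

    Hypothetical helper: source_encoding must be known in advance.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return dst


if __name__ == "__main__":
    import sys

    # Usage: python reencode.py input.geojson output.geojson
    if len(sys.argv) == 3:
        reencode_to_utf8(sys.argv[1], sys.argv[2])
```

The converted file can then be passed to --clip in place of the original.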
I think it's reasonable to demand that the user only inputs UTF-8 files. This is the assumption for Python version 3 - Python 2 will be officially EOL in 2 months. Otherwise we will have to put in special code to check for any number of different encodings.
It makes sense, though, that no error is thrown in Python 2, because there open() returns the file contents as a plain byte string and never tries to decode them.
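To illustrate the difference, a small sketch with made-up sample bytes (not the actual file): holding the data as bytes never raises, while decoding as UTF-8 does, and the exception reports the offset of the offending byte, which is the "position" number in the traceback.

```python
# Stand-in bytes: ISO-8859-1 "Ë" is the single byte 0xCB.
data = b'{"name_latn": "SHQIP\xcbRIA"}'

# As raw bytes (what Python 2's open() hands back) there is nothing to fail.
assert isinstance(data, bytes)

# Decoding as UTF-8 (what Python 3's text-mode open() does here) fails,
# because 0xCB starts a two-byte sequence and "R" is not a continuation byte.
bad_offset = None
bad_byte = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_offset = exc.start          # offset reported as "position N"
    bad_byte = data[exc.start]      # the byte that could not be decoded
    print(exc)

print("offending byte:", hex(bad_byte), "at offset", bad_offset)
```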
It seems that whenever a GeoJSON contains characters that aren't valid UTF-8, the code won't run. This does not seem to be the case with the online tool; using the same GeoJSON file works fine. The code also works fine when I remove these characters by hand from the GeoJSON; this is the _stripped file in the output below.
Error:
Top part of the GeoJSON, with the invalid byte in name_latn and nuts_name after the P:
GeoJSON used.
GeoJSON without invalid characters used.