stuartemiddleton / geoparsepy

geoparsepy is a Python geoparsing library that will extract and disambiguate locations from text. It uses a local OpenStreetMap database which allows very high and unlimited geoparsing throughput, unlike approaches that use a third-party geocoding service (e.g. Google Geocoding API). this repository holds Python examples to use the PyPI library.
Other
54 stars 4 forks source link

possible installation problem with data files encoding #7

Closed ogorun closed 2 years ago

ogorun commented 3 years ago

There is a possible problem with either database dump files (encoding?) or installation instructions. (sorry if it's better suited to troubleshooting questions than issues, I didn't find another way to communicate)

Attempt to import database files

psql -h localhost -U postgres -d openstreetmap -f global_cities.sql

leads to many syntax errors like

psql:global_cities.sql:81540: ERROR: syntax error at or near "章丘" LINE 1: 章丘", "wikidata"=>"Q197392", "wikipedia"=>"en:Zhangqiu Di...

Installation/environment notes:

$ cat /etc/issue
Ubuntu 18.04.5 LTS \n \l
$ echo LANG
en_US.UTF-8

Since the dump files were created with postgres 11.3, it was installed with the following Docker file

FROM postgres:11.3

RUN apt-get update \
    && apt-get install wget -y \
    && apt-get install postgresql-11-postgis-3 -y \
    && apt-get install postgresql-11-postgis-3-scripts -y \
    && apt-get install postgis -y

COPY ./db.sql /docker-entrypoint-initdb.d/

# docker run --name pg-docker -e POSTGRES_PASSWORD=docker -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data postres_posgis11:latest
#psql -h localhost -U postgres -d openstreetmap -f global_cities.sql
#psql -h localhost -U postgres -d openstreetmap -f uk_places.sql
#psql -h localhost -U postgres -d openstreetmap -f north_america_places.sql
#psql -h localhost -U postgres -d openstreetmap -f europe_places.sql

(base) olga@anyclt104:~/workspace/edison/geoparsy_test/external/db$ cat db.sql CREATE DATABASE openstreetmap; CREATE EXTENSION IF NOT EXISTS postgis; CREATE EXTENSION IF NOT EXISTS fuzzystrmatch; CREATE EXTENSION IF NOT EXISTS postgis_tiger_geocoder; CREATE EXTENSION IF NOT EXISTS hstore;

stuartemiddleton commented 3 years ago

Others have got this working OK, so I am confident the files are fine. Looks like it might be the BOM character causing your problem.

see this closed issue which might help https://github.com/stuartemiddleton/geoparsepy/issues/5

Also check the file is downloaded correctly [SQL dump is a 1.2 GB tar/zip file created using pg_dump and zipped using 7Zip tool]

ogorun commented 3 years ago

@stuartemiddleton

Thank you for response. I indeed see 1.2G archive and the files were extracted without error messages. I tried solution with BOM from #5 (sorry that I didn't mention it previously) and it didn't help. I asked one more person to follow your installation instractions with the same result - syntax errors (he tried on Ubuntu 20.04, I have Ubuntu 18.04).

May it be an OS-related problem? Can you share the OS version and language environment settings you used?

stuartemiddleton commented 3 years ago

Python 3.7, Postgres 11.3 (UTF-8), tested on Windows 10 and Ubuntu 18.04 LTS.

You are using docker which I do not. Perhaps the SQL file being converted (causing BOM corruption) during upload to the container? I vaguely remember having issues years ago with docker and UTF-8 files (but I cannot remember any more details). Might be an issue if the host machine has a different OS to the docker container.

I suggest you inspect the initial byte profile in the UTF-8 SQL file on the docker container, and see exactly whats going on.

ogorun commented 3 years ago

@stuartemiddleton ,

I changed the Postgres installation to one without docker but the result didn't change. One interesting thing I noted in dump files is the collation used for DB creation.

CREATE DATABASE openstreetmap WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'English_United States.1252' LC_CTYPE = 'English_United States.1252';

It looks strange. Windows-1252 is not UTF-8 encoding. How is it defined? Which locale can I use in Linux (Ubuntu) to create such collation in Postgres?

stuartemiddleton commented 3 years ago

I did a unit test on Ubuntu 20.04 LTS (clean install using sql files from google drive) and it worked OK.

I did change user 'sem' to 'sem03' using below commands, but this will not impact your docker encoding issue.

find -name \.sql -exec sed -i "s/TO sem;/TO sem03;/g" {} \; psql -d openstreetmap -f uk_places.sql psql -d openstreetmap -f global_cities.sql psql -d openstreetmap -f north_america_places.sql psql -d openstreetmap -f europe_places.sql

I can confirm the encoding in sql dump worked fine for me (see below its as you have it).

CREATE DATABASE openstreetmap WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'English_United States.1252' LC_CTYPE = 'English_United States.1252';

The only thing I can think of if your docker environment is somehow changing the encoding on upload is to (a) work out what the encoding actually is in your docker version of the sql file then (b) change it using iconv - see below for example converting UTF-16 to UTF-8

iconv -f UTF-16LE -t UTF-8 uk_places.sql -o uk_places_utf8.sql

ogorun commented 3 years ago

@stuartemiddleton , thank you for your response. I just saw the message.

I don't use docker anymore, it's not relevant. Trying on Ubuntu 20.04

olga@AlgoServerSmall:/media/data/osm$ cat /etc/issue
Ubuntu 20.04.2 LTS \n \l

CREATE DATABASE openstreetmap_test WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'English_United States.1252' LC_CTYPE = 'English_United States.1252';

leads to ERROR: invalid locale name: "English_United States.1252"

olga@AlgoServerSmall:/media/data/osm$ sudo locale-gen 'English_United States.1252'
[sudo] password for olga: 
Error: 'English_United States.1252' is not a supported language or locale
olga@AlgoServerSmall:/media/data/osm$ locale -a
C
C.UTF-8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US
en_US.iso88591
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
he_IL.utf8
POSIX
ru_RU.utf8
ru_UA.utf8

Is it possible that you use 32-bit Ubuntu? Could you put here output of

uname -a locale -a

SELECT * FROM pg_collation;

?

stuartemiddleton commented 3 years ago

its 64 bit standard install. data below you asked for.

uname -a Linux UOS-19090 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

locale -a C C.UTF-8 en_AG en_AG.utf8 en_AU.utf8 en_BW.utf8 en_CA.utf8 en_DK.utf8 en_GB.utf8 en_HK.utf8 en_IE.utf8 en_IL en_IL.utf8 en_IN en_IN.utf8 en_NG en_NG.utf8 en_NZ.utf8 en_PH.utf8 en_SG.utf8 en_US.utf8 en_ZA.utf8 en_ZM en_ZM.utf8 en_ZW.utf8 POSIX

SELECT * FROM pg_collation; oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion -------+------------------------+---------------+-----------+--------------+---------------------+--------------+------------------+------------------+------------- 100 | default | 11 | 10 | d | t | -1 | | | 950 | C | 11 | 10 | c | t | -1 | C | C | 951 | POSIX | 11 | 10 | c | t | -1 | POSIX | POSIX | 12326 | ucs_basic | 11 | 10 | c | t | 6 | C | C | 12327 | C.UTF-8 | 11 | 10 | c | t | 6 | C.UTF-8 | C.UTF-8 | 12328 | en_AG | 11 | 10 | c | t | 6 | en_AG | en_AG | 12329 | en_AG.utf8 | 11 | 10 | c | t | 6 | en_AG.utf8 | en_AG.utf8 | 12330 | en_AU.utf8 | 11 | 10 | c | t | 6 | en_AU.utf8 | en_AU.utf8 | 12331 | en_BW.utf8 | 11 | 10 | c | t | 6 | en_BW.utf8 | en_BW.utf8 | 12332 | en_CA.utf8 | 11 | 10 | c | t | 6 | en_CA.utf8 | en_CA.utf8 | 12333 | en_DK.utf8 | 11 | 10 | c | t | 6 | en_DK.utf8 | en_DK.utf8 | 12334 | en_GB.utf8 | 11 | 10 | c | t | 6 | en_GB.utf8 | en_GB.utf8 | 12335 | en_HK.utf8 | 11 | 10 | c | t | 6 | en_HK.utf8 | en_HK.utf8 | 12336 | en_IE.utf8 | 11 | 10 | c | t | 6 | en_IE.utf8 | en_IE.utf8 | 12337 | en_IL | 11 | 10 | c | t | 6 | en_IL | en_IL | 12338 | en_IL.utf8 | 11 | 10 | c | t | 6 | en_IL.utf8 | en_IL.utf8 | 12339 | en_IN | 11 | 10 | c | t | 6 | en_IN | en_IN | 12340 | en_IN.utf8 | 11 | 10 | c | t | 6 | en_IN.utf8 | en_IN.utf8 | 12341 | en_NG | 11 | 10 | c | t | 6 | en_NG | en_NG | 12342 | en_NG.utf8 | 11 | 10 | c | t | 6 | en_NG.utf8 | en_NG.utf8 | 12343 | en_NZ.utf8 | 11 | 10 | c | t | 6 | en_NZ.utf8 | en_NZ.utf8 | 12344 | en_PH.utf8 | 11 | 10 | c | t | 6 | en_PH.utf8 | en_PH.utf8 | 12345 | en_SG.utf8 | 11 | 10 | c | t | 6 | en_SG.utf8 | en_SG.utf8 | 12346 | en_US.utf8 | 11 | 10 | c | t | 6 | en_US.utf8 | en_US.utf8 | 12347 | en_ZA.utf8 | 11 | 10 | c | t | 6 | en_ZA.utf8 | en_ZA.utf8 | 12348 | en_ZM | 11 | 10 | c | t | 6 | en_ZM | en_ZM | 12349 | en_ZM.utf8 | 11 | 10 | c | t | 6 | en_ZM.utf8 | en_ZM.utf8 | 12350 | en_ZW.utf8 | 11 | 10 | c | t | 6 | en_ZW.utf8 | en_ZW.utf8 | 12351 | en_AU | 11 | 10 | c | t | 6 | en_AU.utf8 | en_AU.utf8 | 12352 | en_BW | 11 | 10 | c | t | 6 | en_BW.utf8 | en_BW.utf8 | 12353 | en_CA | 11 | 10 | c | t | 6 | en_CA.utf8 | en_CA.utf8 | 12354 | en_DK | 11 | 10 | c | t | 6 | en_DK.utf8 | en_DK.utf8 | 12355 | en_GB | 11 | 10 | c | t | 6 | en_GB.utf8 | en_GB.utf8 | 12356 | en_HK | 11 | 10 | c | t | 6 | en_HK.utf8 | en_HK.utf8 | 12357 | en_IE | 11 | 10 | c | t | 6 | en_IE.utf8 | en_IE.utf8 | 12358 | en_NZ | 11 | 10 | c | t | 6 | en_NZ.utf8 | en_NZ.utf8 | 12359 | en_PH | 11 | 10 | c | t | 6 | en_PH.utf8 | en_PH.utf8 | 12360 | en_SG | 11 | 10 | c | t | 6 | en_SG.utf8 | en_SG.utf8 | 12361 | en_US | 11 | 10 | c | t | 6 | en_US.utf8 | en_US.utf8 | 12362 | en_ZA | 11 | 10 | c | t | 6 | en_ZA.utf8 | en_ZA.utf8 | 12363 | en_ZW | 11 | 10 | c | t | 6 | en_ZW.utf8 | en_ZW.utf8 |

ogorun commented 3 years ago

Thank you

liuwang commented 2 years ago

Did you solve this problem?

frwyatt commented 2 years ago

I also had this issue installing to Linux. I was able to resolve it by changing the following line in all of the .sql files

CREATE DATABASE openstreetmap_test WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'English_United States.1252' LC_CTYPE = 'English_United States.1252';

I changed this to:

CREATE DATABASE openstreetmap_test WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'en_GB.utf8' LC_CTYPE = 'en_GB.utf8';

where I found the encoding used to create the initial postgres database, which in my case was 'en_GB.utf8'. Once I had made this change (and changed the user name to my username following the install instructions) the installation worked.