openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Parser setup fails with Docker on Windows #175

Closed BenK10 closed 7 years ago

BenK10 commented 7 years ago

Since I'm running a Windows machine, I've been using libpostal inside a Docker container with a Debian image for the last few weeks. This worked fine until I rebuilt the image with the latest libpostal release on Friday, April 7, 2017. Now when I run the address parser command-line tool, I get the following error:

could not find parser model file of known type at address_parser_load (address_parser.c:208) errno: no such file or directory

It does not say which file is missing.

albarrentine commented 7 years ago

Is this the very latest checkout? I added a commit recently to help with upgrades.

Sounds like the model files didn't download. Ensure that the datadir you specified during configure has at least 1.8GB of disk space free. When you ran make it should have downloaded the new model files. If that didn't happen, try removing the previous datadir and running make again.
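A minimal sketch of that disk-space check, assuming the datadir is /opt/libpostal_data (the path used later in this thread) and treating 1.8GB as roughly 1800 * 1024 KB. The helper names free_kb and has_free_kb are made up for illustration and are not part of libpostal:

```shell
#!/usr/bin/env bash
# Sketch only: check free space on the filesystem that holds the datadir.
# free_kb and has_free_kb are hypothetical helper names, not libpostal tools.

# Print the free kilobytes on the filesystem holding the given path.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the path has at least the requested number of free kilobytes.
has_free_kb() {
    local path=$1 required_kb=$2
    local available_kb
    available_kb=$(free_kb "$path")
    [ -n "$available_kb" ] || return 1
    [ "$available_kb" -ge "$required_kb" ]
}

DATADIR=${DATADIR:-/opt/libpostal_data}   # assumed datadir passed to configure
REQUIRED_KB=$((1800 * 1024))              # roughly 1.8GB

if [ -d "$DATADIR" ] && ! has_free_kb "$DATADIR" "$REQUIRED_KB"; then
    echo "Less than 1.8GB free for $DATADIR" >&2
fi

# If space is fine but the models are still missing, the suggestion above is:
#   rm -rf "$DATADIR" && make
```

Inside a container this reports the space visible to the container's filesystem, which may be smaller than what the host has free.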

BenK10 commented 7 years ago

To add a little more detail, I use a Dockerfile to build the latest libpostal inside a Docker container. It clones the latest libpostal release from GitHub and then runs a script invoking make:

FROM ubuntu:16.04
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
    curl autoconf automake libtool pkg-config \
    git

WORKDIR /
RUN git clone https://github.com/openvenues/libpostal
WORKDIR /libpostal
COPY ./build_libpostal.sh .
RUN ./build_libpostal.sh

build_libpostal.sh:

#!/usr/bin/env bash
./bootstrap.sh
mkdir -p /opt/libpostal_data
./configure --datadir=/opt/libpostal_data
make
make install
ldconfig

As of April 10, 2017 I'm still having the same issue even with the latest release.

albarrentine commented 7 years ago

Ok, so I just created a Docker container on my Mac, ran those same commands, and could not replicate the issue. Does the container have enough memory and is there sufficient disk space on the machine? The new version should actually require slightly less memory/disk than the previous one, so if it's the same container, that shouldn't be an issue.

If everything's working correctly, toward the end of make, you should see something like:

./libpostal_data download all /opt/libpostal_data/libpostal
Old version of datadir detected, removing...
Checking for new libpostal data file...
New libpostal data file available
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9906k  100 9906k    0     0  5005k      0  0:00:01  0:00:01 --:--:-- 5003k
address_expansions/
address_expansions/address_dictionary.dat
numex/
numex/numex.dat
transliteration/
transliteration/transliteration.dat
Checking for new libpostal parser data file...
New libpostal parser data file available
Downloading multipart: http://libpostal.s3.amazonaws.com/models/address_parser/2017-03-04/parser.tar.gz, size=752483239, num_chunks=11
Downloading part 2: filename=/opt/libpostal_data/libpostal/parser.tar.gz.2, offset=67108864, max=134217727
Downloading part 1: filename=/opt/libpostal_data/libpostal/parser.tar.gz.1, offset=0, max=67108863
Downloading part 3: filename=/opt/libpostal_data/libpostal/parser.tar.gz.3, offset=134217728, max=201326591
Downloading part 4: filename=/opt/libpostal_data/libpostal/parser.tar.gz.4, offset=201326592, max=268435455
Downloading part 5: filename=/opt/libpostal_data/libpostal/parser.tar.gz.5, offset=268435456, max=335544319
Downloading part 6: filename=/opt/libpostal_data/libpostal/parser.tar.gz.6, offset=335544320, max=402653183
Downloading part 7: filename=/opt/libpostal_data/libpostal/parser.tar.gz.7, offset=402653184, max=469762047
Downloading part 8: filename=/opt/libpostal_data/libpostal/parser.tar.gz.8, offset=469762048, max=536870911
Downloading part 9: filename=/opt/libpostal_data/libpostal/parser.tar.gz.9, offset=536870912, max=603979775
Downloading part 10: filename=/opt/libpostal_data/libpostal/parser.tar.gz.10, offset=603979776, max=671088639
Downloading part 11: filename=/opt/libpostal_data/libpostal/parser.tar.gz.11, offset=671088640, max=752483239
address_parser/
address_parser/address_parser_crf.dat
address_parser/address_parser_phrases.dat
address_parser/address_parser_postal_codes.dat
address_parser/address_parser_vocab.trie
Checking for new libpostal language classifier data file...
New libpostal language classifier data file available
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 48.0M  100 48.0M    0     0  7289k      0  0:00:06  0:00:06 --:--:-- 8041k
language_classifier/
language_classifier/language_classifier.dat

If not, the new models weren't downloaded and it's probably related to disk space.

BenK10 commented 7 years ago

The models indeed don't download. Make reports that they are up to date. This is odd because it's a new Docker container with a new image. The models shouldn't even exist.

Should have plenty of disk and memory space. It all worked before.

albarrentine commented 7 years ago

So if the datadir existed, it's plausible that the source dir already existed, in which case I think git clone is basically a no-op. If that's true, you wouldn't have the latest commit which fixes upgrades for existing datadirs from v0. In your Dockerfile you may want to rm -rf /libpostal to be sure that it's a fresh checkout or run a git checkout tags/v1.0.0 after cloning the repo.

albarrentine commented 7 years ago

Still any issues or can this be closed?

BenK10 commented 7 years ago

I'm still working on it. Upon closer examination, I found that the data files actually do download but then there is another execution of libpostal_data download all for them that downloads nothing. If the address parser doesn't work even though the models download, then maybe they are somehow getting clobbered? Here's some output from make with compilation instructions removed:

./libpostal_data download all /opt/libpostal_data/libpostal
Old version of datadir detected, removing...
Checking for new libpostal data file...
New libpostal data file available
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9906k  100 9906k    0     0  2707k      0  0:00:03  0:00:03 --:--:-- 2706k
address_expansions/
address_expansions/address_dictionary.dat
numex/
numex/numex.dat
transliteration/
transliteration/transliteration.dat
Checking for new libpostal parser data file...
New libpostal parser data file available
Downloading multipart: http://libpostal.s3.amazonaws.com/models/address_parser/2017-03-04/parser.tar.gz, size=752483239, num_chunks=11
Downloading part 1: filename=/opt/libpostal_data/libpostal/parser.tar.gz.1, offset=0, max=67108863
Downloading part 2: filename=/opt/libpostal_data/libpostal/parser.tar.gz.2, offset=67108864, max=134217727
Downloading part 3: filename=/opt/libpostal_data/libpostal/parser.tar.gz.3, offset=134217728, max=201326591
Downloading part 4: filename=/opt/libpostal_data/libpostal/parser.tar.gz.4, offset=201326592, max=268435455
Downloading part 5: filename=/opt/libpostal_data/libpostal/parser.tar.gz.5, offset=268435456, max=335544319
Downloading part 6: filename=/opt/libpostal_data/libpostal/parser.tar.gz.6, offset=335544320, max=402653183
Downloading part 7: filename=/opt/libpostal_data/libpostal/parser.tar.gz.7, offset=402653184, max=469762047
Downloading part 8: filename=/opt/libpostal_data/libpostal/parser.tar.gz.8, offset=469762048, max=536870911
Downloading part 9: filename=/opt/libpostal_data/libpostal/parser.tar.gz.9, offset=536870912, max=603979775
Downloading part 10: filename=/opt/libpostal_data/libpostal/parser.tar.gz.10, offset=603979776, max=671088639
Downloading part 11: filename=/opt/libpostal_data/libpostal/parser.tar.gz.11, offset=671088640, max=752483239
address_parser/
address_parser/address_parser_crf.dat
address_parser/address_parser_phrases.dat
address_parser/address_parser_postal_codes.dat
address_parser/address_parser_vocab.trie
Checking for new libpostal language classifier data file...
New libpostal language classifier data file available
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 48.0M  100 48.0M    0     0  7757k      0  0:00:06  0:00:06 --:--:-- 10.4M
language_classifier/
language_classifier/language_classifier.dat
make[2]: Leaving directory '/libpostal/src'
Making all in test
make[2]: Entering directory '/libpostal/test'
....
make[2]: Leaving directory '/libpostal/test'
make[2]: Entering directory '/libpostal'
make[2]: Leaving directory '/libpostal'
make[1]: Leaving directory '/libpostal'
Making install in src
make[1]: Entering directory '/libpostal/src'
./libpostal_data download all /opt/libpostal_data/libpostal
Checking for new libpostal data file...
libpostal data file up to date
Checking for new libpostal parser data file...
libpostal parser data file up to date
Checking for new libpostal language classifier data file...
libpostal language classifier data file up to date
make[2]: Entering directory '/libpostal/src'
...

albarrentine commented 7 years ago

Wait, does that mean you tried deleting the source dir and/or checking out the v1.0.0 tag and still got that same error? If so I'm puzzled. Just recreated this entire sequence of events in what is AFAICT an identical Docker container, started with v0.3.4, upgraded to v1.0.0 without issue.

That output's normal; it's just make and make install, respectively. Both commands have to run the libpostal_data download command (you could just run make install if you're already root; they're separated so that people using sudo avoid permissions issues). The first time, after the download succeeds, it saves some housekeeping files with the datadir version and the server timestamp of the last model downloaded, so subsequent invocations will not download the files unless the timestamp on the server has changed (i.e. there's an update).
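The skip-if-unchanged behavior described above can be sketched roughly as follows. This is an illustration of the idea only, not the actual libpostal_data script; the function names and the stamp file name are hypothetical:

```shell
#!/usr/bin/env bash
# Illustration only: NOT the real libpostal_data logic.
# A saved timestamp decides whether a model needs to be downloaded again.

# needs_download STAMP_FILE REMOTE_STAMP
# Succeeds when no stamp has been saved yet, or the saved stamp differs
# from the timestamp currently reported by the server.
needs_download() {
    local stamp_file=$1 remote_stamp=$2
    [ ! -f "$stamp_file" ] && return 0
    [ "$(cat "$stamp_file")" != "$remote_stamp" ]
}

# record_download STAMP_FILE REMOTE_STAMP
# Saves the server timestamp after a successful download.
record_download() {
    printf '%s\n' "$2" > "$1"
}
```

Under this scheme the first run (make) finds no stamp, downloads, and records it; the second run (make install) sees a matching stamp and reports the files as up to date, which matches the output quoted earlier.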

albarrentine commented 7 years ago

If the checkout has the correct commit, the following file should be present after running make: /opt/libpostal_data/libpostal/data_version.

BenK10 commented 7 years ago

I've tried git checkout tags/v1.0.0 but I'm still getting the same problem. The data_version file exists.

albarrentine commented 7 years ago

If that's the case, there should no longer be an address_parser.dat, only address_parser_crf.dat. Is that true?
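A quick way to check this from inside the container might look like the sketch below. check_parser_data is a hypothetical helper name; the file names come from this thread, and because the old file's exact location isn't stated here, the sketch searches the whole datadir for it:

```shell
#!/usr/bin/env bash
# Sketch only: check_parser_data is a hypothetical helper, not a libpostal tool.

# check_parser_data DATA_DIR
# Succeeds if data_version and address_parser_crf.dat exist and no
# old-style address_parser.dat is found anywhere under DATA_DIR.
check_parser_data() {
    local dir=$1 status=0

    if [ ! -f "$dir/data_version" ]; then
        echo "missing: $dir/data_version" >&2
        status=1
    fi

    if [ ! -f "$dir/address_parser/address_parser_crf.dat" ]; then
        echo "missing: $dir/address_parser/address_parser_crf.dat" >&2
        status=1
    fi

    # The old model's location is not given in the thread, so search for it.
    if [ -n "$(find "$dir" -name address_parser.dat -print 2>/dev/null)" ]; then
        echo "old address_parser.dat still present under $dir" >&2
        status=1
    fi

    return "$status"
}
```

For the setup in this thread it would be called as check_parser_data /opt/libpostal_data/libpostal.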

BenK10 commented 7 years ago

Yes, address_parser_crf.dat is in /opt/libpostal_data/libpostal/address_parser, and there is no address_parser.dat.

albarrentine commented 7 years ago

Ok, then there appears to be nothing wrong with the libpostal setup. Since I'm unable to replicate the error using the same Docker environment, and no one else has reported it, I'd suggest just deleting the container and starting fresh.

albarrentine commented 7 years ago

Assuming this can be closed now?

BenK10 commented 7 years ago

I still haven't fixed the problem. Using a fresh container doesn't help. It's interesting that you can't replicate the problem. The one thing I haven't done yet is to run the container on another machine. But I am more concerned now with porting libpostal to Windows.

BenK10 commented 7 years ago

I'm closing this because I now have a Windows build.