openvenues / jpostal

Java/JNI bindings to libpostal for fast international street address parsing/normalization
MIT License

Setup Error: Error loading transliteration module #13

Closed TheLegend29 closed 7 years ago

TheLegend29 commented 7 years ago

I'm trying to install on the following system: Linux HP-Pavilion-15-Notebook-PC 4.8.0-41-generic #44~16.04.1-Ubuntu SMP Fri Mar 3 17:11:16 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

When I try to run ./src/address_parser I get the following error:

Loading models...
ERR   Error loading transliteration module, dir=(null)
   at libpostal_setup_datadir (libpostal.c:1069) errno: No such file or directory

The following are the steps I've used:

cd ~
rm -rf libpostal
git clone https://github.com/openvenues/libpostal
cd libpostal
./bootstrap.sh
./configure LDFLAGS=-L/usr/lib64 --datadir=$(pwd)/data --prefix=$(realpath $(pwd)) --bindir=$(realpath $(pwd)/bin)
make install
sudo ldconfig
./src/address_parser

Any guidance would be great. I'm a newbie Ubuntu user but would love to check this module out for Python.

albarrentine commented 7 years ago

That sounds like the data files didn't download. Does the dir you specified for --datadir when running configure have enough space? It needs approximately 2.2G free to build the current version of libpostal (that number may change in subsequent updates to the models).

Also, looking at the "dir=(null)" piece, which prints the value of LIBPOSTAL_DATA_DIR on error, it's possible that something went wrong in configure. Did all of those commands run successfully?

Another random thing I noticed: it looks like $(realpath $(pwd)) is used for the other paths but only $(pwd) is used for --datadir, in case that has anything to do with it (it shouldn't: pwd already returns an absolute path AFAIK, but if it doesn't on your system, that might be it).
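
As a rough way to check both points from the command line (free space and whether the datadir path is absolute), something like the following sketch may help. The helper names free_kib and is_absolute are made up for illustration, and the 2.2G figure is simply the number quoted above; it measures the current directory because the data dir may not exist yet.

```shell
#!/bin/sh
# Hypothetical helper: free space, in KiB, on the filesystem holding a directory.
free_kib() {
    # -P forces one line per filesystem, -k reports 1024-byte blocks;
    # column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed only if the path starts with "/".
is_absolute() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

datadir="$(pwd)/data"
needed_kib=$((2300 * 1024))   # assumption: roughly the 2.2G quoted above

if is_absolute "$datadir"; then
    echo "datadir is absolute: $datadir"
else
    echo "datadir is relative: $datadir"
fi

avail=$(free_kib "$(pwd)")
if [ "$avail" -ge "$needed_kib" ]; then
    echo "enough free space (${avail}K available)"
else
    echo "not enough free space (${avail}K available)"
fi
```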

TheLegend29 commented 7 years ago

According to the properties there is 682 GB of free space. Is there a better way you'd suggest to check this? If I try to put realpath in the datadir, as in the following:

./configure LDFLAGS=-L/usr/lib64 --datadir=$(realpath $(pwd)) /data --prefix=$(realpath $(pwd)) --bindir=$(realpath $(pwd)/bin)

I get a configure failure with output like this:

configure: WARNING: you should use --build, --host, --target
configure: WARNING: invalid host type: /data
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... no
checking for mawk... mawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking build system type... Invalid configuration `/data': machine `/data' not recognized
configure: error: /bin/bash ./config.sub /data failed

When running configure using the original command above I get the following output, which seems successful to me?: checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for a thread-safe mkdir -p... /bin/mkdir -p checking for gawk... no checking for mawk... mawk checking whether make sets $(MAKE)... yes checking whether make supports nested variables... yes checking build system type... x86_64-pc-linux-gnu checking host system type... x86_64-pc-linux-gnu checking how to print strings... printf checking for style of include used by make... GNU checking for gcc... gcc checking whether the C compiler works... yes checking for C compiler default output file name... a.out checking for suffix of executables... checking whether we are cross compiling... no checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ISO C89... none needed checking whether gcc understands -c and -o together... yes checking dependency style of gcc... gcc3 checking for a sed that does not truncate output... /bin/sed checking for grep that handles long lines and -e... /bin/grep checking for egrep... /bin/grep -E checking for fgrep... /bin/grep -F checking for ld used by gcc... /usr/bin/ld checking if the linker (/usr/bin/ld) is GNU ld... yes checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B checking the name lister (/usr/bin/nm -B) interface... BSD nm checking whether ln -s works... yes checking the maximum length of command line arguments... 1572864 checking how to convert x86_64-pc-linux-gnu file names to x86_64-pc-linux-gnu format... func_convert_file_noop checking how to convert x86_64-pc-linux-gnu file names to toolchain format... func_convert_file_noop checking for /usr/bin/ld option to reload object files... -r checking for objdump... objdump checking how to recognize dependent libraries... 
pass_all checking for dlltool... no checking how to associate runtime and link libraries... printf %s\n checking for ar... ar checking for archiver @FILE support... @ checking for strip... strip checking for ranlib... ranlib checking command to parse /usr/bin/nm -B output from gcc object... ok checking for sysroot... no checking for a working dd... /bin/dd checking how to truncate binary pipes... /bin/dd bs=4096 count=1 checking for mt... mt checking if mt is a manifest tool... no checking how to run the C preprocessor... gcc -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for sys/stat.h... yes checking for stdlib.h... yes checking for string.h... yes checking for memory.h... yes checking for strings.h... yes checking for inttypes.h... yes checking for stdint.h... yes checking for unistd.h... yes checking for dlfcn.h... yes checking for objdir... .libs checking if gcc supports -fno-rtti -fno-exceptions... no checking for gcc option to produce PIC... -fPIC -DPIC checking if gcc PIC flag -fPIC -DPIC works... yes checking if gcc static flag -static works... yes checking if gcc supports -c -o file.o... yes checking if gcc supports -c -o file.o... (cached) yes checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking whether -lc should be explicitly linked in... no checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking whether stripping libraries is possible... yes checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... yes checking for gcc option to accept ISO C99... none needed checking for library containing snappy_compress... -lsnappy checking for library containing log... -lm checking for ANSI C header files... (cached) yes checking whether time.h and sys/time.h may both be included... 
yes checking for dirent.h that defines DIR... yes checking for library containing opendir... none required checking for stdbool.h that conforms to C99... yes checking for _Bool... yes checking fcntl.h usability... yes checking fcntl.h presence... yes checking for fcntl.h... yes checking float.h usability... yes checking float.h presence... yes checking for float.h... yes checking for inttypes.h... (cached) yes checking limits.h usability... yes checking limits.h presence... yes checking for limits.h... yes checking locale.h usability... yes checking locale.h presence... yes checking for locale.h... yes checking malloc.h usability... yes checking malloc.h presence... yes checking for malloc.h... yes checking for memory.h... (cached) yes checking stddef.h usability... yes checking stddef.h presence... yes checking for stddef.h... yes checking for stdint.h... (cached) yes checking for stdlib.h... (cached) yes checking for string.h... (cached) yes checking for unistd.h... (cached) yes checking for inline... inline checking for int16_t... yes checking for int32_t... yes checking for int64_t... yes checking for int8_t... yes checking for off_t... yes checking for size_t... yes checking for ssize_t... yes checking for uint16_t... yes checking for uint32_t... yes checking for uint64_t... yes checking for uint8_t... yes checking for ptrdiff_t... yes checking for stdlib.h... (cached) yes checking for unistd.h... (cached) yes checking for sys/param.h... yes checking for getpagesize... yes checking for working mmap... yes checking for malloc... yes checking for realloc... yes checking for getcwd... yes checking for gettimeofday... yes checking for memmove... yes checking for memset... yes checking for munmap... yes checking for regcomp... yes checking for setlocale... yes checking for sqrt... yes checking for strdup... yes checking for strndup... yes checking for shuf... yes configure: extra cflags for scanner.c: checking that generated files are newer than configure... 
done configure: creating ./config.status config.status: creating Makefile config.status: creating libpostal.pc config.status: creating src/Makefile config.status: creating src/sparkey/Makefile config.status: creating test/Makefile config.status: creating config.h config.status: executing depfiles commands config.status: executing libtool commands

Lastly, I noticed a few warnings in the make install output. Are these normal?

In file included from collections.h:8:0, from averaged_perceptron.h:26, from address_parser.h:49, from address_parser.c:1:
address_parser.c: In function ‘address_parser_context_fill’:
log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:329:17: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, null phrase membership\n", i); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:333:17: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, phrase membership=%lld\n", i, j); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:333:17: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, phrase membership=%lld\n", i, j); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:340:9: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, null phrase membership\n", i); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:356:17: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, null geo phrase membership\n", i); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:361:17: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, geo phrase membership=%lld\n", i, j); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:361:17: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, geo phrase membership=%lld\n", i, j); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:367:9: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, null geo phrase membership\n", i); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:383:17: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, null component phrase membership\n", i); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:388:17: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, component phrase membership=%lld\n", i, j); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:388:17: note: in expansion of macro ‘log_debug’ log_debug("token i=%lld, component phrase membership=%lld\n", i, j); ^ log/log.h:26:55: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘int64_t {aka long int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

address_parser.c:394:9: note: in expansion of macro ‘log_debug’
log_debug("token i=%lld, null component phrase membership\n", i);
gcc -DHAVE_CONFIG_H -I.. -I/usr/local/include -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR='"/home/omer/libpostal/data/libpostal"' -g -g -O2 -O3 -MT build_numex_table-numex.o -MD -MP -MF .deps/build_numex_table-numex.Tpo -c -o build_numex_table-numex.o `test -f 'numex.c' || echo './'`numex.c
In file included from collections.h:8:0, from numex.h:14, from numex.c:3:
numex.c: In function ‘numex_table_read’:
log/log.h:26:55: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘uint64_t {aka long unsigned int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

numex.c:424:5: note: in expansion of macro ‘log_debug’ log_debug("read num_languages = %llu\n", num_languages); ^ log/log.h:26:55: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘uint64_t {aka long unsigned int}’ [-Wformat=]

#define log_debug(M, ...) do { if (0) fprintf(stderr, "\33[34mDEBUG\33[39m " M " \33[90m at %s (%s:%d) \33[39m\n", ##__VA_ARGS__, __func__,

                                                   ^

numex.c:446:5: note: in expansion of macro ‘log_debug’ log_debug("read num_rules = %llu\n", num_rules);

albarrentine commented 7 years ago

There's an extra space in --datadir=$(realpath $(pwd)) /data; I think you want --datadir=$(realpath $(pwd))/data instead.
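
To illustrate why the stray space matters: the shell splits the command line on unquoted whitespace, so /data reaches configure as a separate argument, which matches the "invalid host type: /data" warning in the output above. A minimal sketch, where count_args is a made-up helper and base is a stand-in for $(realpath $(pwd)):

```shell
#!/bin/sh
# Hypothetical helper: print how many arguments it received.
count_args() {
    echo "$#"
}

base="/home/user/libpostal"   # stand-in for $(realpath $(pwd))

# With the space, "/data" becomes its own argument.
count_args --datadir=$base /data   # prints 2

# Without the space, the whole option is a single argument.
count_args --datadir=$base/data    # prints 1
```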

It looks like everything compiled properly. The main thing to check is: du -h $(realpath $(pwd))/data and make sure that it's roughly 2.2G. If not, try removing the data dir and running make again. Toward the end of the make output there should be a line like ./libpostal_data download all $YOUR_DATA_DIR. If something went wrong when downloading the data files, that's where to look. If all else fails, nuke the entire checkout and start from a fresh clone.
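
The size check could be scripted along these lines. This is only a sketch: dir_kib is a made-up helper, and the 2G threshold is an assumption based on the approximate figure above rather than an exact requirement.

```shell
#!/bin/sh
# Hypothetical helper: total size of a directory in KiB (0 if it does not exist).
dir_kib() {
    if [ -d "$1" ]; then
        # -s prints a single total, -k reports it in 1024-byte units
        du -sk "$1" | awk '{ print $1 }'
    else
        echo 0
    fi
}

datadir="$(pwd)/data"
min_kib=$((2 * 1024 * 1024))   # assumption: anything well under ~2G means the download failed

size=$(dir_kib "$datadir")
if [ "$size" -lt "$min_kib" ]; then
    echo "data dir is only ${size}K; the data files probably did not download"
else
    echo "data dir looks populated (${size}K)"
fi
```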

TheLegend29 commented 7 years ago

I think you're onto something with the 2.2G. It's stuck producing the output below, both before and after. I've also tried the whole process from scratch a few times with the same outcome (the added realpath doesn't appear to make a difference). I'm not sure why the 2.2G isn't downloading; there's no folder limit or read/write issue associated with the datadir. Sorry for the lengthy back and forth trying to get this to work.

8.0K /home/user/libpostal/data/address_parser
20K /home/user/libpostal/data/libpostal
8.0K /home/user/libpostal/data/geonames
8.0K /home/user/libpostal/data/transliteration
8.0K /home/user/libpostal/data/numex
8.0K /home/user/libpostal/data/geodb
8.0K /home/user/libpostal/data/address_expansions
72K /home/user/libpostal/data

albarrentine commented 7 years ago

Try running ./src/libpostal_data download all $(pwd)/data and see what happens.

TheLegend29 commented 7 years ago

I get the following output. Aren't these the files that should be in data, not in src?

Checking for new libpostal data file...
./src/libpostal_data: 108: ./src/libpostal_data: curl: not found
libpostal data file up to date
Checking for new libpostal geodb data file...
./src/libpostal_data: 108: ./src/libpostal_data: curl: not found
libpostal geodb data file up to date
Checking for new libpostal parser data file...
./src/libpostal_data: 108: ./src/libpostal_data: curl: not found
libpostal parser data file up to date
Checking for new libpostal language classifier data file...
./src/libpostal_data: 108: ./src/libpostal_data: curl: not found
libpostal language classifier data file up to date

It looks like it's saving in src. Looking at the size, is this the expected behavior?

12K ./src/murmur
44K ./src/utf8proc/.deps
516K    ./src/utf8proc/.libs
7.2M    ./src/utf8proc
96K ./src/sparkey/.deps
752K    ./src/sparkey/.libs
1.5M    ./src/sparkey
20K ./src/cmp/.deps
188K    ./src/cmp/.libs
1.5M    ./src/cmp
8.0K    ./src/log
20K ./src/geohash/.deps
52K ./src/geohash/.libs
320K    ./src/geohash
64K ./src/klib
12K ./src/linenoise/.deps
156K    ./src/linenoise
1.6M    ./src/.deps
42M ./src/.libs
230M    ./src/

albarrentine commented 7 years ago

curl is not installed. Run apt-get install curl, delete the datadir, and run make again.
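
Since the download script in the log above reported "data file up to date" even though curl was missing, it may be worth checking for required tools before building. A minimal sketch, where missing_tools is a made-up helper and curl is the only tool this thread confirms is needed:

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools curl)
if [ -n "$missing" ]; then
    echo "missing: $missing (install it before running make)" >&2
else
    echo "all required tools found"
fi
```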

TheLegend29 commented 7 years ago

Oh wow, that was a silly oversight; now it works. That likely means this next problem is also silly, but why do I get a similar error when I try to bring it into Python? I've done the following:

cd libpostal/scripts
python3 setup.py build_ext --inplace
cd /home/user/.local/lib/python3.5/site-packages/
nosetests postal/tests

I get the same kind of error (dir=(null)), both in the nose test and if I try to import in Python:

ERR   Error loading transliteration module, dir=(null)
   at libpostal_setup_datadir (libpostal.c:1069) errno: No such file or directory
ERR   Error loading transliteration module, dir=(null)
   at libpostal_setup_datadir (libpostal.c:1069) errno: No such file or directory
EE
======================================================================
ERROR: Failure: SystemError (initialization of _expand raised unreported exception)

albarrentine commented 7 years ago

Those are not the Python bindings, these are. Try: pip3 install postal

TheLegend29 commented 7 years ago

Yup, that's what I did before the above. However, when I run it in Python I end up with the following error. I thought it was an error like the one we just solved, where libpostal didn't get installed properly. But now I'm confused, since it works outside of Python.

from postal.expand import expand_address
ERR   Error loading transliteration module, dir=(null)
   at libpostal_setup_datadir (libpostal.c:1069) errno: No such file or directory
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/omer/.local/lib/python3.5/site-packages/postal/expand.py", line 5, in <module>
    from postal import _expand
SystemError: initialization of _expand raised unreported exception

albarrentine commented 7 years ago

It seems like there were some non-standard steps taken during the install, so it may be in a weird half-installed state. Try a clean build now that curl is installed and the data files can be downloaded using the exact steps from the README for Ubuntu.