openvenues / jpostal

Java/JNI bindings to libpostal for fast international street address parsing/normalization
MIT License

Expander and Parser not in java.library.path #6

Closed gianvi closed 8 years ago

gianvi commented 8 years ago

Hi, I'm trying to run the package but I get this error (after a lot of work to build and compile everything from libpostal!!). What I've done is:

Hello JPostal!
Exception in thread "main" java.lang.UnsatisfiedLinkError: no jpostal_expander in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
    at java.lang.Runtime.loadLibrary0(Runtime.java:870)
    at java.lang.System.loadLibrary(System.java:1122)
    at MerchantBuilder.examples.AddressExpander.<clinit>(AddressExpander.java:7)
    at MerchantBuilder.TestJPostal.main(TestJPostal.java:24)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Can you please help me with this? I really want to explore the possibilities of libpostal in NLP workflows, but it's really hard to set up the whole environment because there is no sbt/Maven integration!! Thanks

albarrentine commented 8 years ago

See issue #3. For Java with native modules the .so/.jniLib files built by the JNI extension itself need to be on java.library.path. They can be found in build/natives after running gradle assemble. This Gradle plugin might be helpful: https://github.com/cjstehno/coffeaelectronica/wiki/Going-Native-with-Gradle
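For anyone hitting the same UnsatisfiedLinkError, a small diagnostic can help confirm what the JVM is actually looking for. The sketch below is not part of jpostal: the class name `NativeLibCheck` and its helper methods are made up for illustration. It relies only on standard JDK calls (`System.mapLibraryName` and the `java.library.path` system property) and on the two library names that appear in the stack traces in this thread.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jpostal.
public class NativeLibCheck {

    // Platform-specific file name that System.loadLibrary(libName) searches for,
    // e.g. a "lib" prefix and ".so" suffix on Linux.
    public static String expectedFileName(String libName) {
        return System.mapLibraryName(libName);
    }

    // Returns every file in the given search path that matches the expected name.
    public static List<File> findLibrary(String libName, String searchPath) {
        List<File> hits = new ArrayList<>();
        if (searchPath == null || searchPath.isEmpty()) {
            return hits;
        }
        String fileName = expectedFileName(libName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String searchPath = System.getProperty("java.library.path");
        System.out.println("java.library.path = " + searchPath);
        // Library names taken from the error messages in this thread.
        for (String lib : new String[] {"jpostal_expander", "jpostal_parser"}) {
            List<File> hits = findLibrary(lib, searchPath);
            System.out.println(expectedFileName(lib) + " -> "
                    + (hits.isEmpty() ? "NOT FOUND" : hits.toString()));
        }
    }
}
```

If a library is reported as not found, one option is to pass the directory that holds the built files when launching the JVM, for example `java -Djava.library.path=build/natives ...` (assuming the files ended up in build/natives as described above).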

gianvi commented 8 years ago

Thanks @thatdatabaseguy, now it's working correctly! By the way, the parser and expander are awesome. I'm now dealing with the missing "language" setting. Is there automatic selection, or some specific module/strategy to specify it? Is there a functional API? Sorry for all these questions, but it's very difficult to find docs on this. I'm trying to write an entity matcher in Spark, in Scala, and I'm using a very noisy DB with 70% Italian places. I was looking for the other functionality and properties, but it's not easy to understand how, for example, the language can be chosen.

albarrentine commented 8 years ago

Bravissimo.

For expansion, if you know the language is Italian a priori and don't want to use the automatic language classifier, you can specify to use only the Italian dictionaries as follows:

String address = "V. Benedetta, no. 25 00153 Roma";
String[] languages = {"it"};
ExpanderOptions options = new ExpanderOptions.Builder().languages(languages).build();
AddressExpander expander = AddressExpander.getInstance();
String[] expansions = expander.expandAddressWithOptions(address, options);

Note that you may want to use the automatic language classification anyway if some of the addresses are in Valle d'Aosta or Trentino-Alto Adige because some of the street names could be in French or German respectively, at least in the data sets I've seen.

For parsing, I'd need to look at the specific mistakes it's making. The current version of libpostal is trained on ~2.7M addresses in Italy. There's a new parser being developed that's trained on > 3M Italian addresses as well as simple place queries which help with parsing most of the tiny "località" that one might see in more rural addresses. The new parser also randomly appends sub-building information to the addresses so it can parse phrases like "pº 2" or "sala 123" as well as "casella postale" addresses. You may be interested in checking that out when it's released into master. The parser will probably never achieve 100% accuracy, but the next release is a major improvement nonetheless.

gianvi commented 8 years ago

Hi dbguy, and thanks for the support. Let me ask you a few more questions about libpostal. As I mentioned, I'm using it to build an entity matcher in Spark (which I hope to release soon :) ). The DB I'm working with has entities whose address features are very noisy and sparse. An example row looks like this:

| shopid | name | cap | city | indir | concat(indir, ' ', city, ' ', cap) | expanded_address |
| 27e0d398-c1a6-... | Toorr | 8001 | Zürich | | Zürich 8001 | zürich 8001 |

After expansion and parsing (and other feature-extraction techniques) with libpostal, I get this:

| shopid | ... | test_house | test_house_number | test_road | test_neighbourhood | test_suburb | test_postcode | test_city | test_state | test_country | test_country_code | test_city_district | test_state_district | test_altro |

| 27e0d398-c1a6-421... | Zürich 8001 | zürich 8001 | [zürich,null,null... | zürich | null | null | null | null | 8001 | null | null | null | null | null | null | null | gucci |

Here the problem is that Zürich (the city) is recognized as a road. I wonder whether I'm missing a step before parsing (quite possible, since I still don't know libpostal's internals very well). So my questions are: 1) Does libpostal have a dictionary of cities or something similar (in addition to the per-country street-word dictionaries that I've already looked at)? 2) How can I parse with, for example, a given city or a given country? (I assume there is a way to specify an input component that is already known.)

Thanks in advance for the support. As I asked before, any suggestions, links, APIs or docs on libpostal would be much appreciated!

albarrentine commented 8 years ago

Hm, for some reason I'm seeing very few postcodes in the final training data for Switzerland, though they definitely exist in OpenStreetMap. That's likely the source of the problem, i.e. as far as the model knows from what it's been given, "Zurich" followed by a number is actually more likely to be a road than a city + postcode (as in Uruguay, Spain, etc., where it's technically "Calle Zurich" but the "Calle" is usually omitted).

I'll look into this for the next parser release. Probably something simple. It might be affecting some other countries as well, though I've personally spot-checked the training data in 30-40 countries and this is the first time I've seen this issue.

To quickly answer your two questions:

  1. yes, but it's not a static dictionary. We derive gazetteers dynamically from the training data to help classify place names (anything from suburb up to country) and construct sense inventories for ambiguous names. These dictionaries are used to inform the parser, but again because we don't want to label the "Italia" in every "Via Italia" as the country, the dictionaries are not the only variables used to make a prediction. Others include left and right context, the model's own previous predictions, etc.
  2. If I understand correctly (if you'd prefer to write to me in Italian, please do), you're asking if it's possible to control what the parser returns when the city or country, etc. is already known, and the answer is no, it's not. I initially thought that it would be helpful to be able to specify country and language during parsing, but it actually made the model perform worse overall. Intuitively this makes sense. If you introduce country-specific models, a country like Switzerland which has several official non-endemic languages will not be able to share parameters/weights with e.g. the Italy model, so Italian addresses in Switzerland would be very sparse and the parser would get them wrong more often than the more commonly-used French or German. Language-specific models are a little better, but again perform worse than the global model, probably because the words themselves do enough to distinguish languages and there are other parameters like tag sequences that can be shared even by countries that don't have a lot of data. I suppose specifying known components could be possible if correctness guarantees are needed, but I also think users who really need that can just remove the known components themselves or rename them after parsing. The idea is, libpostal and machine parsing generally may not be perfect, but it's certainly better than the awful regexes people were writing before.
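As a rough illustration of the "rename them after parsing" idea, here is a minimal sketch. It is an assumption-laden example rather than anything jpostal provides: `KnownCityFixup`, its nested `Component` class and `relabelKnownCity` are hypothetical names, and `Component` is only a stand-in for a label/value pair like the ones the parser returns. The labels "road", "city" and "postcode" are the ones seen in the parser output earlier in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical post-processing helper, not part of jpostal.
public class KnownCityFixup {

    // Stand-in for a parsed label/value pair (assumption for illustration).
    public static final class Component {
        private final String label;
        private final String value;

        public Component(String label, String value) {
            this.label = label;
            this.value = value;
        }

        public String getLabel() { return label; }
        public String getValue() { return value; }
    }

    // Relabels as "city" any component whose value equals the known city,
    // ignoring case and surrounding whitespace. Other components pass through.
    public static List<Component> relabelKnownCity(List<Component> parsed, String knownCity) {
        String target = knownCity.trim().toLowerCase(Locale.ROOT);
        List<Component> fixed = new ArrayList<>();
        for (Component c : parsed) {
            boolean matches = c.getValue().trim().toLowerCase(Locale.ROOT).equals(target);
            fixed.add(matches ? new Component("city", c.getValue()) : c);
        }
        return fixed;
    }

    public static void main(String[] args) {
        // Simulated parser output where the city was labelled as a road.
        List<Component> parsed = new ArrayList<>();
        parsed.add(new Component("road", "zurich"));
        parsed.add(new Component("postcode", "8001"));

        for (Component c : relabelKnownCity(parsed, "Zurich")) {
            System.out.printf("%s: %s%n", c.getLabel(), c.getValue());
        }
    }
}
```

This only handles exact matches on a single known city; noisy data would likely need normalization (accents, abbreviations) before comparing.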

ibikhan1234 commented 6 years ago

Hi, I need this library working in my Java program, but it generates the following error:

run:
Exception in thread "main" java.lang.UnsatisfiedLinkError: no jpostal_expander in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
    at java.lang.Runtime.loadLibrary0(Runtime.java:870)
    at java.lang.System.loadLibrary(System.java:1122)
    at com.mapzen.jpostal.AddressExpander.<clinit>(AddressExpander.java:7)
    at addressparsing.AddressParsing.main(AddressParsing.java:18)
C:\Users\user\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 2 seconds)

albarrentine commented 6 years ago

See issue #3. The .so/.jniLib files built by the JNI extension also need to be on java.library.path.

ibikhan1234 commented 6 years ago

Where can I get that extension?

ibikhan1234 commented 6 years ago

I am running this Java program in NetBeans on Windows, not on Linux. build.gradle contains

task buildJniLib(type:Exec) { commandLine './build.sh' }

where ./build.sh is only available on Linux.

albarrentine commented 6 years ago

libpostal only recently added Windows support, and there are a few conditions currently. You can now build the C library (https://github.com/openvenues/libpostal) with MSYS2/MinGW64 or WSL, but I'm not sure about other environments/compilers. That should work for the JNI extensions here as well, since it's also an autotools build.

@AeroXuk has been working on Windows support for the C library as well as the C# bindings, and can likely answer most questions related to running libpostal on Windows.

ibikhan1234 commented 6 years ago

////////////////// MAIN CLASS /////////////////////////
package com.mapzen.jpostal;

public class Library {
    public static void main(String args[]) {
        AddressParser p = AddressParser.getInstance();
        ParsedComponent[] components = p.parseAddress("The Book Club 100-106 Leonard St, Shoreditch, London, Greater London, EC2A 4RH, United Kingdom");

        for (ParsedComponent c : components) {
            System.out.printf("%s: %s\n", c.getLabel(), c.getValue());
        }
    }
}

///////////////////////////// BUILD.GRADLE ///////////////////////////
apply plugin: 'application'

repositories {
    mavenCentral()
}

task buildJniLib(type:Exec) {
    commandLine './build.sh'
}

compileJava.dependsOn(buildJniLib)

dependencies {
    testCompile 'junit:junit:4.+'
}

tasks.withType(Test) {
    systemProperty "java.library.path", "src/main/jniLibs"
}

mainClassName = "com.mapzen.jpostal.Library"

////////////////////////////////////ERROR GENERATED/////////////////////////////////////////////

Exception in thread "main" java.lang.UnsatisfiedLinkError: no jpostal_parser in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
    at java.lang.Runtime.loadLibrary0(Runtime.java:870)
    at java.lang.System.loadLibrary(System.java:1122)
    at com.mapzen.jpostal.AddressParser.<clinit>(AddressParser.java:8)
    at com.mapzen.jpostal.Library.main(Library.java:13)

//////////////////////////////// src/main/jniLibs FOLDER CONTAINS ////////////////////////////////
1- libpostal_expander --------> with extensions (.so, .la, .so.0 and .so.0.0.0)
2- libpostal_parser --------> with extensions (.so, .la, .so.0 and .so.0.0.0)

//////////////////////////////////////MY PATH VARIABLE ////////////////////////////////////////////

$PKG_CONFIG_PATH /usr/local/lib/pkgconfig/libpostal.pc

PLEASE SOMEONE HELP ME RESOLVE THIS PROBLEM