osm-search / Nominatim

Open Source search based on OpenStreetMap data
https://nominatim.org
GNU General Public License v3.0
2.99k stars, 701 forks

Consider using libpostal #1119

Open otbutz opened 5 years ago

otbutz commented 5 years ago

It may be worth evaluating whether an optional libpostal integration could improve search results. Talking to libpostal directly might be a bit too low-level, but we could use the same strategy as Pelias and call a REST wrapper: https://github.com/whosonfirst/go-whosonfirst-libpostal#wof-libpostal-server

This would probably solve issues like #759
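
A rough sketch of the REST-wrapper idea, in Python for brevity. The host, port, `/parse` path, `address` query parameter, and the list-of-label/value JSON shape are all assumptions about how such a wrapper might be exposed, not confirmed details of wof-libpostal-server; check its README before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of a locally running libpostal REST wrapper (hypothetical).
LIBPOSTAL_URL = "http://localhost:8080"


def build_parse_url(address, base_url=LIBPOSTAL_URL):
    """Build a parse request URL; the path and parameter name are assumptions."""
    return f"{base_url}/parse?{urlencode({'address': address})}"


def components_from_response(body):
    """Turn an assumed [{"label": ..., "value": ...}, ...] JSON body into a dict."""
    return {item["label"]: item["value"] for item in json.loads(body)}


def parse_address(address, timeout=2.0):
    """Ask the wrapper to parse an address; requires a running service."""
    with urlopen(build_parse_url(address), timeout=timeout) as resp:
        return components_from_response(resp.read().decode("utf-8"))
```

Keeping the URL building and response decoding separate from the network call means the parsing side can be tested without a running service.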

Alex2782 commented 5 years ago

I tried libpostal a few weeks ago with some German input combinations.

Input "[street] [house_number] [city]" was parsed correctly, which I thought was amazing! But "[city] [street] [house_number]" was parsed wrongly: [city] was detected as "poi-name".

otbutz commented 5 years ago

I'd certainly expect bugs, but those should be fixed in the libpostal project rather than worked around in Nominatim.

Alex2782 commented 5 years ago

https://github.com/openvenues/php-postal Maybe that works too, like the Go service? Nominatim is PHP, after all (I will try it in 2-3 weeks, maybe).

https://pelias.io/index.html

Libpostal: Pelias uses the libpostal project for parsing addresses using the power of machine learning. Originally we loaded the 2GB of libpostal data directly in the API service, but this makes scaling harder and causes the API to take about 30 seconds to start, instead of a few milliseconds. We use a Go service built by the Who's on First team to make this happen quickly and efficiently.

https://github.com/openvenues/libpostal/issues/314

In general, customization/retraining is not a major goal of the project. The focus is on improving the common library for everyone instead of having lots of custom-trained models.

This is a big disadvantage for me, if you want to use Nominatim only with certain countries.

otbutz commented 5 years ago

https://github.com/openvenues/php-postal Maybe that works too, like the Go service? Nominatim is PHP, after all (I will try it in 2-3 weeks, maybe).

True, but you would be limited to a libpostal installation on the same server.

Possible problems:

In general, customization/retraining is not a major goal of the project. The focus is on improving the common library for everyone instead of having lots of custom-trained models.

This is a big disadvantage for me, if you want to use Nominatim only with certain countries.

Apart from memory consumption I don't see a problem here. It's better to rely on a general model that is properly tested than on an error-prone specialized one.

Alex2782 commented 5 years ago

Neither "wof-libpostal-server" nor "a complex Docker" setup is required, only 2 GB more RAM on the same server:

  1. https://github.com/openvenues/libpostal
  2. https://github.com/openvenues/php-postal

I had problems with pkg-config under CentOS 7 when running:

pkg-config --cflags --libs libpostal

Setting an environment variable was necessary:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

Then activate the extension in /etc/php.ini or /etc/php.d/postal.ini and restart:

sudo systemctl restart httpd

Only the httpd start takes longer than usual; php-postal answers instantly.


Some tests with Postal\Parser::parse_address( {string_input} ):

1132

  1. Nordstraße 5, 27476 Cuxhaven
  2. Nordstraße 3, 27476, Cuxhaven
  3. Nordstraße 3 27476, Cuxhaven

Output:

Input 1: road = nordstraße, house_number = 5, postcode = 27476, city = cuxhaven

Inputs 2 and 3: road = nordstraße, house_number = 3, postcode = 27476, city = cuxhaven

For structured search? https://nominatim.openstreetmap.org/search.php?&street=Nordstra%C3%9Fe+3&city=Cuxhaven&postalcode=27476

But I don't know yet how it can solve issues like #759.
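
The mapping from the parser output above to a structured search URL can be sketched as follows (Python for brevity). The label-to-parameter table is an illustrative subset chosen for this example, and joining road and house number in that order simply mirrors the URL above; neither is taken from Nominatim's code.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search.php"

# libpostal label -> Nominatim structured-search parameter (illustrative subset).
LABEL_TO_PARAM = {
    "city": "city",
    "postcode": "postalcode",
    "state": "state",
    "country": "country",
}


def structured_query(parsed):
    """Build a structured search URL from a dict of libpostal components."""
    params = {}
    # Nominatim has a single "street" parameter, so combine road and house number.
    street = " ".join(p for p in (parsed.get("road"), parsed.get("house_number")) if p)
    if street:
        params["street"] = street
    for label, param in LABEL_TO_PARAM.items():
        if parsed.get(label):
            params[param] = parsed[label]
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


parsed = {"road": "nordstraße", "house_number": "3", "postcode": "27476", "city": "cuxhaven"}
print(structured_query(parsed))
```

Labels without a counterpart in the table are silently dropped here; a real integration would have to decide what to do with them.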

gopi-ar commented 5 years ago

The libpostal PHP library doesn't scale well because each fpm worker / Apache mod_php instance will load all the libpostal data, so it's 2 GB per worker. Typically servers have tens to hundreds of workers.
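
As back-of-the-envelope arithmetic (a sketch that takes the 2 GB figure and the one-copy-per-worker premise at face value; whether each worker really holds its own copy is exactly what would need measuring):

```python
def libpostal_memory_gb(workers, per_worker_gb=2.0, shared=False):
    """Estimated RAM for libpostal data: one copy if shared, else one per worker."""
    copies = 1 if shared else workers
    return copies * per_worker_gb


for workers in (10, 100):
    print(workers, "workers:", libpostal_memory_gb(workers), "GB")
```

With a single shared copy, for example one separate parser process, the estimate stays at 2 GB regardless of the worker count.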

Instead you should look at an HTTP call to the golang libpostal worker.

However, structured search in Nominatim is still experimental and in most cases ?q= fares better so libpostal's value addition is limited.

otbutz commented 5 years ago

For structured search?

That was my intention.

But I don't know yet how it can solve issues like #759.

Maybe not spelling issues, but it could help with certain omissions, abbreviations and additions that Nominatim itself handles poorly or not at all.

These two articles explain the benefits of libpostal quite well: https://machinelearnings.co/statistical-nlp-on-openstreetmap-b9d573e6cc86 https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-part-2-80405b988718

Alex2782 commented 5 years ago

The libpostal PHP library doesn't scale well because each fpm worker / Apache mod_php instance will load all the libpostal data, so it's 2 GB per worker. Typically servers have tens to hundreds of workers.

Ok thanks, on my CentOS7-VM with 16 GB RAM.

free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        3,0G        1,8G        2,1G         10G         10G
Swap:          127M        127M          8K

and htop (idle)

I think this is no problem for us; we sometimes have maybe 10 users at the same time. See the load tests below.

[screenshot: htop]

SoapUI load test against php-postal / httpd: 100 requests per second (screenshots below).

Also no problems at 5000 requests per second.

  1. avg = 50 ms response time (+30 ms longer)
  2. more httpd processes (workers?)
  3. CPU at 10-50 % load
  4. same RAM usage, 5G / 15G; I don't know why there is no difference with more workers / requests
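
For anyone wanting to repeat a test like this without SoapUI, here is a minimal sketch in Python. It is not the setup used above: it sends a fixed number of requests over a thread pool rather than pacing them at a target rate, and `request_fn` is a hypothetical caller-supplied function that performs one request against the endpoint under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(request_fn):
    """Run one request and return its duration in milliseconds."""
    start = time.perf_counter()
    request_fn()
    return (time.perf_counter() - start) * 1000.0


def load_test(request_fn, total_requests=100, concurrency=10):
    """Send total_requests calls over a thread pool; return (count, average ms)."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(lambda _: timed_call(request_fn), range(total_requests)))
    return len(durations), sum(durations) / len(durations)
```

Comparing the average with and without the libpostal call in the request path would show how much latency the parser itself adds.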

[screenshots: sopaui-test, htop_2]

otbutz commented 5 years ago

It is designed as a shared PHP extension. Maybe it's only loaded once on Apache startup? You should check whether and how much it affects the Apache service startup time.

Alex2782 commented 5 years ago

Yes, the startup time is longer; see my post from yesterday:

.... and restart sudo systemctl restart httpd only the Httpd start takes longer than usual, php-postal answers instantly

otbutz commented 5 years ago

That would be acceptable IMHO. @lonvia what do you think about an optional libpostal integration via https://github.com/openvenues/php-postal ?

Alex2782 commented 5 years ago

We will try libpostal with Nominatim (German OSM data only) and I can post our experience later.

gopi-ar

However, structured search in Nominatim is still experimental and in most cases ?q= fares better so libpostal's value addition is limited.

https://wiki.openstreetmap.org/wiki/Nominatim

(Commas are optional, but improve performance by reducing the complexity of the search.)

street= [housenumber] [streetname] city=[city] county=[county] state=[state] country=[country] postalcode=[postalcode]

https://nominatim.openstreetmap.org/search.php?&street=3%20Nordstra%C3%9Fe&city=Cuxhaven&postalcode=27476&country=Deutschland

Input = "3 Nordstraße, Cuxhaven, 27476, Deutschland": the "structured search" parameters initialize the "q" parameter as "[housenumber] [streetname], [city], [postalcode], [country]".

lonvia commented 5 years ago

If you want to use libpostal with Nominatim in this way, you should replace the entire mechanism that creates interpretations of the search query. That means creating one or more SearchDescription objects from the libpostal output, calling query() on each of them, and then filtering and ranking the results appropriately.
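
The shape of that flow, sketched in Python with purely hypothetical stand-ins: `Interpretation` is not Nominatim's SearchDescription, `run_query` is a caller-supplied placeholder for the real lookup, and the fallback and ranking rules are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    """Hypothetical stand-in for one reading of the query (not Nominatim's class)."""
    components: tuple  # sorted (label, value) pairs


def interpretations(parsed):
    """Yield the full libpostal parse first, then a fallback without house number."""
    yield Interpretation(tuple(sorted(parsed.items())))
    if "house_number" in parsed:
        rest = {k: v for k, v in parsed.items() if k != "house_number"}
        yield Interpretation(tuple(sorted(rest.items())))


def search(parsed, run_query):
    """Run every interpretation through run_query, then merge and rank results.

    run_query is a hypothetical function returning a list of
    (place_id, score) tuples for a given Interpretation.
    """
    best = {}
    for rank, interp in enumerate(interpretations(parsed)):
        for place_id, score in run_query(interp):
            # Penalise results that came from a less specific interpretation.
            adjusted = score - rank
            if place_id not in best or adjusted > best[place_id]:
                best[place_id] = adjusted
    # Filtering here is just de-duplication; highest adjusted score first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

The point of the sketch is only the order of steps: interpretations are derived from the parser output instead of from Nominatim's own tokenizer, and ranking happens after all of them have been queried.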

powerbilayeredmap commented 5 years ago

Pelias uses libpostal and it doesn't work right all the time. They are currently investigating how to bypass it in some cases.

https://github.com/pelias/pelias/issues/766

arungowtham commented 5 years ago

https://info.crunchydata.com/blog/quick-and-dirty-address-matching-with-libpostal

https://github.com/pramsey/pgsql-postal

Libpostal as a postgres extension.

ghost commented 1 year ago

Libpostal integration would be a very good addition, hopefully it can make it to 5.0.0.

rjurney commented 1 month ago

@Alex2782 Regarding not retraining the model:

This is a big disadvantage for me, if you want to use Nominatim only with certain countries.

Senzing released a greatly improved model and test dataset.

https://senzing.com/new-libpostal-data-model-from-senzing/