pelias / parser

natural language classification engine for geocoding
https://parser.demo.geocode.earth

Better documentation of what parser is? #82

Open · blackmad opened this issue 4 years ago

blackmad commented 4 years ago

Hey team,

I'm reading through as much of the Pelias docs as I can find. I followed a link to pelias/parser and, after reading through it (as well as https://geocode.earth/blog/2019/improved-autocomplete-parsing-is-here ), I think there could be some changes to the README that would help make it more understandable. Questions I still have:

1. What is the relationship between pelias/parser and libpostal?
2. Where does the Pelias API use pelias/parser, and where does it use libpostal?
3. How well does pelias/parser work, and what is it good or bad at?
4. What should I expect from an incomplete query like 111 8th a?

I would try to update the docs myself but I'm unclear as to the answers.

Best, David

missinglink commented 4 years ago

Good questions!

I think the reason we haven't covered the first three here 🔽 is that this is intended as a standalone project; as such it's unrelated to libpostal, which I guess is why no comparisons were made.

But yeah, we should do that, either here or somewhere else in the Pelias documentation 📖

Some more definitive answers:

  1. There is no relationship, except that they are both address-parsing libraries we have been actively involved with.

In the beginning we used a regular-expression-based parser, and we found it to be too simple and not very accurate. So we met with Al, and Mapzen funded the original work on libpostal, which was integrated into the /v1/search endpoint upon the v1 release of libpostal, probably around 5 years ago now.

Libpostal is based on a machine learning model. It's a black box which unfortunately no one except Al has contributed to very much, and it hasn't seen much development in the last 2 years or so, although there's actually been some activity this year!

Libpostal is amazing, but it has some negatives: it requires a lot of RAM to load its model, it was trained on fully formed postal addresses so it struggles with partial input, and it returns a single parse with no confidence score, so it can't indicate when it's unsure.

The 'Pelias Parser' was never intended to replace libpostal; it is simply another option, one we can iterate on faster because it's written in JavaScript, which is a more familiar dev environment for our community.

In the process I tried to address some of the issues we had with libpostal, but of course writing a natural language parser isn't easy, especially for incomplete input!

  2. Simply put, /v1/search uses libpostal and /v1/autocomplete uses pelias/parser. You can see the name of the parser used, as well as the parsed text, within the geojson header of the response (see the first sketch after this list). In some cases we will fail to match anything using libpostal; if that happens we 'fall back' to pelias/parser and a looser query on /v1/search, which usually helps find something close.

  3. How long is a piece of string? See the test cases: https://github.com/pelias/parser/tree/master/test

  4. The pelias/parser is not intended to magically read people's minds; for queries like 111 8th a it is expected either to return no solution or to return a solution with low confidence (see the second sketch below). It's not intended to be a geocoder, but it can provide a strong signal to the geocoder that it might be better off selecting a looser query, since it's not quite clear what the intent is (yet).
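To make both of those concrete, here are two rough sketches. First, inspecting the geojson header of a Pelias API response to see which parser handled the query; the field names follow the Pelias API's geocoding block, but the exact parsed_text keys vary by parser and query, and the host is a placeholder:

```js
// Sketch: see which parser handled a query on a Pelias instance.
// 'your-pelias-host' is a placeholder; parsed_text keys vary by parser.
const text = encodeURIComponent('30 w 26th st, new york, ny')

fetch(`https://your-pelias-host/v1/search?text=${text}`)
  .then(res => res.json())
  .then(({ geocoding }) => {
    console.log(geocoding.query.parser)      // e.g. 'libpostal'
    console.log(geocoding.query.parsed_text) // e.g. { street: 'w 26th st', ... }
  })
```

And second, calling pelias/parser directly on an ambiguous input; the class names and require paths here follow this repo's layout at the time of writing, so treat them as approximate:

```js
// Sketch: parse an ambiguous input directly with pelias/parser.
// Class names/paths are based on this repo's README and may change.
const Tokenizer = require('pelias-parser/tokenization/Tokenizer')
const AddressParser = require('pelias-parser/parser/AddressParser')

const parser = new AddressParser()
const tokenizer = new Tokenizer('111 8th a')

parser.classify(tokenizer)
parser.solve(tokenizer)

// For ambiguous input we expect few or no solutions, and low scores on
// any solutions that are found.
tokenizer.solution.forEach(s => {
  console.log(s.score, s.pair.map(p => `${p.span.body} -> ${p.classification.label}`).join(', '))
})
```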

Hope that helps ;)

blackmad commented 4 years ago

That's super helpful, thanks Peter! I might submit some PRs to READMEs and docs to get the ball rolling from here.

re: https://github.com/pelias/parser/tree/master/test - that doesn't really give me an intuition about what each parser is good or bad at. That's the type of thing you probably have a good sense for, and it would be good to write down somewhere?

missinglink commented 4 years ago

Machine Learning model

My intuition is that the machine learning model is superior on all inputs like those it was trained on, i.e. fully formed postal addresses containing the locality name.

The weakness of the machine learning model is that it performs poorly on anything it wasn't trained to recognise. Here's one example from an old issue of ours which predates the pelias parser: https://github.com/pelias/api/issues/795#issuecomment-279458285. I just linked the first one I found; there are many more reports on the libpostal GitHub if you're curious.

But the thing is, it's a super, super difficult problem, and there are always going to be edge cases and people opening issues about how their address doesn't parse correctly.

I think one of the major strengths of this architecture is that, when the machine was trained, it saw many different formats of addresses from all over the world and hopefully learned to recognise their unique syntax patterns.

Dictionary- and pattern-based natural language processing model

This architecture is superior in terms of how easy it is to test and iterate on. The pelias/parser uses the dictionary files from libpostal as an unfair advantage, meaning that a lot of the dictionary terms required to build classifiers were imported from there.

On top of the token matching sit logical classifiers which are able to look at the context of the terms, their adjacency, etc. (a simplified sketch follows below). From this we can start building increasingly complex classifiers based on prior work, and each step is covered by tests as we go to prevent regressions.
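As a hypothetical, much-simplified illustration of that layering (these are not actual classifiers from this repo): a dictionary-based classifier first, then a contextual one that uses token adjacency:

```js
// Hypothetical illustration only: a dictionary-based classifier plus a
// contextual classifier that uses token adjacency, as described above.
const streetSuffixes = new Set(['st', 'street', 'ave', 'avenue', 'rd', 'road'])

function classify (tokens) {
  return tokens.map((token, i) => {
    // dictionary-based: the token itself is a known street suffix
    if (streetSuffixes.has(token.toLowerCase())) {
      return { token, label: 'street_suffix' }
    }
    // contextual: a bare number with a street suffix somewhere after it
    // is probably a house number
    const suffixAhead = tokens.slice(i + 1).some(t => streetSuffixes.has(t.toLowerCase()))
    if (/^\d+$/.test(token) && suffixAhead) {
      return { token, label: 'housenumber' }
    }
    return { token, label: 'unknown' }
  })
}

console.log(classify('111 8th ave'.split(' ')))
// [ { token: '111', label: 'housenumber' },
//   { token: '8th', label: 'unknown' },
//   { token: 'ave', label: 'street_suffix' } ]
```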

So this model shines in its flexibility; a machine learning model would not be so easy to edit to add things like plus codes, or 'pizza new york', or dynamic dictionaries.

The disadvantage is that it's written by humans, and we don't have knowledge of the whole world and its weird and wonderful addressing patterns, so support will need to be added by hand whenever we encounter a new and unusual address format.

missinglink commented 4 years ago

My intuition is that both parsers do very well in the USA and in French- and German-speaking countries; in other countries I suspect libpostal is stronger, particularly in places like Japan/Korea, where no pelias/parser country-specific classifiers have been added.

But there are some inputs which libpostal simply can't handle, or, more accurately, can't indicate that it can't handle; for those, pelias/parser is the only option.

missinglink commented 4 years ago

cc @blackmad: this is a good example of where the pelias/parser needs human intervention from time to time: https://github.com/pelias/pelias/issues/854