pelias / parser

natural language classification engine for geocoding
https://parser.demo.geocode.earth
MIT License

investigate ambiguous parsing of the -burg suffix in NL/DE #152

Open missinglink opened 2 years ago

missinglink commented 2 years ago

Today we are merging https://github.com/pelias/api/pull/1565 which brings a bunch of pelias/parser changes into pelias/api.

As part of this process we did some wider acceptance test checks and diff'd them against the current baseline.

One change we identified involves this query: at the partial completion "grolmanstrasse 51, charlottenburg", the parser classifies the Berlin borough Charlottenburg as a street.

 grolmanstrasse 51, charlottenburg, berlin
-FFFFFFFFFFFFFFFF0000000000000000000000000
+FFFFFFFFFFFFFFFF0000000000000000FFFF0FFF0

This was likely introduced in the recent NL work https://github.com/pelias/parser/pull/126.

I would like to see if we can find a better way of handling the ambiguities between German and Dutch for the -burg suffix.

note: the correct solution is also being generated, but both solutions score the same. The scoring is based on matched token length, so a robust fix would need to work equally well when len(street) < len(borough), when len(street) > len(borough), and when len(street) == len(borough).

================================================================
SOLUTIONS (2ms)
----------------------------------------------------------------
(0.53) ➜ [ { housenumber: '51' }, { street: 'Charlottenburg' } ]

(0.53) ➜ [ { street: 'Grolmanstrasse' }, { housenumber: '51' } ]
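
To illustrate why the two solutions tie, here is a minimal sketch of a length-based score. It is not the parser's actual scoring code: the function name `coverage_score` is hypothetical, and the denominator (alphanumeric characters in the partial query) is an assumption chosen only because it reproduces the 0.53 shown above.

```python
import re

def alnum_len(text):
    """Count only letters and digits, ignoring spaces and punctuation."""
    return len(re.sub(r"[^0-9a-z]", "", text.lower()))

def coverage_score(query, solution):
    """Hypothetical score: share of the query's characters covered by labels.

    `solution` is a list of (label, matched_text) pairs. The label itself is
    ignored, which is exactly why a street and a borough of equal length tie.
    """
    covered = sum(alnum_len(text) for _label, text in solution)
    return round(covered / alnum_len(query), 2)

query = "grolmanstrasse 51, charlottenburg"
borough_as_street = [("housenumber", "51"), ("street", "Charlottenburg")]
expected = [("street", "Grolmanstrasse"), ("housenumber", "51")]

print(coverage_score(query, borough_as_street))  # 0.53
print(coverage_score(query, expected))           # 0.53
```

"Grolmanstrasse" and "Charlottenburg" are both 14 characters long, so under this model each solution covers 16 of 30 characters. A tiebreak based on length alone cannot separate them, and it would pick the wrong solution whenever the borough name is longer than the street name.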

missinglink commented 2 years ago

related: https://github.com/pelias/parser/issues/131#issuecomment-783072724