pelias / parser

natural language classification engine for geocoding
https://parser.demo.geocode.earth
MIT License

Streetnames ending in -'burg', -'daal' not recognised #131

Closed emacgillavry closed 3 years ago

emacgillavry commented 3 years ago

Street names may end in -'burg', e.g. 'Clarenburg' and 'Vredenburg' in Utrecht. Street names may also end in -'daal', e.g. 'Bloemendaal' and 'Groenendaal' in Gouda.

emacgillavry commented 3 years ago

Although neither -'daal' nor -'burg' refers to a thoroughfare, these are only recognised once they are added to street_types.txt, not street_names.txt.

These suffixes are only recognised once added to concatenated_suffixes_separable.txt, but they aren't used on their own, so they should actually be in concatenated_suffixes_inseparable.txt.

Please clarify, @missinglink .

missinglink commented 3 years ago

Hi @emacgillavry, sorry for the late reply. From what I can see, the 'inseparable' file isn't currently being considered. You're correct in saying that the 'separable' file is already being used by classifier/CompoundStreetClassifier.js.

I guess we've never had a need to do this yet, we could add a classifier which looks at the suffix of a word and classifies it accordingly, this should be fairly trivial to code up.
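To make the idea concrete, a suffix-based classifier could look roughly like the sketch below. This is not code from pelias/parser: the function name `classifyBySuffix`, the in-memory suffix set and the returned object shape are all assumptions made for illustration. The real suffix lists live in the libpostal dictionary files.

```javascript
// Hypothetical sketch, NOT the pelias/parser API.
// The suffix list is illustrative; real suffixes come from dictionary files.
const inseparableSuffixes = new Set(['burg', 'daal'])

// Classify a single word as a street candidate when it ends in a known
// suffix and is strictly longer than that suffix (an inseparable suffix
// never stands on its own).
function classifyBySuffix (word, suffixes = inseparableSuffixes, minlength = 3) {
  const lower = word.toLowerCase()
  for (const suffix of suffixes) {
    // skip very short suffixes, which tend to be ambiguous
    if (suffix.length < minlength) continue
    if (lower.length > suffix.length && lower.endsWith(suffix)) {
      return { classification: 'street', suffix }
    }
  }
  return null
}

console.log(classifyBySuffix('Vredenburg')) // matches the suffix 'burg'
console.log(classifyBySuffix('burg'))       // null, the bare suffix is not a word with a suffix
console.log(classifyBySuffix('Hoogland'))   // null, no known suffix
```

The length check is what distinguishes an inseparable suffix from a separable one in this sketch: a separable suffix would also be allowed to match as a standalone word.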

Regarding -burg specifically I'm a little worried that this might cause issues with the German language where burg means castle and so we may have issues with places such as Charlottenburg in Berlin being incorrectly classified as a street. This may or may not be an issue in practice.

missinglink commented 3 years ago

I haven't looked at the code for a long while, it looks to me that you can add the suffixes to ..separable.txt and it won't have any unintended consequences.

But yeah, agreed it's not very intuitive, we should probably load both files.

emacgillavry commented 3 years ago

Thnx for your feedback. In the Netherlands, there are towns that have '-daal' or '-burg' in the name too, e.g.

I'll investigate how the suggested changes affect whether these town names are still found. I got a bit carried away by street names, so I'll have to tread carefully.

missinglink commented 3 years ago

there are towns that have '-daal' or '-burg' in the name too

Yeah that's confusing, but I guess that's just the way it is 🤷‍♂️

Those ambiguous words should therefore be classified as both a street and a locality with care taken to select appropriate confidence scores for each depending on context.
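As a rough illustration of what "both a street and a locality" could mean in practice, the sketch below returns two candidate labels for an ambiguous word and shifts the confidence depending on context. Everything here is hypothetical: the function name `classifyAmbiguous`, the context flags `hasHouseNumber` and `followsComma`, and the numeric scores are placeholders, not values or heuristics taken from pelias/parser.

```javascript
// Hypothetical sketch, NOT the pelias/parser scoring logic.
// The confidence values are placeholders chosen only for illustration.
function classifyAmbiguous (word, context = {}) {
  const scores = { street: 0.5, locality: 0.5 }

  // assumption: an adjacent house number makes the street reading more likely
  if (context.hasHouseNumber) {
    scores.street = 0.8
    scores.locality = 0.2
  }

  // assumption: a word following a comma is more likely to be a locality
  if (context.followsComma) {
    scores.street = 0.2
    scores.locality = 0.8
  }

  // return both readings, highest confidence first
  return Object.entries(scores)
    .map(([label, confidence]) => ({ word, label, confidence }))
    .sort((a, b) => b.confidence - a.confidence)
}

console.log(classifyAmbiguous('Bloemendaal', { hasHouseNumber: true }))
console.log(classifyAmbiguous('Bloemendaal', { followsComma: true }))
```

Both readings are always kept; only their ordering changes, which leaves a downstream consumer free to pick the top one or inspect the alternatives.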

missinglink commented 3 years ago

Seems to work pretty well out-of-the-box with an additional file and some fairly minor changes:

cat resources/pelias/dictionaries/libpostal/nl/concatenated_suffixes_inseparable.txt
daal
diff --git a/classifier/CompoundStreetClassifier.js b/classifier/CompoundStreetClassifier.js
index 21c2874..79a843a 100644
--- a/classifier/CompoundStreetClassifier.js
+++ b/classifier/CompoundStreetClassifier.js
@@ -15,6 +15,12 @@ class CompoundStreetClassifier extends WordClassifier {
       // this removes suffixes such as 'r.' which can be ambiguous
       minlength: 3
     })
+
+    libpostal.load(this.suffixes, ['de', 'nl'], 'concatenated_suffixes_inseparable.txt', {
+      // remove any suffixes which contain less than 3 characters (excluding a period)
+      // this removes suffixes such as 'r.' which can be ambiguous
+      minlength: 3
+    })
   }

   each (span) {
diff --git a/test/address.nld.test.js b/test/address.nld.test.js
index fad71a0..731b356 100644
--- a/test/address.nld.test.js
+++ b/test/address.nld.test.js
@@ -12,6 +12,13 @@ const testcase = (test, common) => {
   assert('Bosserdijk, Hoogland', [
     { street: 'Bosserdijk' }, { locality: 'Hoogland' }
   ])
+
+  assert('Clarenburg', [[{ street: 'Clarenburg' }]], false)
+
+  assert('Bloemendaal', [
+    [{ locality: 'Bloemendaal' }],
+    [{ street: 'Bloemendaal' }]
+  ], false)
 }

 module.exports.all = (tape, common) => {

The false argument at the end is admittedly unintuitive: it tells the assert function that we'd like to test against all solutions rather than only the first one.

At this stage the pelias/api codebase only considers the top solution (the highest scoring one) but it might elect to use more or to skip the highest scoring solution at a later date.
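The difference between checking only the top solution and checking every solution can be sketched as follows. This is a hypothetical helper, not the assert function from the test suite: the name `matchesExpected`, the `score` and `labels` fields and the example scores are all assumptions for illustration.

```javascript
// Hypothetical helper, NOT the assert function used in the pelias/parser tests.
// Solutions are assumed to carry a numeric `score` and a `labels` array.
function matchesExpected (solutions, expected, firstOnly = true) {
  // rank solutions from highest to lowest score
  const ranked = [...solutions]
    .sort((a, b) => b.score - a.score)
    .map(solution => solution.labels)

  // JSON comparison is key-order sensitive, which is good enough for a sketch
  const actual = firstOnly ? ranked[0] : ranked
  return JSON.stringify(actual) === JSON.stringify(expected)
}

// made-up scores, purely for illustration
const solutions = [
  { score: 0.6, labels: [{ street: 'Bloemendaal' }] },
  { score: 0.8, labels: [{ locality: 'Bloemendaal' }] }
]

// only the highest scoring solution is compared
console.log(matchesExpected(solutions, [{ locality: 'Bloemendaal' }]))

// all solutions are compared, in ranked order
console.log(matchesExpected(solutions, [
  [{ locality: 'Bloemendaal' }],
  [{ street: 'Bloemendaal' }]
], false))
```

With `firstOnly` left at its default the expected value is a single solution; passing `false` switches the expected value to a list of solutions, mirroring the nested arrays in the test cases above.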

emacgillavry commented 3 years ago

We'll add the suffix "-daal" to concatenated_suffixes_separable.txt, as it may appear separately, e.g. "De Daal, Deurne". We'll add the suffix "-burg" to concatenated_suffixes_inseparable.txt.