**Closed** · missinglink closed this 4 years ago
You are using the section classifier and forcing the length to 2, which definitely reduces side effects :+1:.

But we should be careful with words and phrases. In your PR the `Alpha` member should not be classified with a public classification, which is good IMO. But a section is composed of words... and one word can also be a phrase (#47).

Here the word `Paris` is classified as an `Alpha`, but the phrase is classified as `Locality`... theoretically this would mean that `CentralEuropeanStreetNameClassifier` should not classify it :confused:

It's OK for now because the confidence is low; this is a reminder for me :sweat_smile:
```
$ node bin/cli.js Paris 75000, France
```

`master`:

```
================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT ➜ Paris 75000, France
SECTIONS ➜ Paris 75000 0:11 France 12:19
S0 TOKENS ➜ Paris 0:5 75000 6:11
S1 TOKENS ➜ France 13:19
S0 PHRASES ➜ Paris 75000 0:11 Paris 0:5 75000 6:11
S1 PHRASES ➜ France 13:19
================================================================
CLASSIFICATIONS (4ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Paris ➜ alpha 1.00 start_token 1.00
75000 ➜ numeric 1.00 housenumber 0.90 postcode 1.00
France ➜ alpha 1.00 end_token 1.00
----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Paris ➜ given_name 1.00 surname 1.00 area 1.00 locality 1.00
France ➜ given_name 1.00 surname 1.00 area 1.00 country 0.90
================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.96) ➜ [ { locality: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]
```
`central_european_streets`:

```
================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT ➜ Paris 75000, France
SECTIONS ➜ Paris 75000 0:11 France 12:19
S0 TOKENS ➜ Paris 0:5 75000 6:11
S1 TOKENS ➜ France 13:19
S0 PHRASES ➜ Paris 75000 0:11 Paris 0:5 75000 6:11
S1 PHRASES ➜ France 13:19
================================================================
CLASSIFICATIONS (6ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Paris ➜ alpha 1.00 start_token 1.00 street 0.50
75000 ➜ numeric 1.00 housenumber 0.90 postcode 1.00
France ➜ alpha 1.00 end_token 1.00
----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Paris ➜ given_name 1.00 surname 1.00 area 1.00 locality 1.00
France ➜ given_name 1.00 surname 1.00 area 1.00 country 0.90
================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.96) ➜ [ { locality: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]
(0.79) ➜ [ { street: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]
(0.77) ➜ [ { street: 'Paris' },
  { housenumber: '75000' },
  { country: 'France' } ]
```
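The word-vs-phrase rule described above could be sketched roughly like this (hypothetical names and data shapes, not the actual pelias/parser API): a tentative `street` classification on a word is suppressed when the phrase containing it already carries a confident public classification such as `locality`.

```javascript
// Hypothetical sketch, not the real classifier: skip the street guess
// for a word when its enclosing phrase is already a confident public
// place (locality, region, country, area).
const PUBLIC_LABELS = ['locality', 'region', 'country', 'area'];

function phraseHasPublicClassification(phraseClassifications) {
  return phraseClassifications.some(
    (c) => PUBLIC_LABELS.includes(c.label) && c.confidence >= 0.9
  );
}

function shouldClassifyWordAsStreet(phraseClassifications) {
  // 'Paris' the word is alpha, but the phrase 'Paris' is locality 1.00,
  // so the tentative street 0.50 guess would be suppressed here.
  return !phraseHasPublicClassification(phraseClassifications);
}

console.log(shouldClassifyWordAsStreet([
  { label: 'given_name', confidence: 1.0 },
  { label: 'locality', confidence: 1.0 },
])); // → false
```

Under this rule the `(0.79)` and `(0.77)` street solutions above would never be generated for `Paris`.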
Yeah agreed, it should ensure that the tokens have no public classifications at all. It's a really tricky case to handle without a gazetteer and/or a geocoder.

There is a street I cycle past quite often called `Esplanade`, and I'm wondering how we will ever be able to correctly parse those addresses, e.g. `Esplanade 17, 13187 Berlin, Germany`.

Maybe we also add a check that the `housenumber` span doesn't also have a `postcode` classification.
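The check suggested here could look something like the following sketch (illustrative function and data shapes, not the parser's real internals): a candidate solution is rejected when the span it uses as a `housenumber` also carries a `postcode` classification.

```javascript
// Hypothetical sketch: reject a candidate solution when the span chosen
// as housenumber was also classified as postcode, as with '75000' in
// 'Paris 75000, France'.
function housenumberConflictsWithPostcode(solution, labelsByToken) {
  return solution.some((entry) => {
    if (!('housenumber' in entry)) return false;
    const labels = labelsByToken[entry.housenumber] || [];
    return labels.includes('postcode');
  });
}

// '75000' was classified as both housenumber 0.90 and postcode 1.00,
// so this candidate would be filtered out:
console.log(housenumberConflictsWithPostcode(
  [{ street: 'Paris' }, { housenumber: '75000' }, { country: 'France' }],
  { '75000': ['numeric', 'housenumber', 'postcode'] }
)); // → true
```

Note this would correctly keep the `Esplanade 17` case, since `17` carries no `postcode` classification.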
Nice, your PR seems to work for `Esplanade` too! (which is a street prefix in French)

```
$ node bin/cli.js Esplanade 17, 13187 Berlin, Germany
```
```
================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT ➜ Esplanade 17, 13187 Berlin, Germany
SECTIONS ➜ Esplanade 17 0:12 13187 Berlin 13:26 Germany 27:35
S0 TOKENS ➜ Esplanade 0:9 17 10:12
S1 TOKENS ➜ 13187 14:19 Berlin 20:26
S2 TOKENS ➜ Germany 28:35
S0 PHRASES ➜ Esplanade 17 0:12 Esplanade 0:9 17 10:12
S1 PHRASES ➜ 13187 Berlin 14:26 13187 14:19 Berlin 20:26
S2 PHRASES ➜ Germany 28:35
================================================================
CLASSIFICATIONS (4ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Esplanade ➜ alpha 1.00 start_token 1.00 street_prefix 1.00 street 0.50
17 ➜ numeric 1.00 housenumber 1.00
13187 ➜ numeric 1.00 housenumber 0.20 postcode 1.00
Berlin ➜ alpha 1.00
Germany ➜ alpha 1.00 end_token 1.00
----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Berlin ➜ surname 1.00 area 1.00 locality 1.00 region 1.00
Germany ➜ area 1.00 country 0.90
================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.82) ➜ [ { street: 'Esplanade' },
  { housenumber: '17' },
  { postcode: '13187' },
  { locality: 'Berlin' },
  { country: 'Germany' } ]
(0.82) ➜ [ { street: 'Esplanade' },
  { housenumber: '17' },
  { postcode: '13187' },
  { region: 'Berlin' },
  { country: 'Germany' } ]
```
I just added two more test cases. I also added some code to check the parent phrases but it caused one test to fail, so I'm thinking we just leave it as-is for now?
Adds a new `CentralEuropeanStreetNameClassifier` which is able to handle the cases mentioned in https://github.com/pelias/parser/issues/83. It's still fairly basic, but relatively safe.

In the future we may consider expanding this to also cover `1 xxx` instead of `xxx 1` (although this might be dangerous?).

closes: https://github.com/pelias/parser/issues/83
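The possible future extension mentioned above (accepting the housenumber before the street as well as after it) could be sketched like this; the regexes and return shape are illustrative only, not the classifier's actual implementation:

```javascript
// Hypothetical sketch of recognising both orderings of street and
// housenumber within a section: 'Foostraße 1' and '1 Foostraße'.
const LEADING = /^(\d+[a-z]?)\s+(\D+)$/i;   // '1 Foostraße'
const TRAILING = /^(\D+?)\s+(\d+[a-z]?)$/i; // 'Foostraße 1'

function parseStreetSection(section) {
  let m = section.match(TRAILING);
  if (m) return { street: m[1], housenumber: m[2] };
  m = section.match(LEADING);
  if (m) return { street: m[2], housenumber: m[1] };
  return null; // neither ordering matched
}

console.log(parseStreetSection('Esplanade 17')); // street 'Esplanade', housenumber '17'
console.log(parseStreetSection('1 Esplanade'));  // street 'Esplanade', housenumber '1'
```

The danger alluded to in the description is visible even in this toy version: a leading number is far more ambiguous than a trailing one (it could be a postcode fragment or a unit number), so such a rule would likely need a lower confidence or extra guards.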