**Closed** · missinglink closed this 4 years ago
You are using the section classifier and forcing the length to 2, which definitely reduces side effects :+1:.

But we should be careful with words and phrases. In your PR the `Alpha` member should not be classified with a public classification, which is good IMO. But a section is composed of words... and one word can also be a phrase (#47).

Here the word `Paris` is classified as an `Alpha`, but the phrase is classified as `Locality`... theoretically this would mean that `CentralEuropeanStreetNameClassifier` should not classify it :confused:

It's OK for now because the confidence is low; this is a reminder for me :sweat_smile:
```
$ node bin/cli.js Paris 75000, France
```

`master`:

```
================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT ➜ Paris 75000, France
SECTIONS ➜ Paris 75000 0:11 France 12:19
S0 TOKENS ➜ Paris 0:5 75000 6:11
S1 TOKENS ➜ France 13:19
S0 PHRASES ➜ Paris 75000 0:11 Paris 0:5 75000 6:11
S1 PHRASES ➜ France 13:19
================================================================
CLASSIFICATIONS (4ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Paris ➜ alpha 1.00 start_token 1.00
75000 ➜ numeric 1.00 housenumber 0.90 postcode 1.00
France ➜ alpha 1.00 end_token 1.00
----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Paris ➜ given_name 1.00 surname 1.00 area 1.00 locality 1.00
France ➜ given_name 1.00 surname 1.00 area 1.00 country 0.90
================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.96) ➜ [ { locality: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]
```
`central_european_streets`:

```
================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT ➜ Paris 75000, France
SECTIONS ➜ Paris 75000 0:11 France 12:19
S0 TOKENS ➜ Paris 0:5 75000 6:11
S1 TOKENS ➜ France 13:19
S0 PHRASES ➜ Paris 75000 0:11 Paris 0:5 75000 6:11
S1 PHRASES ➜ France 13:19
================================================================
CLASSIFICATIONS (6ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Paris ➜ alpha 1.00 start_token 1.00 street 0.50
75000 ➜ numeric 1.00 housenumber 0.90 postcode 1.00
France ➜ alpha 1.00 end_token 1.00
----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Paris ➜ given_name 1.00 surname 1.00 area 1.00 locality 1.00
France ➜ given_name 1.00 surname 1.00 area 1.00 country 0.90
================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.96) ➜ [ { locality: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]
(0.79) ➜ [ { street: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]
(0.77) ➜ [ { street: 'Paris' },
  { housenumber: '75000' },
  { country: 'France' } ]
```
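The word-vs-phrase rule described above could be sketched roughly like this (hypothetical names and data shapes, not the actual pelias/parser API): a tentative `street` classification on a word is suppressed when the phrase containing it already carries a confident public classification such as `locality`.

```javascript
// Hypothetical sketch, not the real classifier: skip the street guess
// for a word when its enclosing phrase is already a confident public
// place (locality, region, country, area).
const PUBLIC_LABELS = ['locality', 'region', 'country', 'area'];

function phraseHasPublicClassification(phraseClassifications) {
  return phraseClassifications.some(
    (c) => PUBLIC_LABELS.includes(c.label) && c.confidence >= 0.9
  );
}

function shouldClassifyWordAsStreet(phraseClassifications) {
  // 'Paris' the word is alpha, but the phrase 'Paris' is locality 1.00,
  // so the tentative street 0.50 guess would be suppressed here.
  return !phraseHasPublicClassification(phraseClassifications);
}

console.log(shouldClassifyWordAsStreet([
  { label: 'given_name', confidence: 1.0 },
  { label: 'locality', confidence: 1.0 },
])); // → false
```

Under this rule the `(0.79)` and `(0.77)` street solutions above would never be generated for `Paris`.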
Yeah agreed, it should ensure that the tokens have no public classifications at all. It's a really tricky case to handle without a gazetteer and/or a geocoder.

There is a street I cycle past quite often called `Esplanade`, and I'm wondering how we will ever be able to correctly parse those addresses, e.g. `Esplanade 17, 13187 Berlin, Germany`.

Maybe we also add a check that the `housenumber` span doesn't also have a `postcode` classification.
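The check suggested here could look something like the following sketch (illustrative function and data shapes, not the parser's real internals): a candidate solution is rejected when the span it uses as a `housenumber` also carries a `postcode` classification.

```javascript
// Hypothetical sketch: reject a candidate solution when the span chosen
// as housenumber was also classified as postcode, as with '75000' in
// 'Paris 75000, France'.
function housenumberConflictsWithPostcode(solution, labelsByToken) {
  return solution.some((entry) => {
    if (!('housenumber' in entry)) return false;
    const labels = labelsByToken[entry.housenumber] || [];
    return labels.includes('postcode');
  });
}

// '75000' was classified as both housenumber 0.90 and postcode 1.00,
// so this candidate would be filtered out:
console.log(housenumberConflictsWithPostcode(
  [{ street: 'Paris' }, { housenumber: '75000' }, { country: 'France' }],
  { '75000': ['numeric', 'housenumber', 'postcode'] }
)); // → true
```

Note this would correctly keep the `Esplanade 17` case, since `17` carries no `postcode` classification.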
Nice, your PR seems to work for `Esplanade` too! (which is a street prefix in French)

```
$ node bin/cli.js Esplanade 17, 13187 Berlin, Germany
```
```
================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT ➜ Esplanade 17, 13187 Berlin, Germany
SECTIONS ➜ Esplanade 17 0:12 13187 Berlin 13:26 Germany 27:35
S0 TOKENS ➜ Esplanade 0:9 17 10:12
S1 TOKENS ➜ 13187 14:19 Berlin 20:26
S2 TOKENS ➜ Germany 28:35
S0 PHRASES ➜ Esplanade 17 0:12 Esplanade 0:9 17 10:12
S1 PHRASES ➜ 13187 Berlin 14:26 13187 14:19 Berlin 20:26
S2 PHRASES ➜ Germany 28:35
================================================================
CLASSIFICATIONS (4ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Esplanade ➜ alpha 1.00 start_token 1.00 street_prefix 1.00 street 0.50
17 ➜ numeric 1.00 housenumber 1.00
13187 ➜ numeric 1.00 housenumber 0.20 postcode 1.00
Berlin ➜ alpha 1.00
Germany ➜ alpha 1.00 end_token 1.00
----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Berlin ➜ surname 1.00 area 1.00 locality 1.00 region 1.00
Germany ➜ area 1.00 country 0.90
================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.82) ➜ [ { street: 'Esplanade' },
  { housenumber: '17' },
  { postcode: '13187' },
  { locality: 'Berlin' },
  { country: 'Germany' } ]
(0.82) ➜ [ { street: 'Esplanade' },
  { housenumber: '17' },
  { postcode: '13187' },
  { region: 'Berlin' },
  { country: 'Germany' } ]
```
I just added two more test cases. I also added some code to check the parent phrases but it caused one test to fail, so I'm thinking we just leave it as-is for now?
Adds a new `CentralEuropeanStreetNameClassifier` which is able to handle the cases mentioned in https://github.com/pelias/parser/issues/83. It's still fairly basic, but relatively safe.

In the future we may consider expanding this to also cover `1 xxx` instead of `xxx 1` (although this might be dangerous?).

closes: https://github.com/pelias/parser/issues/83
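The possible future extension mentioned above (accepting the housenumber before the street as well as after it) could be sketched like this; the regexes and return shape are illustrative only, not the classifier's actual implementation:

```javascript
// Hypothetical sketch of recognising both orderings of street and
// housenumber within a section: 'Foostraße 1' and '1 Foostraße'.
const LEADING = /^(\d+[a-z]?)\s+(\D+)$/i;   // '1 Foostraße'
const TRAILING = /^(\D+?)\s+(\d+[a-z]?)$/i; // 'Foostraße 1'

function parseStreetSection(section) {
  let m = section.match(TRAILING);
  if (m) return { street: m[1], housenumber: m[2] };
  m = section.match(LEADING);
  if (m) return { street: m[2], housenumber: m[1] };
  return null; // neither ordering matched
}

console.log(parseStreetSection('Esplanade 17')); // street 'Esplanade', housenumber '17'
console.log(parseStreetSection('1 Esplanade'));  // street 'Esplanade', housenumber '1'
```

The danger alluded to in the description is visible even in this toy version: a leading number is far more ambiguous than a trailing one (it could be a postcode fragment or a unit number), so such a rule would likely need a lower confidence or extra guards.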