openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

parse_address not working well for Japanese addresses #598

Open XiushuangLi opened 2 years ago

XiushuangLi commented 2 years ago

Hi!

I was checking out libpostal and have a question. I read that the library supports parsing Japanese addresses, but when I tried it, it did not seem to work well. I would like to get some feedback from the awesome contributors. (I tried it on addresses from other countries, and it works really well!)


My country is

US, but I'm using it for parsing Japanese addresses


Here's how I'm using libpostal

To help extract addresses from small business owners' websites


Here's what I did

text = '〒100-8994 東京都千代田区丸ノ内2-7-2'
parse_address(text)


Here's what I got

[('〒100-8994', 'postcode'), ('東', 'city'), ('京都千代田', 'city_district'), ('区', 'city'), ('丸ノ内', 'road'), ('2-7-2', 'house_number')]


Here's what I was expecting

The postcode is correct, but "東京都" (Tokyo Metropolis) should be tagged as the city, and "千代田区" (Chiyoda Ward) should be tagged as the city district.
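To make the mismatch concrete, here is a minimal sketch that compares token boundaries in the output reported above with the expected split of "東京都千代田区". It does not call libpostal. The `reported` list is copied from the output above, and the `boundaries` helper is a hypothetical name used only for illustration.

```python
# Output reported above, copied verbatim.
reported = [('〒100-8994', 'postcode'), ('東', 'city'), ('京都千代田', 'city_district'),
            ('区', 'city'), ('丸ノ内', 'road'), ('2-7-2', 'house_number')]

# Expected split of the administrative part of the address.
expected_admin = ['東京都', '千代田区']


def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    offsets, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        offsets.add(pos)
    return offsets


# The three tokens after the postcode cover the administrative part.
reported_admin = [value for value, _ in reported[1:4]]

# No characters were lost: only the boundaries (and labels) differ.
assert ''.join(reported_admin) == ''.join(expected_admin)

missing = boundaries(expected_admin) - boundaries(reported_admin)
spurious = boundaries(reported_admin) - boundaries(expected_admin)
print("missing boundaries:", sorted(missing))
print("spurious boundaries:", sorted(spurious))
```

The expected boundary after "東京都" (offset 3) is missing, while the reported output has extra boundaries after "東" and before "区".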

Here are a few other examples

Example 1 input:
text = '〒550-0002 大阪府大阪市西区江戸堀1丁目18番21号'
parse_address(text)

output: [('〒550-0002', 'postcode'), ('大', 'state'), ('阪', 'city'), ('府大阪市西', 'city_district'), ('区', 'city'), ('江戸堀', 'house'), ('1丁目', 'suburb'), ('18番', 'house_number'), ('21号', 'city_district')]

expected/correct parsing: 〒550-0002 大阪府 / 大阪市 / 西区 / 江戸堀 / 1丁目18番21号

Example 2 input:
text = '〒064-0809 北海道札幌市中央区南9条西3丁目2−5'
parse_address(text)

output: [('〒064-0809', 'postcode'), ('北', 'state'), ('海', 'city'), ('道札幌市中央区南9条西', 'road'), ('3丁目', 'suburb'), ('2-5', 'house_number')]

expected/correct parsing: 〒064-0809 北海道 / 札幌市 / 中央区 / 南9条西 / 3丁目2−5

Example 3 input:
text = '〒604-8064 京都府京都市中京区骨屋之町560 離れ'
parse_address(text)

output: [('〒604-8064', 'postcode'), ('京', 'state'), ('都', 'city'), ('府京都市中京区', 'city_district'), ('骨屋之町', 'road'), ('560', 'house_number'), ('離れ', 'road')]

expected/correct parsing: 〒604-8064 京都府 / 京都市 / 中京区 / 骨屋之町 / 560 離れ

Example 4 input:
text = '〒460-0031 愛知県名古屋市中区本丸1−1'
parse_address(text)

output: [('〒460-0031', 'postcode'), ('愛', 'state'), ('知県名古屋市中', 'city'), ('区', 'city_district'), ('本丸', 'suburb'), ('1-1', 'house_number')]

expected/correct parsing: 〒460-0031 愛知県 / 名古屋市 / 中区 / 本丸 / 1−1
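For reference, the expected splits in these examples follow the usual suffix pattern (都/道/府/県, then 市, then 区, then the town and block number). Below is a rough regex sketch of that pattern. It is not libpostal's method and not a proposed fix, only a way to state the expected segmentation precisely. The names `ADDRESS_RE`, `NUMBER_START_RE`, and `split_japanese_address` are hypothetical, and the rules are assumptions tuned to the five addresses in this issue. They will fail on names that contain 市 or 区 themselves (for example 市川市).

```python
import re

# Hypothetical rule-based splitter; assumptions only, not part of libpostal.
ADDRESS_RE = re.compile(
    r"^(?P<postcode>〒?\d{3}-\d{4})\s*"
    r"(?P<prefecture>東京都|北海道|(?:京都|大阪)府|.{2,3}?県)"
    r"(?P<city>.+?市)?"   # optional: absent for Tokyo's special wards
    r"(?P<ward>.+?区)?"
    r"(?P<rest>.*)$"
)

# The block number starts at the first digit run followed by 丁目, 番,
# a hyphen-like character, whitespace, or the end of the string.
# This keeps the "9" in "南9条西" inside the town name.
NUMBER_START_RE = re.compile(r"\d+(?=丁目|番|[-−‐]|\s|$)")


def split_japanese_address(text):
    """Split an address into postcode, prefecture, city, ward, town, number."""
    m = ADDRESS_RE.match(text)
    if m is None:
        return None
    rest = m.group("rest")
    n = NUMBER_START_RE.search(rest)
    if n:
        town, number = rest[:n.start()], rest[n.start():]
    else:
        town, number = rest, ""
    parts = [m.group("postcode"), m.group("prefecture"),
             m.group("city"), m.group("ward"), town, number]
    # Drop components that are absent (e.g. no 市 level in central Tokyo).
    return [p for p in parts if p]


print(" / ".join(split_japanese_address("〒550-0002 大阪府大阪市西区江戸堀1丁目18番21号")))
# 〒550-0002 / 大阪府 / 大阪市 / 西区 / 江戸堀 / 1丁目18番21号
```

On the inputs in this issue, the sketch reproduces the expected splits listed above, which suggests the desired boundaries are largely determined by the administrative suffixes.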


Here's what I think could be improved