parse_address not working well for Japanese addresses

Hi!

I've been trying out libpostal and have a question. I read that the library supports parsing Japanese addresses, but when I tried it, the results were not good. I'd appreciate some feedback from the awesome contributors. (I tried addresses from other countries, and it works really well!)

My country is

US, but I'm using it for parsing Japanese addresses

Here's how I'm using libpostal

To help extract addresses from small business owners' websites

Here's what I did

from postal.parser import parse_address
text = '〒100-8994 東京都千代田区丸ノ内2-7-2'
parse_address(text)

Here's what I got

[('〒100-8994', 'postcode'), ('東', 'city'), ('京都千代田', 'city_district'), ('区', 'city'), ('丸ノ内', 'road'), ('2-7-2', 'house_number')]

Here's what I was expecting

The postcode is correct, but "東京都" (Tokyo Metropolis) should be parsed as the city and "千代田区" (Chiyoda ward) as the city district.

Here are a few other examples

Example 1 input:
text = '〒550-0002 大阪府大阪市西区江戸堀１丁目１８番２１号'
parse_address(text)

output: [('〒550-0002', 'postcode'), ('大', 'state'), ('阪', 'city'), ('府大阪市西', 'city_district'), ('区', 'city'), ('江戸堀', 'house'), ('１丁目', 'suburb'), ('１８番', 'house_number'), ('２１号', 'city_district')]

expected/correct parsing: 〒550-0002 大阪府 / 大阪市 / 西区 / 江戸堀 / １丁目１８番２１号

Example 2 input:
text = '〒064-0809 北海道札幌市中央区南９条西３丁目２−5'
parse_address(text)

output: [('〒064-0809', 'postcode'), ('北', 'state'), ('海', 'city'), ('道札幌市中央区南９条西', 'road'), ('３丁目', 'suburb'), ('２-5', 'house_number')]

expected/correct parsing: 〒064-0809 北海道 / 札幌市 / 中央区 / 南９条西 / ３丁目２−5

Example 3 input:
text = '〒604-8064 京都府京都市中京区骨屋之町560 離れ'
parse_address(text)

output: [('〒604-8064', 'postcode'), ('京', 'state'), ('都', 'city'), ('府京都市中京区', 'city_district'), ('骨屋之町', 'road'), ('560', 'house_number'), ('離れ', 'road')]

expected/correct parsing: 〒604-8064 京都府 / 京都市 / 中京区 / 骨屋之町 / 560 離れ

Example 4 input:
text = '〒460-0031 愛知県名古屋市中区本丸１−1'
parse_address(text)

output: [('〒460-0031', 'postcode'), ('愛', 'state'), ('知県名古屋市中', 'city'), ('区', 'city_district'), ('本丸', 'suburb'), ('１-1', 'house_number')]

expected/correct parsing: 〒460-0031 愛知県 / 名古屋市 / 中区 / 本丸 / １−1
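
In case it helps with reproducing this, here is a minimal sketch that runs all five inputs from this report and lists which expected administrative segments do not come back as a whole component. The inputs and expected segments are copied from the examples above; `CASES`, `missing_segments`, and `run` are hypothetical names made up for this sketch and are not part of libpostal or pypostal. It only checks component values, not labels.

```python
# Reproduction sketch. Inputs and expected segments come from this report.
# CASES, missing_segments and run are hypothetical helpers, not libpostal APIs.

CASES = [
    ("〒100-8994 東京都千代田区丸ノ内2-7-2", ["東京都", "千代田区"]),
    ("〒550-0002 大阪府大阪市西区江戸堀１丁目１８番２１号", ["大阪府", "大阪市", "西区", "江戸堀"]),
    ("〒064-0809 北海道札幌市中央区南９条西３丁目２−5", ["北海道", "札幌市", "中央区", "南９条西"]),
    ("〒604-8064 京都府京都市中京区骨屋之町560 離れ", ["京都府", "京都市", "中京区", "骨屋之町"]),
    ("〒460-0031 愛知県名古屋市中区本丸１−1", ["愛知県", "名古屋市", "中区", "本丸"]),
]


def missing_segments(parsed, expected):
    """Return expected segments that are not emitted as a whole component value."""
    values = {value for value, _label in parsed}
    return [segment for segment in expected if segment not in values]


def run(parse):
    """Parse every case with the given callable and report missing segments."""
    rows = []
    for text, expected in CASES:
        parsed = parse(text)
        missing = missing_segments(parsed, expected)
        rows.append((text, parsed, missing))
        print(text)
        print("  parsed :", parsed)
        print("  missing:", missing)
    return rows


# Usage, assuming pypostal is installed:
#   from postal.parser import parse_address
#   run(parse_address)
```

Passing the parser in as an argument keeps the sketch importable without libpostal, so the same cases can be rerun against a different build or model.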

For parsing issues, please answer "yes" or "no" to all that apply.

Does the input address exist in OpenStreetMap?
Do all the toponyms exist in OSM (city, state, region names, etc.)?
If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result?
If the address does not contain city, region, etc., does adding those fields to the input improve the result?
If the address contains apartment/floor/sub-building information or uncommon formatting, does removing that help? Is there any minimum form of the address that gets the right parse?
