openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License
4.06k stars 417 forks

problem with Japanese addresses #62

Closed markusr closed 7 years ago

markusr commented 8 years ago

Hi,

I am trying to parse addresses from PATSTAT (EPO Worldwide Patent Statistical Database). For most countries the parser works fine but I have problems with the Japanese ones.

E.g. postal.parser.parse_address("135, Higashifunahashi 2 Chome, Hirakata-shi Osaka-fu")

[(u'135', u'house_number'), (u'higashifunahashi 2 chome', u'road'), (u'hirakata-shi', u'house'), (u'osaka-fu', u'road')]

Is the input format wrong? I can provide more input strings if needed. I use the python wrapper.
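For reference, the tuple list returned by the Python wrapper can be grouped into a label-keyed dict with plain Python. This is a minimal sketch using only the sample output above; the helper name is mine, not part of libpostal:

```python
def tuples_to_dict(parsed):
    """Group libpostal-style (token, label) tuples into a dict,
    concatenating tokens that share a label, in order."""
    result = {}
    for token, label in parsed:
        if label in result:
            result[label] += " " + token
        else:
            result[label] = token
    return result

# Sample output copied from the parse above (note the mislabeled fields).
parsed = [(u'135', u'house_number'), (u'higashifunahashi 2 chome', u'road'),
          (u'hirakata-shi', u'house'), (u'osaka-fu', u'road')]
print(tuples_to_dict(parsed))
```

The grouping makes the mislabeling easy to spot: both "higashifunahashi 2 chome" and "osaka-fu" end up under "road".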

albarrentine commented 8 years ago

Interesting use case.

The current parser was not trained on Japanese addresses because the address format varies with the script being used (in Han script the address is written in the reverse order from its Latin-script form), and at first the configs in the address-formatting repo, which we use to transform OpenStreetMap tags into structured addresses, couldn't accommodate that.

I've since made a pull request which allows us to vary the format by language, so when the new version of the parser is trained it will train on many examples in Japan.

That said, the address you posted is a little more complicated. Most Japanese addresses that we have from OSM are written only in Kanji/Katakana, rather than Romaji, so even the new parser might not do well on that particular form. It will be trained on addresses that look like: 日本国 〒112-0001東京都文京区白山4丁目3-2田中 太郎 様

For Romaji, many of the larger administrative areas in OSM have name:ja_rm and name:en listed out in addition to their default Kanji names, so forms like "Hirakata-shi" and "Osaka-fu" would probably make it into the training set. For the neighborhoods, smaller districts, etc. we'd have to transliterate Kanji, which libpostal currently doesn't support (see this thread in ICU for why not; we use the same underlying data as ICU for transliteration). It could possibly be done using a statistical word segmenter such as MeCab, but that's not likely to happen anytime soon. Words like "chome" that are in our Japanese dictionaries could be segmented correctly, and the parser could probably still do something useful with that information, but no promises.

If you have lat/lons for the addresses in your data set, and are willing to contribute some of them to OpenStreetMap, libpostal will pull them in automatically and use them as training data. The addresses would need to be segmented and labeled like:

addr:housenumber=135
addr:street=Higashifunahashi 2 Chome
addr:city=Hirakata-shi

Though most Japanese streets don't have names, I think the convention in OSM is to call "Higashifunahashi 2 Chome" the street name even though as far as I can tell it's more like a neighborhood or district. It's usually ok to omit the city, prefecture, etc. as we can simply reverse geocode to the containing polygons to get some of those names.
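The tag lines above use OSM's plain key=value form. As a quick illustration (the function name is mine, not an OSM or libpostal API), such lines can be collected into a dict like so:

```python
def parse_osm_tags(lines):
    """Parse 'addr:housenumber=135'-style OSM tag lines into a dict
    keyed by the part after 'addr:' (e.g. 'housenumber')."""
    tags = {}
    for line in lines:
        key, _, value = line.partition("=")
        if key.startswith("addr:") and value:
            tags[key[len("addr:"):]] = value
    return tags

example = [
    "addr:housenumber=135",
    "addr:street=Higashifunahashi 2 Chome",
    "addr:city=Hirakata-shi",
]
print(parse_osm_tags(example))
```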

Alternatively, if you can label some addresses using libpostal's tags and post them here, I can incorporate them into the next model training. That format looks like:

135/house_number Higashifunahashi/road 2/road Chome/road Hirakata-shi/city Osaka-fu/state

Something like 50-100 examples from various regions should allow the model to learn useful patterns for parsing similar types of addresses. libpostal is a pure language model, so it doesn't need to see every conceivable address, just the important common structures.
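A minimal sketch of how a labeled line in that format could be split back into (token, label) pairs; the helper name is mine, not part of libpostal's tooling:

```python
def parse_labeled(line):
    """Split a whitespace-tokenized training line like
    '135/house_number Chome/road' into (token, label) pairs.
    The label is everything after the LAST '/', so tokens
    such as 'Hirakata-shi' survive intact."""
    pairs = []
    for tok in line.split():
        token, _, label = tok.rpartition("/")
        pairs.append((token, label))
    return pairs

line = ("135/house_number Higashifunahashi/road 2/road "
        "Chome/road Hirakata-shi/city Osaka-fu/state")
print(parse_labeled(line))
```

Using `rpartition` rather than a plain split matters here, since tokens themselves may contain hyphens or other punctuation but never a trailing slash before the label.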

markusr commented 8 years ago

Thank you for your detailed response. I understand the problem in general.

> For Romaji, many of the larger administrative areas in OSM have name:ja_rm and name:en listed out in addition to their default Kanji names, so forms like "Hirakata-shi" and "Osaka-fu" would probably make it into the training set. For the neighborhoods, smaller districts, etc. we'd have to transliterate Kanji, which libpostal currently doesn't support (see this thread in ICU for why not; we use the same underlying data as ICU for transliteration). It could possibly be done using a statistical word segmenter such as MeCab, but that's not likely to happen anytime soon. Words like "chome" that are in our Japanese dictionaries could be segmented correctly, and the parser could probably still do something useful with that information, but no promises.

For my use case it would be enough that city and state are correctly detected.

> If you have lat/lons for the addresses in your data set, and are willing to contribute some of them to OpenStreetMap, libpostal will pull them in automatically and use them as training data. The addresses would need to be segmented and labeled like:

Unfortunately there are no geo positions in the data set.

> Alternatively, if you can label some addresses using libpostal's tags and post them here, I can incorporate them into the next model training.

I tried to do the manual labeling, but I was unsure about some entries. There are certain tokens in the strings, like 'CITY', where I did not know what to put, so I labeled them unknown.

training_database_jp.xlsx

Can you check the xlsx and give me some feedback on what I need to change?

Greetings Markus

P.S.: I also found this blog but I don't know if it fits.

albarrentine commented 7 years ago

Hey @markusr,

libpostal 1.0 is out today and has been trained on millions of Japanese addresses, primarily in Kanji but also in Romaji as above. New parser result:

> 135, Higashifunahashi 2 Chome, Hirakata-shi Osaka-fu

Result:

{
  "house_number": "135",
  "road": "higashifunahashi",
  "suburb": "2 chome",
  "city": "hirakata-shi",
  "state": "osaka-fu"
}

Should work pretty well for Japan overall, especially for fields like city and state. There can still be a few small issues with the suburb portion (arguably "higashifunahashi" should also be part of the suburb name, not the street), because not every form of every name will necessarily have been seen by libpostal in training, especially the Romaji versions, so the parser may tend to classify words it hasn't seen before as roads. There's probably some more work to do on Japanese addresses in general, but at least the parser can now make some sense of them.