Need help in contributing to ms dictionary

openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

MIT License

Need help in contributing to ms dictionary #121

Open Jeffrey04 opened 7 years ago

Jeffrey04 commented 7 years ago

Malaysian addresses are mainly written in Malay. However, it is not uncommon to see certain parts written in other languages (always romanized), e.g.

Romanized Chinese: Chan Sow Lin

IBS Centre, Level 1, Block E, Lot 8, Jalan Chan Sow Lin, 55200 Kuala Lumpur, Malaysia

Romanized (possibly) Tamil: Sambanthan

Menara Shell. No. 211 Jalan Tun Sambanthan. 50470 Kuala Lumpur

Sometimes, certain words can be written in either English or Malay, e.g.

Damansara Heights (English) instead of Bukit Damansara (Malay)

United Nations Malaysia, Wisma UN, Block C Kompleks Pejabat Damansara Jalan Dungun, Damansara Heights 50490 Kuala Lumpur

Another example: kondo/condo, blok/block

1-12-04, tingkat 12, blok 1, Kondo Rakyat Desa Pantai, Jalan Pantai Murni 2, Pantai, 59200 Kuala Lumpur

can be rewritten as

1-12-04, 12th floor, block 1, Rakyat Desa Pantai Condo, Jalan Pantai Murni 2, Pantai, 59200 Kuala Lumpur

As far as I can see, the dictionary has no particular order (other than that each entry should start with the full spelling of the term, followed by its abbreviations). Is there anything I need to do to allow libpostal to handle our addresses? For example, is this necessary?

condominium|kondominium|condo|kondo

Or is

kondominium|kondo

already sufficient?

Also, having worked exclusively with Malaysian addresses, I am not sure whether this applies to Brunei (ms-BN) too.
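To make the dictionary question concrete, here is a minimal Python sketch of how a pipe-delimited entry could be read, assuming only what is stated above: the first term is the full spelling and the rest are variants. The function names `parse_dictionary_line` and `build_lookup` are hypothetical and are not libpostal's API, and the `block|blok` entry is made up for illustration.

```python
def parse_dictionary_line(line):
    """Split one pipe-delimited entry into (canonical, variants)."""
    terms = [t.strip() for t in line.split("|") if t.strip()]
    if not terms:
        raise ValueError("empty dictionary entry")
    return terms[0], terms[1:]


def build_lookup(lines):
    """Map every spelling (canonical or variant) to its canonical form."""
    lookup = {}
    for line in lines:
        canonical, variants = parse_dictionary_line(line)
        for term in [canonical] + variants:
            lookup[term.lower()] = canonical
    return lookup


# Hypothetical entries; the first is the one proposed in this issue.
lookup = build_lookup(["condominium|kondominium|condo|kondo", "block|blok"])
print(lookup["kondo"])  # condominium
```

Under this reading, the shorter entry `kondominium|kondo` would simply not contain the English spellings, so a lookup for "condo" would find nothing in this dictionary.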

albarrentine commented 7 years ago

Ok, so as far as the parser goes, the version that's currently in master is probably not going to work very well for your task because it doesn't handle any sub-building information (12th floor, block 1, apartment numbers, etc.). It's only designed to deal with the fields that are commonly used in OSM, e.g. house_number, road, city, postcode, etc. I'd also say that Malaysia is not very well-represented in OSM relative to the number of people that live there, so there are maybe 10k training addresses in OSM, which is orders of magnitude fewer than for other countries. It's still possible to get pseudo-addresses from the road network, intersections, etc.

There's a huge rewrite of the parser being done in the parser-data branch. It's a work-in-progress and may not be ready to use for a little bit, but when it is, it includes several new address components such as unit and level. The way we accomplish that is by randomly generating sub-building information in different languages using the dictionaries, so if we get "house_number=211" and "road=Jalan Tun Sambanthan" from OSM, we might generate "apt 2" and "5th floor". There are detailed YAML configs that do this for each language. Malay is not currently included in these configs because, as you say, the sub-building info is written in English most of the time. It would be easy to copy the English config and use the slightly expanded spellings from the Malay dictionaries (in that case, you would write all the synonyms like condominium|kondominium|condo|kondo in the Malay dictionary - that means if we see "condominium" in OSM where they don't abbreviate, we can sample randomly from the dictionary and occasionally produce "kondo", etc.) Then there are some additional formatting configs which are used to specify the order of the newly added components (level comes after house number, etc.)
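The augmentation idea described above can be sketched roughly as follows. This is an illustration only, not the code in the parser-data branch: the function names, the sampling probability, and the unit/level ranges are all assumptions, and the real pipeline is driven by per-language YAML configs that are not reproduced here.

```python
import random

# Synonym group taken from the dictionary entry discussed in this issue.
SYNONYMS = [["condominium", "kondominium", "condo", "kondo"]]


def ordinal(n):
    """Return an English ordinal such as '5th' or '21st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def swap_synonym(word, rng, prob=0.3):
    """With probability `prob` (an assumed value), pick a random spelling from the word's group."""
    for group in SYNONYMS:
        if word.lower() in group and rng.random() < prob:
            return rng.choice(group)
    return word


def add_sub_building(components, rng):
    """Attach randomly generated unit and level fields to OSM-style components."""
    augmented = dict(components)
    augmented["unit"] = f"apt {rng.randint(1, 20)}"              # assumed range
    augmented["level"] = f"{ordinal(rng.randint(1, 30))} floor"  # assumed range
    return augmented


rng = random.Random(0)
print(add_sub_building({"house_number": "211", "road": "Jalan Tun Sambanthan"}, rng))
print(swap_synonym("condominium", rng))
```

A Malay config would presumably swap the English phrases for entries sampled from the Malay dictionaries, which is why listing all synonyms in one entry matters for this step.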

When that's ready to deploy, it should do reasonably well on Malaysian addresses, although I'd be more confident if there were more data in OSM. If you have any address data sets that can be contributed to either OSM or OpenAddresses, that would be another way to improve libpostal.

Jeffrey04 commented 7 years ago

Unfortunately I don't own the data, so I cannot contribute it. Also, most of my data does not even have coordinates attached.

I am actually also building a parser based on what I have (not open, and also part of my job). One thing I find frustrating about unit numbers for high-rise buildings is that they are written in every possible permutation ($block-$level-$number, $number.$level.$block, $block$level$number, etc.).
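A small Python sketch of the permutation problem, using the three patterns listed above and the block 1, level 12, unit 04 example from the first comment. The helper `render_unit` is hypothetical and exists only to show why the formats are hard to reverse.

```python
from string import Template

# The three orderings mentioned above.
UNIT_TEMPLATES = [
    "$block-$level-$number",
    "$number.$level.$block",
    "$block$level$number",
]


def render_unit(block, level, number):
    """Render one unit in every template, returning a list of strings."""
    fields = {"block": block, "level": level, "number": number}
    return [Template(t).substitute(fields) for t in UNIT_TEMPLATES]


print(render_unit("1", "12", "04"))
# ['1-12-04', '04.12.1', '11204']

# The separator-free format is ambiguous: block 11, level 2 renders identically.
print(render_unit("11", "2", "04")[2])
# 11204
```

Even with separators, nothing in the string itself says which position holds the block, so a parser would need outside knowledge of the building's convention.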

While I cannot provide addresses, I can certainly start contributing to the dictionary (aiming to send in a pull request today or early tomorrow). I probably can also start contributing random addresses to OpenAddresses from my self-composed test-set once I have them.

albarrentine commented 7 years ago

Hey @Jeffrey04, libpostal 1.0 has been released, and can handle units, floors, etc.

The block format used in Malaysia and Singapore is still something I'm trying to figure out how to handle consistently around the world, so that's not part of the release.

For parsing house numbers like "1-12-04", we're just going to call that the house number and not try to break it up into its constituent parts (if that's hard in one language, it's almost impossible to do across languages).

Jeffrey04 commented 7 years ago

@thatdatabaseguy not sure about singapore, but in malaysia, based on the data i have in hand, there is no consistent block format.

and regarding breaking up the house numbers... if you come up with a solution, i can probably always find a counterexample for it S:

anyway i will check the dictionary and see what i can contribute (:

bassrock commented 6 years ago

The USPS provides some logic on how to read high-rise addresses in the US: https://ribbs.usps.gov/cassmassguidelines/CASS%20and%20MASS%20Guidelines/508Version/address_match_parsing_highrise_building.htm

albarrentine commented 6 years ago

@bassrock again in that case, "12-A" would be a single token which the parser model would likely label house_number, which is accurate enough as far as the parsing task is concerned (and the post office standard isn't even totally consistent with reality there - it's just as easy to find a "12-A" that's simply a different side of the building and was added on after the street was numbered, so sometimes it can be considered a distinct address from "12", really can't know for sure without looking at the building). As I've mentioned before in other issues, users can always feel free to implement their own country-specific splitting on the house_number tag from libpostal as a post-processing step, but we're not going to try to break it down further and guess the meaning in libpostal, at least not in the parser model.

The parser has no explicit knowledge of countries (it's one global model), so to parse compound house numbers in an international way, we'd at minimum have to predict country given the text, which is more difficult than predicting language for all the various types of input libpostal has to handle e.g. if the input is only a house number and street name. For instance, it's not clear from "12-A E 27th St" whether this is the US or Canada or some other English-speaking country. If it's Canada, the apartment number would be on the left of the hyphen, not the right, so even in that relatively simple case a rules-based approach breaks down pretty quickly.

A machine learning approach would require substantive changes to how our model works, either allowing multiple distinct labels per token, and/or changing the tokenizer to break up numeric hyphens. In either case, we'd have to re-label all of the training data for all countries. It would be a fairly major effort, and I'm unclear on the benefits or goals of doing this in the parser instead of a post-processing step as suggested.
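For anyone who wants to try the post-processing route, here is a minimal Python sketch. It assumes the parser output is available as a list of (value, label) pairs and that the caller already knows the country, since the parser does not. The function `split_house_number` and the `UNIT_SIDE_BY_COUNTRY` table are hypothetical; the left/right rule is only the simplified US/Canada contrast described above, not a postal standard.

```python
# Assumed convention per country: which side of the hyphen holds the unit.
UNIT_SIDE_BY_COUNTRY = {"us": "right", "ca": "left"}


def split_house_number(parsed, unit_side):
    """Split single-hyphen house numbers into house_number and unit.

    parsed: list of (value, label) tuples.
    unit_side: 'left' or 'right', the side of the hyphen holding the unit.
    Values with no hyphen or more than one hyphen are left untouched.
    """
    result = []
    for value, label in parsed:
        if label == "house_number" and value.count("-") == 1:
            left, right = value.split("-")
            if unit_side == "left":
                unit, number = left, right
            else:
                unit, number = right, left
            result.append((number, "house_number"))
            result.append((unit, "unit"))
        else:
            result.append((value, label))
    return result


parsed = [("12-a", "house_number"), ("e 27th st", "road")]
print(split_house_number(parsed, UNIT_SIDE_BY_COUNTRY["us"]))
# [('12', 'house_number'), ('a', 'unit'), ('e 27th st', 'road')]
```

A value like "1-12-04" passes through unchanged here, which matches the decision above to treat it as a single house number unless the caller has a country-specific rule for it.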