huu4ontocord / rio

Text pre-processing for NLP datasets
Apache License 2.0

Add libpostal address detection #23

Open huu4ontocord opened 2 years ago

huu4ontocord commented 2 years ago

Add a regex for basic potential addresses, such as a \d+ followed by \s+, a \w{5,30}, a comma, and then another \d+. Then test that there are no stopwords within the \w run, and feed the whole match to libpostal to check whether it is an address. Libpostal will label the components (house, road, etc.). We need to check that there is a road or a similar component; "house" alone doesn't really tell us anything, as it is almost always returned.
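The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the \w run is allowed to contain spaces so that multi-word street names match, the stopword list is a tiny placeholder, and `find_addresses`, `STOPWORDS`, and `REQUIRED_LABELS` are hypothetical names. The parser is passed in as a callable returning (value, label) pairs, which is the shape returned by `parse_address` in the libpostal Python bindings.

```python
import re

# Candidate pattern from the description: digits, whitespace, a run of 5-30
# word characters, a comma, then more digits.
# ASSUMPTION: spaces are allowed inside the run so "Main Street" can match.
CANDIDATE_RE = re.compile(r"\b\d+\s+(?P<body>\w[\w ]{4,29}),\s*\d+\b")

# ASSUMPTION: placeholder stopword list; a real one would be per language.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "at", "to", "for", "is"}

# Labels that must appear for a candidate to count as an address.
# ASSUMPTION: only "road" is required here; "house" is deliberately ignored.
REQUIRED_LABELS = {"road"}


def find_addresses(text, parse):
    """Return candidate spans that the parser labels with a required component.

    `parse` is any callable that takes a string and returns (value, label)
    pairs, e.g. postal.parser.parse_address when libpostal is installed.
    """
    hits = []
    for match in CANDIDATE_RE.finditer(text):
        words = match.group("body").lower().split()
        # Skip ordinary prose such as "3 boxes of the apples, 20".
        if any(word in STOPWORDS for word in words):
            continue
        labels = {label for _, label in parse(match.group(0))}
        if labels & REQUIRED_LABELS:
            hits.append(match.group(0))
    return hits
```

With libpostal installed, the call would look like `find_addresses(text, parse_address)` after `from postal.parser import parse_address`. Keeping the parser injectable also makes the regex and stopword steps testable without the libpostal data files.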

huu4ontocord commented 2 years ago

This test should be done after the stdnum and date tests. That way, we know the match is neither a stdnum nor a date.

huu4ontocord commented 2 years ago

@paulovn

paulovn commented 2 years ago

The idea looks fine to me, and using libpostal while excluding house seems a sane approach.

The exception is the base regex, which in reality is not fully language-independent. The current one proposed is

That may work for US English, though it might miss the state (e.g. 10 Boulevard Rd Los Angeles, CA 38718). But in British addresses, for example, the postcode has numbers AND letters: 71 Cherry Court Southampton SO53 5PD

And other languages can diverge even more. These are a few examples of sequences I came up with:

Other countries might fit these patterns; for instance, Portugal is quite similar to Spain (but Brazil puts the postcode after the city). The point is, we would need a library of regexes by language/country. I would rather make them liberal (e.g. don't force a comma) to fit more variations.
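A per-country library of liberal patterns might look something like this. It is a minimal sketch, not a proposal for the final regexes: the two patterns are derived only from the US and British examples quoted above, the country-code keys are an assumption, and `ADDRESS_PATTERNS` and `candidate_countries` are hypothetical names.

```python
import re

# ASSUMPTION: illustrative patterns keyed by country code. The comma is
# optional in both, so variations without it still match.
ADDRESS_PATTERNS = {
    # US: house number, free text, optional two-letter state, 5-digit ZIP.
    "US": re.compile(
        r"\b\d+\s+[\w .'-]{5,60}?,?\s+(?:[A-Z]{2}\s+)?\d{5}\b"
    ),
    # GB: house number, free text, postcode mixing letters and digits.
    "GB": re.compile(
        r"\b\d+\s+[\w .'-]{5,60}?,?\s+[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"
    ),
}


def candidate_countries(text):
    """Return the country codes whose pattern finds a candidate in `text`."""
    return [cc for cc, pattern in ADDRESS_PATTERNS.items() if pattern.search(text)]
```

For the two examples above, `candidate_countries("10 Boulevard Rd Los Angeles, CA 38718")` selects the US pattern and `candidate_countries("71 Cherry Court Southampton SO53 5PD")` selects the GB one. Any hit would still go to libpostal for confirmation, so the patterns can afford to over-match.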

The Wikipedia page about addresses has a good overview: https://en.wikipedia.org/wiki/Address

huu4ontocord commented 2 years ago

@shamikbose see above ^^.

shamikbose commented 2 years ago

Thanks, @ontocord! I will look into it this weekend.