Open huu4ontocord opened 2 years ago
This test should run after the stdnum and date tests. That way, we already know the candidate is neither a stdnum nor a date.
@paulovn
The idea looks fine to me, and using libpostal while excluding "house" seems a sane approach.
The exception is the base regex, which in reality is not fully language independent. The one currently proposed is
number + space + name + comma + number
That may work for US English, though it might miss the state (e.g. 10 Boulevard Rd Los Angeles, CA 38718). But in British addresses, for example, the postcode has numbers AND letters: 71 Cherry Court Southampton SO53 5PD
Other languages can diverge even more. These are a few example sequences I came up with (s = space):
number + s + name + s + street-designator + s + postcode + s + city
street-designator + s + name + s + comma + s + number + [comma + name/number] + s + postcode-number + s + city
name + street-designator + s + number + s + postcode-number + s + city (a special feature of German addresses is the lack of a space between the street name and the street designator, e.g. Hauptstraße)
number + s + street-designator + s + name + s + postcode-number + s + city
Other countries might fit these patterns; for instance, Portugal is quite similar to Spain (but Brazil puts the postcode after the city). The point is, we would need a library of regexes by language/country. I would rather make them liberal (e.g. not require a comma) so they fit more variations.
The Wikipedia page about addresses has a good collection of formats: https://en.wikipedia.org/wiki/Address
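To make the "library of regexes" idea concrete, here is a rough sketch of what a per-country table could look like. Everything in it is an assumption for illustration: the names `ADDRESS_PATTERNS` and `guess_countries`, the country keys, the list of German street designators, and the regexes themselves, which are deliberately liberal (commas are optional) and far from complete.

```python
import re

# One word starting with a letter; may contain dots, apostrophes, hyphens.
NAME = r"[^\W\d_][\w.'-]*"
# One or more such words separated by whitespace.
WORDS = rf"{NAME}(?:\s+{NAME})*"

# Hypothetical per-country table; patterns are illustrative, not exhaustive.
ADDRESS_PATTERNS = {
    # number + name(s) + optional comma + optional 2-letter state + 5-digit ZIP
    "US": re.compile(rf"\b\d+\s+{WORDS},?\s+(?:[A-Z]{{2}}\s+)?\d{{5}}\b"),
    # number + name(s) + optional comma + alphanumeric postcode (e.g. SO53 5PD)
    "GB": re.compile(
        rf"\b\d+\s+{WORDS},?\s+[A-Z]{{1,2}}\d[A-Z\d]?\s*\d[A-Z]{{2}}\b"
    ),
    # name fused with street designator + number + 5-digit postcode + city
    "DE": re.compile(
        rf"\b{NAME}(?:straße|strasse|str\.|weg|platz),?\s+\d+\w?,?"
        rf"\s+\d{{5}}\s+{WORDS}"
    ),
}


def guess_countries(text):
    """Return the country keys whose pattern finds a candidate in text."""
    return [cc for cc, pat in ADDRESS_PATTERNS.items() if pat.search(text)]


if __name__ == "__main__":
    for sample in (
        "10 Boulevard Rd Los Angeles, CA 38718",
        "71 Cherry Court Southampton SO53 5PD",
        "Hauptstraße 5, 10115 Berlin",  # made-up German example
    ):
        print(sample, "->", guess_countries(sample))
```

A table like this only produces candidates; the matches would still need to go through libpostal afterwards.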
@shamikbose see above ^^.
Thanks, @ontocord! I will look into it this weekend.
Add a regex for basic potential addresses, such as \d+ followed by \s+, then \w{5,30}, a comma, and another \d+. Then check that there are no stopwords within the \w match, and feed the whole thing to libpostal to check whether it is an address. libpostal will tell us house, road, etc. We need to check that there is a road, etc.; "house" doesn't really tell us anything, as that is almost always caught.
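A minimal sketch of that pipeline, under a few assumptions: the name part is widened from \w{5,30} to [\w .]{5,30} so that multi-word street names can match, the stopword list is a tiny placeholder, and "road" is taken as the one label that must be present. `find_addresses`, `CANDIDATE_RE`, `STOPWORDS` and `REQUIRED_LABELS` are made-up names. The default parser is `parse_address` from the libpostal Python bindings (`postal.parser`), which returns (value, label) pairs; it is imported lazily so a stub can be passed in for testing.

```python
import re

# Assumption: allow spaces and dots inside the name part, so the proposed
# \w{5,30} becomes [\w .]{5,30}.
CANDIDATE_RE = re.compile(r"\b\d+\s+([\w .]{5,30}),\s*\d+\b")

# Placeholder stopword list; a real one would be per language.
STOPWORDS = {"the", "and", "of", "is", "was", "for", "with"}

# "house" is almost always returned, so it is not counted as evidence.
REQUIRED_LABELS = {"road"}


def find_addresses(text, parse=None):
    """Return candidate spans that the parser labels with a road."""
    if parse is None:
        # Requires libpostal and its Python bindings to be installed.
        from postal.parser import parse_address as parse
    hits = []
    for match in CANDIDATE_RE.finditer(text):
        words = match.group(1).lower().split()
        if any(word in STOPWORDS for word in words):
            continue  # looks like running text, not a street name
        labels = {label for _value, label in parse(match.group(0))}
        if REQUIRED_LABELS <= labels:
            hits.append(match.group(0))
    return hits
```

Passing the parser in as an argument also makes it easy to run the stdnum and date tests first and only call libpostal on what is left.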