openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Rooms within a property are causing confusion #661

Open mortonc opened 4 months ago

mortonc commented 4 months ago

Hi!

I noticed that whenever an address has a sub-unit, e.g. "room 3, flat 22, 50 Downing Street, London", the parser labels the room as the house name, which is inaccurate.

For example:


```
[('room 3', 'house'),
 ('flat 22', 'unit'),
 ('50', 'house_number'),
 ('downing street', 'road'),
 ('london', 'city')]
```

Removing the room number resolves this, but I would like to retain that information. Is there a way to parse room numbers or other unit/sub-unit types?

albarrentine commented 4 months ago

That seems like a non-standard edge case, and that address doesn't appear to exist. We handle some types of academic addresses that might have a building and rooms, but have never seen a room within a flat in an organic data set, so it's not generated in the training data. Of course anything's possible in the UK, but if it's only a few addresses I would just regex it out or relabel the house name after parsing. There are legitimate venue names that can be something like "Room 3", so the model's unlikely to be able to distinguish them.
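
For anyone wanting to try the two workarounds mentioned above, here is a minimal sketch of what they might look like. It assumes the parser output is a list of `(value, label)` tuples as in the example in this issue. The helper names `split_room` and `relabel_rooms`, the regex patterns, and the `"room"` label are all hypothetical; `"room"` is not a label the parser itself emits.

```python
import re

# Assumed pattern: "room" followed by a short alphanumeric identifier.
ROOM_RE = re.compile(r"^room\s+\w+$", re.IGNORECASE)
# Assumed pattern: a leading "room N," prefix at the start of the raw address.
LEADING_ROOM_RE = re.compile(r"^\s*(room\s+\w+)\s*,\s*", re.IGNORECASE)


def split_room(address):
    """Option 1: strip a leading room prefix before parsing.

    Returns (room, remaining_address); room is None if no prefix is found.
    """
    match = LEADING_ROOM_RE.match(address)
    if not match:
        return None, address
    return match.group(1), address[match.end():]


def relabel_rooms(parsed, room_label="room"):
    """Option 2: relabel room-like 'house' components after parsing.

    `room_label` is a made-up label used only for this illustration.
    """
    return [
        (value, room_label if label == "house" and ROOM_RE.match(value) else label)
        for value, label in parsed
    ]


# Parser output in the shape shown in this issue.
parsed = [
    ("room 3", "house"),
    ("flat 22", "unit"),
    ("50", "house_number"),
    ("downing street", "road"),
    ("london", "city"),
]

room, rest = split_room("room 3, flat 22, 50 Downing Street, London")
relabeled = relabel_rooms(parsed)
```

Note that either approach will also catch a genuine venue that happens to be named something like "Room 3", so it is probably only safe on data where such names are known not to occur.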