openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

"#number" after street name does not parse as unit number, but as street name. #671

Open arya6000 opened 2 months ago

arya6000 commented 2 months ago

Hi!

I was checking out libpostal, and saw something that could be improved.


My country is

US


Here's how I'm using libpostal

Parsing list of addresses in my city to store in a normalized relational database.


Here's what I did

Parsed the following address "1141 Kendall Town Blvd #3202, Jacksonville, FL 32225"


Here's what I got

house_number: 1141
road: kendall town blvd #3202
city: jacksonville
state: fl
postcode: 32225


Here's what I was expecting

house_number: 1141
unit: #3202
road: kendall town blvd
city: jacksonville
state: fl
postcode: 32225


Changing the order of the fields to "1141 #3202 Kendall Town Blvd, Jacksonville, FL 32225" results in the following:

house_number: 1141 #3202
road: kendall town blvd
city: jacksonville
state: fl
postcode: 32225

But "#3202" should be listed under "unit", not "house_number". However, "1141 apt 3202 Kendall Town Blvd, Jacksonville, FL 32225" produces the correct output:

house_number: 1141
unit: apt 3202
road: kendall town blvd
city: jacksonville
state: fl
postcode: 32225

Is there any minimum form of the address that gets the right parse? Yes, the following results in the correct output:

"1141 apt 3202 Kendall Town Blvd, Jacksonville, FL 32225"


Here's what I think could be improved

If "#" followed by numbers appears before the city, it should be treated as a unit number.
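Until the parser handles this case, one possible stopgap is to post-process the parse result. The sketch below is not part of libpostal: `move_hash_unit` is a hypothetical helper, and it assumes the parse is available as a list of (value, label) pairs, which is the shape the Python bindings' `parse_address` returns. It only moves a trailing "#..." token out of the road value and leaves everything else alone.

```python
import re

# Matches a trailing "#<unit>" token such as "#3202" or "# 12b" at the end of a road value.
_TRAILING_UNIT = re.compile(r"\s*(#\s*\w+)\s*$")


def move_hash_unit(components):
    """Move a trailing '#<unit>' token from the 'road' value into a 'unit' field.

    `components` is a list of (value, label) pairs. The pairs are returned
    unchanged if a 'unit' label is already present or if the road value has
    no trailing '#' token. (Hypothetical helper, not a libpostal API.)
    """
    if any(label == "unit" for _, label in components):
        return list(components)

    fixed = []
    for value, label in components:
        match = _TRAILING_UNIT.search(value) if label == "road" else None
        if match and match.start() > 0:
            # Keep the road text before the '#', then emit the unit right after it.
            fixed.append((value[: match.start()], "road"))
            fixed.append(("".join(match.group(1).split()), "unit"))
        else:
            fixed.append((value, label))
    return fixed


# The parse reported under "Here's what I got", written as (value, label) pairs.
parsed = [
    ("1141", "house_number"),
    ("kendall town blvd #3202", "road"),
    ("jacksonville", "city"),
    ("fl", "state"),
    ("32225", "postcode"),
]
print(move_hash_unit(parsed))
```

For the parse above, this yields road "kendall town blvd" and unit "#3202". It does nothing for the reordered input, where "#3202" ends up inside house_number, so it is a narrow workaround rather than a fix for the model.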

brianmacy commented 2 months ago

Did you try the Senzing provided model?

On Sat, Sep 21, 2024 at 16:12 arya6000 @.***> wrote:

Hi!

I was checking out libpostal, and saw something that could be improved.

My country is

US

Here's how I'm using libpostal

Parsing list of addresses in my city to store in a normalized relational database.

Here's what I did

Parsed the following address "1141 Kendall Town Blvd #3202, Jacksonville, FL 32225"

Here's what I got

house_number: 1141
road: kendall town blvd #3202
city: jacksonville
state: fl
postcode: 32225

Here's what I was expecting

house_number: 1141
unit: #3202
road: kendall town blvd
city: jacksonville
state: fl
postcode: 32225

  • Does the input address exist in OpenStreetMap (https://openstreetmap.org)? No

  • Do all the toponyms exist in OSM (city, state, region names, etc.)? City and state are in OSM

  • If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result? "1141 #3202 Kendall Town Blvd, Jacksonville, FL 32225" results in the following format

house_number: 1141 #3202
road: kendall town blvd
city: jacksonville
state: fl
postcode: 32225

But "#3202" should be listed under "unit", not "house_number". However, "1141 apt 3202 Kendall Town Blvd, Jacksonville, FL 32225" outputs the correct format

house_number: 1141
unit: apt 3202
road: kendall town blvd
city: jacksonville
state: fl
postcode: 32225

  • If the address contains apartment/floor/sub-building information or uncommon formatting, does removing that help? Is there any minimum form of the address that gets the right parse?

Yes, the following results in correct output

"1141 apt 3202 Kendall Town Blvd, Jacksonville, FL 32225"

Here's what I think could be improved

If "#" followed by numbers appears before the city, it should be treated as a unit number.


arya6000 commented 2 months ago

Did you try the Senzing provided model?

I was not aware of Senzing. Are you referring to https://github.com/Senzing/libpostal-data?

brianmacy commented 2 months ago

Yes. If you search the libpostal docs for alternative data models you should see how to enable it.


arya6000 commented 2 months ago

Yes. If you search the libpostal docs for alternative data models you should see how to enable it.

I just tried the Senzing model and it solved the issue. Thanks!