openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

A space in the city confused the parser #675

Closed CamdenParker closed 1 week ago

CamdenParker commented 1 week ago

Hi!

I was checking out libpostal, and saw something that could be improved.


My country is

USA


Here's how I'm using libpostal

I am using it in support of entity resolution amongst a myriad of environmental, health and safety data sources.


Here's what I did:

 ./src/address_parser
Loading models...

Welcome to libpostal's address parser.

Type in any address to parse and print the result.

Special commands:
.exit to quit the program

> 13775 CLARK RD, ROSE MOUNT, MN

Here's what I got:

The parser seems to have extended the street name across the comma rather than recognize a two-word city.

Result:

{
  "house_number": "13775",
  "road": "clark rd rose",
  "city": "mount",
  "state": "mn"
}

Here's what I was expecting:

{
  "house_number": "13775",
  "road": "clark rd",
  "city": "rose mount",
  "state": "mn"
}
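To make the mismatch explicit, here is a minimal sketch that diffs the two parses above field by field. The `diff_parse` helper is hypothetical (it is not part of libpostal); the two dictionaries are copied from the output and the expectation in this report.

```python
def diff_parse(actual, expected):
    """Return {label: (actual_value, expected_value)} for labels that differ."""
    labels = sorted(set(actual) | set(expected))
    return {
        label: (actual.get(label), expected.get(label))
        for label in labels
        if actual.get(label) != expected.get(label)
    }


# Values copied from the parser output and the expected result in this issue.
actual = {"house_number": "13775", "road": "clark rd rose", "city": "mount", "state": "mn"}
expected = {"house_number": "13775", "road": "clark rd", "city": "rose mount", "state": "mn"}

for label, (got, want) in diff_parse(actual, expected).items():
    print(f"{label}: got {got!r}, expected {want!r}")
```

Only `city` and `road` differ: the token "rose" was attached to the road instead of the city.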

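For parsing issues, please answer "yes" or "no" to all that apply.

- Does the input address exist in OpenStreetMap? No
- Do all the toponyms exist in OSM (city, state, region names, etc.)?
- If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result?
- If the address does not contain city, region, etc., does adding those fields to the input improve the result? No

13775 CLARK RD, ROSE MOUNT, MN, USA

Result:

{ "house_number": "13775", "road": "clark rd rose", "city": "mount", "state": "mn", "country": "usa" }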

13775 CLARK RD, ROSE MOUNT, MN 55068, USA

Result:

{ "house_number": "13775", "road": "clark rd rose", "city": "mount", "state": "mn", "postcode": "55068", "country": "usa" }

13775 CLARK RD, ROSE MOUNT, MN 55068

Result:

{ "house_number": "13775", "road": "clark rd rose", "city": "mount", "state": "mn", "postcode": "55068" }

- If the address contains apartment/floor/sub-building information or uncommon formatting, does removing that help? Is there any minimum form of the address that gets the right parse?

13775 CLARK RD, ROSEMOUNT, MN

Result:

{ "house_number": "13775", "road": "clark rd", "city": "rosemount", "state": "mn" }
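A rough sketch of how several input variants could be checked in one pass. It assumes, for this address layout only, that the second comma-separated field of the input is the city; `check_variants` and `city_from_input` are hypothetical helpers, and the `parse` argument stands in for whatever calls the parser. Here it simply replays the outputs recorded in this report.

```python
# Parser outputs copied from this report; a real run would call the parser instead.
RECORDED = {
    "13775 CLARK RD, ROSE MOUNT, MN": {
        "house_number": "13775", "road": "clark rd rose", "city": "mount", "state": "mn",
    },
    "13775 CLARK RD, ROSE MOUNT, MN 55068": {
        "house_number": "13775", "road": "clark rd rose", "city": "mount",
        "state": "mn", "postcode": "55068",
    },
    "13775 CLARK RD, ROSEMOUNT, MN": {
        "house_number": "13775", "road": "clark rd", "city": "rosemount", "state": "mn",
    },
}


def city_from_input(address):
    """Assumption: in 'street, city, state' input the second field is the city."""
    return address.split(",")[1].strip().lower()


def check_variants(variants, parse):
    """Map each variant to True if the parsed city matches the input's city field."""
    return {addr: parse(addr).get("city") == city_from_input(addr) for addr in variants}


results = check_variants(RECORDED, RECORDED.__getitem__)
for addr, ok in results.items():
    print("OK  " if ok else "FAIL", addr)
```

With the recorded outputs, only the single-word "ROSEMOUNT" form passes.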


Here's what I think could be improved:

Maybe the training data had some edge cases where part of a street name came after a comma? I don't really know.

brianmacy commented 1 week ago

Have you tried this with the Senzing-provided data model?

CamdenParker commented 1 week ago

Worked like a charm. Thank you sir

brianmacy commented 1 week ago

Great to hear. If you're interested in any help with ER, let me know :)
