ramanshahdatascience / tshirts

The Bayesian t-shirts: a taste of optimal inventory
BSD 3-Clause "New" or "Revised" License

Maintainable international address parsing #4

Closed ramanshah closed 2 years ago

ramanshah commented 2 years ago

Currently, I'm converting free-form US addresses to Shippo data using mucky regexes and a whole mess of if-else logic. I'm presently adding UK address parsing to this, which grows the complexity. This won't stay maintainable once a third country comes in.

The parsing logic in a human's head does a bunch of lifting:

It would be overkill but a learning project to graduate from this to a principled parsing of an address string in the inventory spreadsheet into a row of Shippo CSV. (That's on brand: this whole repo is overkill but a learning project!)

Some resources:

https://tomassetti.me/parsing-in-python/

ramanshah commented 2 years ago

At the cost of complexity (separately building a C library) and bulk (the library seems to involve a 1.8 GB machine learning model), the canonical solution might be https://github.com/openvenues/pypostal, which implements a globally tested address parser using NLP techniques.

ramanshah commented 2 years ago

I've added UK and IE by now, and will soon add FR. The little abstraction I built for this feels adequate.

ramanshah commented 2 years ago

Reopening - French addresses (for which the postal code lives before the city) don't hold up to the brittle tree I've built, and I'm starting to think I need fancier methods, like a bona fide tokenizer and parser.
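
For illustration, here is a minimal sketch of what a tokenizer plus a tiny per-country parser for the locality line might look like. It is not the code in this repo: the names `Token`, `POSTCODE`, `LOCALITY_ORDER`, `tokenize`, and `parse_locality` are hypothetical, the postal code patterns are deliberately simplified assumptions, and only the last line of an address is handled. The point is that field order (postal code before or after the city) becomes data rather than another branch of if-else logic.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str  # "POSTCODE", "STATE", or "WORD"
    text: str


# Assumption: simplified postal code patterns, not full national formats.
POSTCODE = {
    "US": r"\d{5}(?:-\d{4})?\b",
    "FR": r"\d{5}\b",
}

# Assumption: the order of fields on the locality line, per country.
LOCALITY_ORDER = {
    "US": ("city", "state", "postcode"),
    "FR": ("postcode", "city"),
}


def tokenize(line, country):
    """Split a locality line into typed tokens, dropping spaces and commas."""
    spec = [("POSTCODE", POSTCODE[country])]
    if country == "US":
        spec.append(("STATE", r"[A-Z]{2}\b"))  # two-letter state code
    spec.append(("WORD", r"[^\W\d_][\w'.-]*"))  # starts with a letter
    spec.append(("SKIP", r"[\s,]+"))
    pattern = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in spec))

    tokens, pos = [], 0
    while pos < len(line):
        m = pattern.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected text at position {pos}: {line[pos:]!r}")
        if m.lastgroup != "SKIP":
            tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse_locality(line, country):
    """Consume tokens in the field order declared for the country."""
    tokens = tokenize(line, country)
    fields, i = {}, 0
    for field in LOCALITY_ORDER[country]:
        if field == "city":
            # A city is one or more consecutive WORD tokens.
            words = []
            while i < len(tokens) and tokens[i].kind == "WORD":
                words.append(tokens[i].text)
                i += 1
            if not words:
                raise ValueError(f"expected a city in {line!r}")
            fields["city"] = " ".join(words)
        else:
            kind = field.upper()
            if i >= len(tokens) or tokens[i].kind != kind:
                raise ValueError(f"expected {kind} in {line!r}")
            fields[field] = tokens[i].text
            i += 1
    if i != len(tokens):
        raise ValueError(f"trailing tokens in {line!r}")
    return fields
```

With this shape, `parse_locality("75008 Paris", "FR")` and `parse_locality("Providence, RI 02903", "US")` go through the same code path, and a new country would mostly mean new entries in the two tables. A line that doesn't match the declared order raises `ValueError` rather than being silently misparsed.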