Closed ramanshah closed 2 years ago
At the cost of complexity (separately building a C library) and bulk (the library seems to involve a 1.8 GB machine learning model), the canonical solution might be https://github.com/openvenues/pypostal, which implements a globally tested address parser using NLP techniques.
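For a sense of what adopting pypostal would look like, here is a minimal sketch. It assumes libpostal and the `postal` package are already installed. `parse_address` is pypostal's parser entry point and returns `(value, label)` pairs. The helper names `components_to_fields` and `parse_with_libpostal` are hypothetical, and the sample pairs in the usage note are typed by hand in that shape, not captured from a real run.

```python
from collections import defaultdict


def components_to_fields(components):
    """Collapse (value, label) pairs into a {label: value} dict.

    Values that share a label are joined with a space, in input order.
    """
    grouped = defaultdict(list)
    for value, label in components:
        grouped[label].append(value)
    return {label: " ".join(values) for label, values in grouped.items()}


def parse_with_libpostal(address):
    """Parse a free-form address with pypostal (hypothetical wrapper).

    The import is deferred so this module still loads on machines
    where the libpostal C library has not been built.
    """
    from postal.parser import parse_address

    return components_to_fields(parse_address(address))
```

With pairs such as `[("781", "house_number"), ("franklin ave", "road"), ("brooklyn", "city"), ("11216", "postcode")]`, `components_to_fields` yields one flat dict that would still need mapping onto the Shippo columns.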
I've added UK and IE by now, and will soon add FR. The little abstraction I built for this feels adequate.
Reopening - the brittle decision tree I've built doesn't hold up to French addresses (in which the postal code comes before the city), and I'm starting to think I need fancier methods, like a bona fide tokenizer and parser.
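A minimal sketch of what "tokenizer and parser" could mean here, limited to the city line of an address. Everything in it is an assumption for illustration: the names `Token`, `tokenize`, and `parse_locality`, the simplified postal code patterns, and the output keys, which are not Shippo's actual column names. The point is only that once tokens are classified, the US and FR orderings become two small rules instead of nested if-else branches.

```python
import re
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    kind: str  # "POSTCODE", "WORD", or "COMMA"
    text: str


# Simplified, hypothetical patterns; real postal codes have more edge cases.
POSTCODE_PATTERNS = {
    "US": re.compile(r"\d{5}(?:-\d{4})?"),
    "FR": re.compile(r"\d{5}"),
}


def tokenize(line: str, country: str) -> List[Token]:
    """Split a line on whitespace and commas, then classify each piece."""
    postcode = POSTCODE_PATTERNS[country]
    tokens = []
    for raw in re.findall(r"[^\s,]+|,", line):
        if raw == ",":
            kind = "COMMA"
        elif postcode.fullmatch(raw):
            kind = "POSTCODE"
        else:
            kind = "WORD"
        tokens.append(Token(kind, raw))
    return tokens


def parse_locality(line: str, country: str) -> Dict[str, str]:
    """Parse the city line: 'CITY, STATE ZIP' for US, 'ZIP CITY' for FR."""
    tokens = [t for t in tokenize(line, country) if t.kind != "COMMA"]
    kinds = [t.kind for t in tokens]
    if kinds.count("POSTCODE") != 1:
        raise ValueError(f"expected exactly one postal code in {line!r}")
    if country == "FR":
        # Postal code first, then one or more city words.
        if kinds[0] != "POSTCODE" or len(tokens) < 2:
            raise ValueError(f"not a French city line: {line!r}")
        return {"zip": tokens[0].text,
                "city": " ".join(t.text for t in tokens[1:])}
    # US: city words, then state, then postal code last.
    if kinds[-1] != "POSTCODE" or len(tokens) < 3:
        raise ValueError(f"not a US city line: {line!r}")
    return {"city": " ".join(t.text for t in tokens[:-2]),
            "state": tokens[-2].text,
            "zip": tokens[-1].text}
```

For example, `parse_locality("75008 Paris", "FR")` and `parse_locality("New York, NY 10001", "US")` both go through the same tokenizer and differ only in the ordering rule. Adding a country means adding a pattern and a rule, though UK and IE postcodes span multiple tokens and would need a smarter tokenizer than this one.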
Currently, I'm converting free-form US addresses to Shippo data using mucky regexes and a whole mess of if-else logic. I'm presently adding UK address parsing to this, growing the complexity. This won't be maintainable as a third country comes in.
The parsing logic in a human's head does a bunch of lifting:
It would be overkill, but a good learning project, to graduate from this to principled parsing that turns an address string in the inventory spreadsheet into a row of Shippo CSV. (That's on brand: this whole repo is overkill but a learning project!)
Some resources:
https://tomassetti.me/parsing-in-python/