openvenues / pypostal

Python bindings to libpostal for fast international address parsing/normalization
MIT License

merger with usaddress #16

Open fgregg opened 7 years ago

fgregg commented 7 years ago

hi @thatdatabaseguy ,

At @datamade, we are increasingly in need of a multinational version of usaddress. Now that libpostal has moved to a CRF model, it seems a little silly not to try to combine our efforts. Before we can do that, we need:

Are these things you would consider? @datamade would do the necessary work to make these happen.

albarrentine commented 7 years ago

Hey @fgregg, that would be an honor! Definitely a fan of @datamade's work around e.g. policing, housing, and racial justice (a set of topics very close to my heart).

No objections to the above from me. As to the specific points:

  1. A pip-only installation would be amazing. I'd liken the way it currently works to, say, lxml requiring libxml2 to be installed separately. Since libpostal has many bindings, we don't bundle it with any of them, though in general it should be rare that people use libpostal with multiple language bindings (I can see maybe Python + the Postgres extension).

    The main concern we have that most other libraries don't is the size of the model downloads. The global parser model currently takes up ~1.8 GB of disk/memory, so it's often desirable to specify the location rather than rely on a potentially unsuitable default (the default datadir in Autotools is something like /usr/local/share, but in a setting like AWS, where root volumes are relatively small, it's often better to use a mounted EBS data volume). It should be fine for the pip install to trigger an install of libpostal, but it would still be preferable for the Python binding to play nice with a system installation if the user wants that, especially considering the hefty downloads, and being mindful that not all of our users have access to cheap broadband connections or unrestricted downloads.

    It would definitely be possible to create Debian/Red Hat/Homebrew packages. We don't currently support Windows, but I've removed the problematic dependencies, and one user reported getting it working here: https://github.com/BenK10/libpostal_windows. Would ideally like to get compilation working with Visual Studio and put up an AppVeyor build if Windows support is to be a thing. I personally don't have any Windows machines available at the moment. In any case, if you guys are willing to help with packaging, that would be awesome! I've been working mostly on core thus far and for installation have just assumed that a standard Autotools source build (configure/make/make install) is familiar enough to most *nix users.

  2. An extended tagset should work without too many changes. The format we use can handle arbitrary tags, especially for the street-level/sub-building components (though it's best to use our tags, or a subset of them, for admin components like city/state, etc., which should be compatible with what usaddress uses currently). It might even be possible to derive some OSM-scale training data for the US specifically using the tiger:name* tags in OSM, but that's obviously only available in the US. The tagset for libpostal is sort of the lowest common denominator that works internationally.

    We do have relatively comprehensive gazetteers of the street types/directionals, etc. for various languages, but many of those languages have their own variations of the "E St" problem (does it mean "East" or just the letter "E"?). It may also be possible to use some of the sources from OpenAddresses, which often split out the directional/street type fields as well, including in some countries outside the US. They also have more open licensing than OSM, usually CC-BY or similar.
  3. The training data itself would be relatively easy to port. However, there might be a few assumptions our model makes that would not be as suitable for training on smaller data sets. My goal wasn't really to make a learning framework for address parsing; in fact, I usually refer people to usaddress if they want to train their own :). Libpostal was originally more about query analysis, i.e. "stick this in front of your geocoder and internationalize all of the things."

    Nothing about that's incompatible with your use case per se, but our model is definitely built with 10M+ training examples in mind, and there might have to be some tweakable knobs added to feature extraction and the precomputed indices to get it working well on e.g. 10k examples from a user's CSV. A few hyperparameters have been added recently for smaller experiments, including the number of iterations and a parameter that controls model sparsity. Smaller training sets are also more feasible with the Conditional Random Field, as it tends to converge to something reasonable much faster than the greedy averaged perceptron used previously.

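To make the datadir concern in point 1 concrete, here is a sketch of a source build that points the model data at a mounted volume instead of the Autotools default (the path is illustrative; the canonical steps are in the libpostal README):

```shell
# Build libpostal from source, directing the ~1.8 GB model download
# to a large mounted volume instead of the /usr/local/share default.
./bootstrap.sh
./configure --datadir=/mnt/data/libpostal   # illustrative path
make -j4
sudo make install
sudo ldconfig   # refresh the linker cache on Linux

# With the C library in place, the Python binding installs from PyPI:
pip install postal
```

This is the "system installation" route the comment describes; a pip-triggered build would need to automate roughly these same steps.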
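One way to make the extended-tagset idea in point 2 concrete is a small compatibility layer that relabels libpostal's international tags with usaddress-style labels. A minimal sketch, assuming pypostal's `(value, label)` output shape; the specific mapping choices here are illustrative, not a settled spec:

```python
# Illustrative mapping from libpostal parser labels to usaddress-style
# labels. Admin-level tags (city/state/postcode) line up fairly directly;
# street-level components would need finer-grained usaddress tags than
# libpostal's lowest-common-denominator "road" tag provides.
LIBPOSTAL_TO_USADDRESS = {
    "house_number": "AddressNumber",
    "road": "StreetName",
    "unit": "OccupancyIdentifier",
    "city": "PlaceName",
    "state": "StateName",
    "postcode": "ZipCode",
}

def convert_labels(parsed):
    """Relabel (value, libpostal_label) pairs, passing unknown tags through."""
    return [(value, LIBPOSTAL_TO_USADDRESS.get(label, label))
            for value, label in parsed]

# Example input in the shape pypostal's parse_address returns:
example = [("123", "house_number"), ("main st", "road"),
           ("chicago", "city"), ("il", "state"), ("60601", "postcode")]
print(convert_labels(example))
```

The pass-through default matters: libpostal emits tags (e.g. `suburb`, `level`) with no obvious usaddress counterpart, and dropping them silently would lose information.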
stdavis commented 3 years ago

Did anything ever happen with this? We are currently trying to decide between pypostal and usaddress. We need good Windows support, which seems to be a struggle with pypostal, but usaddress doesn't seem to have much activity (maybe a dying project?).

nchammas commented 3 years ago

FYI, I have done some work toward making pypostal installable with just pip over in #76. I am not sure what direction the project maintainers want to take this, or whether there is even energy for a major packaging change at this stage in the project's lifecycle.

NickCrews commented 2 years ago

Throwing in my 2 cents here in case anyone feels like moving this forward: I like how spacy handles downloading/installing its pre-trained models.
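For reference, the spacy pattern works roughly like this: the base library installs without any model data, and each pre-trained model is a separate pip-installable package fetched on demand by a CLI subcommand. The commands below are spacy's own, shown only to illustrate the pattern, not anything pypostal currently supports:

```shell
# The base library installs small, with no model data bundled:
pip install spacy

# Models are fetched on demand as separate pip-installable packages:
python -m spacy download en_core_web_sm

# Afterwards the model loads by name:
python -c "import spacy; spacy.load('en_core_web_sm')"
```

The appeal for libpostal would be keeping the multi-gigabyte parser data out of the base package while still making the download a one-liner.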