openvenues / pypostal

Python bindings to libpostal for fast international address parsing/normalization
MIT License

One-line pip install of libpostal & pypostal #29

Open ulope opened 6 years ago

ulope commented 6 years ago

The Python package can't be installed unless libpostal has already been installed separately on the system. That is quite unexpected for a Python package wrapping a C library.

albarrentine commented 6 years ago

While it's not ideal, it's fairly common for Python packages to have system-wide requirements. For instance, many of the Python GIS packages (Shapely, Rtree, etc.) depend on system libraries that are not bundled and which the user may already have installed for Postgres or something else. Same is true for some of the C-based database bindings which may require libmysqlclient, etc.

For libraries that are intended to be used only with Python, it's more common to bundle the C lib, but that's not the case for libpostal. We also have bindings to Go, Node, Ruby, Java, PHP, R, etc. There's even a Postgres extension.

Libpostal is a bit different from most packages because it features a production-grade, trained machine learning model that takes up about 1.8GB of space at present, which is a lot more than people are accustomed to downloading when installing a package. Because of the heavier-than-usual space requirement, and the fact that many people are using this on AWS, containers, VMs, etc., there's not necessarily a sensible default for where the datadir should go (on AWS machines, the default used in Autotools, "/usr/local/share", might take up valuable space on a root volume). Making the Python library fully pip-installable would, I think, involve producing wheels for the various platforms with the compiled libpostal binaries and the models.

The datadir in libpostal is currently set at configure/compile time. However, all of libpostal's setup functions (called once at import time) have *_setup_datadir variants which allow passing in a directory at runtime. As such, it should be possible to add wheel distributions that bundle libpostal and the libpostal_data script, which downloads the data files (and gets installed in e.g. /usr/local/bin by default), configure a datadir on the Python side, and then use the configured datadir at import time. At minimum though, the default behavior would need to check for an existing libpostal installation so the user doesn't inadvertently download the model twice if they already have libpostal or one of the other bindings installed.
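
To illustrate the "check for an existing installation" step, here is a minimal sketch rather than anything that exists in pypostal today. The helper name `find_existing_datadir` and the candidate paths are hypothetical; only "/usr/local/share" comes from the comment above, and the `libpostal` subdirectory is an assumption.

```python
import os
from typing import Iterable, Optional

# Hypothetical candidate locations. Only "/usr/local/share" is mentioned in
# the discussion; the "libpostal" subdirectory is an assumption.
DEFAULT_CANDIDATES = ("/usr/local/share/libpostal", "/usr/share/libpostal")


def find_existing_datadir(
    candidates: Iterable[str] = DEFAULT_CANDIDATES,
) -> Optional[str]:
    """Return the first candidate that is a readable, non-empty directory."""
    for path in candidates:
        try:
            if os.path.isdir(path) and os.listdir(path):
                return path
        except OSError:
            # Unreadable directory: treat it as "not usable" and keep looking.
            continue
    return None
```

A wheel-based install could call something like this first and only fall back to downloading the models when it returns `None`.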

Happy to accept pull requests as long as they take into account our various requirements.

ynouri commented 5 years ago

How would the datadir be configured on the Python side? If we take the parser for example, should we make a call to libpostal_setup_parser_datadir(char *datadir) in the init_parser function, with datadir read from an environment variable?

https://github.com/openvenues/pypostal/blob/master/postal/pyparser.c#L175
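
One way the environment-variable idea could look on the Python side is sketched below. This is only an assumption-laden illustration: the variable name `LIBPOSTAL_DATA_DIR` and the helper `resolve_datadir` are hypothetical, and the sketch stops at resolving the path; passing it on to libpostal_setup_parser_datadir would still have to happen in the C extension.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name; nothing in this thread fixes the actual name.
ENV_VAR = "LIBPOSTAL_DATA_DIR"


def resolve_datadir(environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the datadir requested via the environment, or None if unset."""
    env = os.environ if environ is None else environ
    value = env.get(ENV_VAR, "").strip()
    if not value:
        # Unset or blank: the caller should fall back to the compiled-in default.
        return None
    # Expand "~" and normalise to an absolute path before handing it to C.
    return os.path.abspath(os.path.expanduser(value))
```

Returning `None` when the variable is unset keeps the current behavior (the compile-time datadir) as the default.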

adriangb commented 2 years ago

Maybe you can take the approach that packages like TensorFlow use to load pretrained models? Have a function that initiates the download or something like that. I do think it would be nice to distribute prebuilt binaries.
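
A rough sketch of what such an explicit download function might look like, under stated assumptions: `ensure_models` and the marker file name are hypothetical, and the actual download step is passed in as a callable (for example a wrapper around the libpostal_data script) rather than implemented here.

```python
import os
from typing import Callable

# Hypothetical marker file recording that a download finished successfully.
MARKER = ".download_complete"


def ensure_models(datadir: str, download: Callable[[str], None]) -> bool:
    """Run `download(datadir)` once; return True only if a download ran."""
    marker = os.path.join(datadir, MARKER)
    if os.path.exists(marker):
        # Models were already fetched into this datadir; skip the large download.
        return False
    os.makedirs(datadir, exist_ok=True)
    download(datadir)
    # Write the marker only after the download returned without raising.
    with open(marker, "w") as fh:
        fh.write("ok\n")
    return True
```

Calling it twice with the same directory triggers the download only the first time, which addresses the concern about fetching the model twice.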