ulope opened this issue 6 years ago
While it's not ideal, it's fairly common for Python packages to have system-wide requirements. For instance, many of the Python GIS packages (Shapely, Rtree, etc.) depend on system libraries that are not bundled and which the user may already have installed for Postgres or something else. Same is true for some of the C-based database bindings which may require libmysqlclient, etc.
For libraries that are intended to be used only with Python, it's more common to bundle the C lib, but that's not the case for libpostal. We also have bindings to Go, Node, Ruby, Java, PHP, R, etc. There's even a Postgres extension.
Libpostal is a bit different from most packages because it ships a production-grade, trained machine learning model that currently takes up about 1.8GB, which is far more than people are accustomed to downloading when installing a package. Because of the heavier-than-usual space requirement, and the fact that many people run this on AWS, containers, VMs, etc., there's not necessarily a sensible default for where the datadir should go (on AWS machines, the Autotools default of "/usr/local/share" might take up valuable space on a root volume). Making the Python library fully pip-installable would, I think, involve producing wheels for the various platforms with the compiled libpostal binaries and the models.
The datadir in libpostal is currently set at configure/compile time. However, all of libpostal's setup functions (called once at import time) have *_setup_datadir variants which allow passing in a directory at runtime. As such, it should be possible to ship wheel distributions that bundle libpostal and the libpostal_data script which downloads the data files (and gets installed in e.g. /usr/local/bin by default), configure a datadir on the Python side, and then use the configured datadir at import time (see the sketch below). At minimum though, the default behavior would need to check for an existing libpostal installation so the user doesn't inadvertently download the model twice if they already have libpostal or one of the other bindings installed.
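To make that concrete, here's a rough sketch of what the import-time resolution logic could look like on the Python side. The environment variable name, the search paths, and the helper itself are all hypothetical, not part of the current pypostal API:

```python
import os

# Hypothetical search paths; /usr/local/share is the Autotools
# default prefix for a stock `make install`.
CANDIDATE_DATA_DIRS = [
    "/usr/local/share/libpostal",
    "/usr/share/libpostal",
]

def resolve_datadir():
    # 1. Explicit override, e.g. for AWS users who want the ~1.8GB
    #    model on a non-root volume.
    datadir = os.environ.get("LIBPOSTAL_DATA_DIR")
    if datadir:
        return datadir
    # 2. Reuse an existing libpostal installation's data so users of
    #    the C library or the other bindings don't end up downloading
    #    the model a second time.
    for candidate in CANDIDATE_DATA_DIRS:
        if os.path.isdir(candidate):
            return candidate
    # 3. No existing installation found: fall back to a per-user
    #    location and let the bundled libpostal_data script populate it.
    return os.path.expanduser("~/.cache/libpostal")
```

The resolved directory would then be handed to the *_setup_datadir variants when the extension modules are initialized.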
Happy to accept pull requests as long as they take into account our various requirements.
How would the datadir be configured on the Python side? If we take the parser for example, should we make a call to libpostal_setup_parser_datadir(char *datadir) in the init_parser function, with datadir read from an environment variable?
https://github.com/openvenues/pypostal/blob/master/postal/pyparser.c#L175
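From the user's perspective that would look something like this (the variable name is just for illustration; nothing in pypostal reads it today):

```python
import os

# Hypothetical: point libpostal at a datadir on a non-root volume.
os.environ["LIBPOSTAL_DATA_DIR"] = "/mnt/data/libpostal"

# With the proposed change, init_parser would read the variable and
# call libpostal_setup_parser_datadir() during import, instead of
# plain libpostal_setup_parser() with the compiled-in default.
from postal.parser import parse_address

print(parse_address("781 Franklin Ave Crown Heights Brooklyn NY"))
```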
Maybe you can take the approach that packages like TensorFlow use to load pretrained models? Have a function that initiates the download, or something like that. I do think it would be nice to distribute prebuilt binaries.
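For example, something like this (a sketch; it assumes the libpostal_data script that ships with libpostal is on the PATH):

```python
import subprocess

def download_data(datadir):
    # Analogous to TensorFlow fetching pretrained weights on demand:
    # invoke libpostal's own libpostal_data script to download the
    # ~1.8GB model files into the given directory.
    subprocess.run(["libpostal_data", "download", "all", datadir], check=True)
```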
The Python package isn't installable if libpostal isn't separately installed on the system beforehand. That is quite unexpected for a Python package wrapping a C library.