jermp / tongrams

A C++ library providing fast language model queries in compressed space.
MIT License
128 stars 20 forks source link

Lot of fixes, both in CMake and in python packaging #17

Closed KOLANICH closed 3 years ago

KOLANICH commented 3 years ago

If you don't like any of the commits, it'd be faster if you landed the ones you like first than delaying landing of whole bunch because of discussing a single commkt. I'd rebase the branch upon master.

jermp commented 3 years ago

Thanks for your interest in Tongrams! A few observations your commits:

jermp commented 3 years ago
  • I am not a python expert. How do your commits improve upon the current python wrapper? In other words, do we gain something at all (apart from moving things around)?

For example, would then be possible to just do: pip install tongrams after you commits? Thanks. -G

KOLANICH commented 3 years ago

I don't like relying on external sources for CMake configuration for simple things like compiling with sanitizers.

Would it be OK if I just remove the submodules and just add their contents as regular dirs? (Please note that this way is much less convenient to upgrade and isn't very different from using submodules).

I think a couple of compiler flags are more than enough.

Compiler flags are implementation details of compilers. Using them directly would either lock your project only to the compilers supporting them, or would garbage the source of this project with the code that currently lives within the modules.

How do your commits improve upon the current python wrapper?

Currently the commits in this PR don't improve the wrapper and the library themselves. Currently they only improve building and packaging process. In fact, not fully: proper fix would require changes in some of the dependencies used as a submodule (removal of the hardcoded flag that breaks build on my machine).

In other words, do we gain something at all (apart from moving things around)?

No, it is just moving the things around:

  1. Moving metadata into the declarative config (setup.cfg) currently doesn't bring any large benefits (it brings a minor benefit of the declarative config being machine-readable and -editable) of improved security (we still have to have setup.py, since the declarative config doesn't support cexts), but it is a step into that direction.
  2. just adding pyproject.toml shouldn't cause any big difference. It'd just cause setuptools and pip to build in PEP 517 mode, if they support it (if don't (extremily old versions), they'd work as they used to work). But pyproject.toml seems like a consensual way for declarative metadata, increasing number of setuptools plugins allow storing their metadata there.
  3. Among such tools is setuptools_scm. It fetches the info about versions (both tag and commit hash) from version control system history and writes it into metadata. Git is supported. It is useful, because a. it follows DRY principle, there is a single place from which version info originates; b. it records commit hash into version, so in the case of troubleshooting issues with pre-release versions a user can type pip list | grep tongrams and get the commit hash of the version he has installed, even if he has deleted the original repo. I.e. it may be useful for git bisect.

For example, would then be possible to just do: pip install tongrams after you commits?

Should be, but I haven't tested that (properly testing that would require setting up an own private PyPI instance). I usually stay away of PyPI and install python packages by git clone ...., then python3 -m build -nwx . (-n disables build isolation - a cargo-cult within python community, -w builds a wheel, -x disables build's own dependency checking and installation, because I manage all the dependencies manually and don't want build to try to install anything without my explicit decision to install that) to build a wheel (a binary package), then sudo pip3 install --upgrade ./dist/*.whl to install the wheel.

Important note - users' pips usually should download binary wheels from PyPI, not sdists. If they download sdists and build them, it means there is no binary packages on PyPI that suits them.

jermp commented 3 years ago

Thank you very much for your detailed answer. I will then read carefully all the comments and get back to you. For the moment, what I would like the most is to let Tongrams be as much easy as possible to use for Python users. To do so, we could work on making it possible to install Tongrams via pip as I mentioned earlier without worrying about C++ and dependencies. Do you think it is possible?

KOLANICH commented 3 years ago

To do so, we could work on making it possible to install Tongrams via pip as I mentioned earlier without worrying about C++ and dependencies. Do you think it is possible?

Only as prebuilt wheels, I guess. It is possible to automate building the cext from sdists using https://github.com/raydouglass/cmake_setuptools, but ... installing Boost will still be the major inconvenience. Again, it may be possible to even fetch the latest version of Boost fully automatically .... but it is likely it would be slow and redundant.

So, the best way is to ship prebuilt wheels.

Also, it may make sense to refactor the module the following way:

  1. build a shared lib
  2. use the lib using ctypes.

This would make it more compatible with python implementations other than cpython and also would eliminate the need of rebuilding of the extension for each python version.

jermp commented 3 years ago

Many thanks @KOLANICH ! I can confirm that now it is possible to install the library via pip using a pre-built wheel. I have tested it using pypi.test (https://test.pypi.org/project/tongrams/) and a virtual env. Next, I am going to use the "real" pypi :)