kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/
Other
2.5k stars 513 forks source link

Add standalone lib build with Python install #417

Closed jacobkahn closed 1 year ago

jacobkahn commented 1 year ago

Augments the Python build to also ship C++ library builds of KenLM that are dynamically usable by other C++ libs. Specifically, pip install . in the repo (or pip install git+https://github.com/kpu/kenlm, as it'll likely be most often used) now installs:

In particular, this PR:

jacobkahn commented 1 year ago

@kpu this should drastically reduce the number of "how do I install KenLM, I'm trying to do X" issues given that a lot of those downstream needs are coming from Python packages requiring it. Hope to merge this shortly -- let me know if you have any questions.

ZJaume commented 1 year ago

@jacobkahn had you tested non-default max order like --install-option="--max_order 7" in the python installation? Build is now taking terrribly long for me. Also people is complaining about not being compiled correctly, but I'm not even able to check because it takes a lot of time to build.

May I suggest the use of scikit-build?

jacobkahn commented 1 year ago

@ZJaume thanks for letting me know. This change doesn't affect the default Python library build at all, so all of those build options should stay exactly the same... I'll test those again now and see what's going on.

With respect to build speed, this quite literally performs the identical build twice once using the CMake version in order to produce dylib that C++ binaries can consume, so I'm not sure why build speed would suffer so significantly.

I'll spend some time on fixing this right now given those issues; I'll give scikit build a try as well. I'll tag you in the fix.

jacobkahn commented 1 year ago

@ZJaume — after further investigation, the increase in build time is due to the confluence of multiple things:

  1. The use of pyproject.toml, as stipulated in PEP 517 and PEP 518, means builds occur in an isolated environment. This means that build dependencies need to be installed, which will increase the build time slightly as wheels for CMake and setuptools need to be downloaded. Isolated builds are the new standard in Python/pip, and other approaches to building will be deprecated. See related documentation.
  2. Configuring via argv/--install-option for pip builds is behavior that precludes using caching of build dependencies (https://github.com/pypa/pip/issues/11358), and is also not recommended. I was able to reproduce a significant build slowdown when using those options waiting for build dependencies to be installed.

In https://github.com/kpu/kenlm/pull/420, I've changed the behavior to read from a MAX_ORDER environment variable, which is a simpler alternative to --install-option-based configuration. I've updated the documentation accordingly, which didn't document the --max-order install option as it was. If this breaks downstream dependencies, they should swap to using a PEP 517-compliant way of building and installing KenLM via pip. Let me know if you have any other problems here if you decide to switch away from the old commit and onto master.

I explored scikit-build briefly (https://github.com/jacobkahn/kenlm/tree/scikit_build_python), but had some problems. We can revisit this down the line.

cc @kpu