rapidfuzz / Levenshtein

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
https://rapidfuzz.github.io/Levenshtein
GNU General Public License v2.0
267 stars 16 forks source link

Migration to `scikit-build-core` #65

Open LecrisUT opened 2 months ago

LecrisUT commented 2 months ago

It is recommended that projects migrate from scikit-build to scikit-build-core when possible. Most development is focused there and it is a very CMake friendly and pythonic build system. For now I only wanted to create an issue to get your opinion on this and track the progress of such migration. If you have questions or need help, feel free to ask.

maxbachmann commented 2 months ago

I am aware of the scikit-build-core project and long term I will want to migrate to it. However I wanted to wait until it supports all the features required to migrate https://github.com/rapidfuzz/RapidFuzz.

I did have a look at this last year when development on scikit-build-core started. I think the only required feature that's still missing is https://github.com/scikit-build/scikit-build-core/issues/112 which is required for the pure Python fallback.

LecrisUT commented 2 months ago

Thanks for clarifying the blocking feature. I am not sure I understand what is meant with run_setup(true/false). Is it to disable the cmake build part? Can you walk me through how you use that?

Have you also checked the Overrides feature? Might be close to what you need?

I can also share my project for design reference. It's basically split in CMake subprojects so that it can build as a monolith, or individual components, like the python interface only.

maxbachmann commented 2 months ago

https://github.com/rapidfuzz/RapidFuzz/blob/main/setup.py is the setup script

This behaves in the following way: 1) when packaging it should always build the C-Extension and fail hard if it fails to build since that's either a bug or an error in the build environment. This is done by detecting the build environment based on some environment variables. As mentioned in the scikit-build-core issue this part could be achieved using overrides

2) when building outside of a packaging environment it should attempt to build the C-Extension but fall back to a pure Python implementation if the installation fails and print a warning. This doesn't affect most users, since I do provide wheels for most platform

I don't believe 2) can be implemented using scikit-build-core so far.

In addition one pre-requirement for 2) is that cmake and ninja are only installed using pip if:

I currently implement this pre-requirement in https://github.com/rapidfuzz/RapidFuzz/blob/main/_custom_build/backend.py. Not sure whether this is already supported by scikit-build-core. I know @henryiii did add at least some detection logic for available wheels at some point as well: https://github.com/scikit-build/scikit-build-core/blob/main/src/scikit_build_core/resources/known_wheels.toml.

LecrisUT commented 2 months ago

I don't believe 2) can be implemented using scikit-build-core so far.

Couldn't this be controlled within CMake, e.g. adding an advanced option ALLOW_FALLBACK (normally I insist projects to add namespaced options, but for pip installed projects only I concur). This can then be controlled via overrides, although if it's strictly limited to PyPI packaging, I would just add appropriate --config-settings option to pip. One great point of scikit-build-core design is that it is very close to native CMake and Python so anything can be designed with enough CMake grease.

In addition one pre-requirement for 2) is that cmake and ninja are only installed using pip if:

  • there is no sufficient version installed on the system
  • there is a wheel available

The default is to try the system first and then fallback to pip. Version checking is implemented. Changing the preference order would be tricky for packagers.

maxbachmann commented 2 months ago

The default is to try the system first and then fallback to pip. Version checking is implemented. Changing the preference order would be tricky for packagers.

Yes but this breaks in the following case:

In this case I still want the fallback version to be installed. This pretty much rules out any solutions that are implemented in cmake.

Couldn't this be controlled within CMake, e.g. adding an advanced option ALLOW_FALLBACK (normally I insist projects to add namespaced options, but for pip installed projects only I concur).

Maybe that would be possible. I don't know how I can allow python_add_library to fail without an error though.

although if it's strictly limited to PyPI packaging, I would just add appropriate --config-settings option to pip

I am not familiar with this option. So not sure how it would help me in my particular case.

LecrisUT commented 2 months ago

In this case I still want the fallback version to be installed

Aaah, now I understood. Wow that's a tricky set of conditions to support. Kudos for the extra consideration put. In that case I am truly lost. How are the pure-python files supposed to work in that case, maybe there is a design using multi-python project like hatch + hatchling have in their repository. I was considering adding that for my project as well, but didn't get around to it, so I don't have clear design to recommend.

how I can allow python_add_library to fail without an error though.

Why would it fail? This is defined in FindPython and it's just a wrapper of add_library. If you have an example of that failure can you share, and I'll try to forward it to upstream CMake?

although if it's strictly limited to PyPI packaging, I would just add appropriate --config-settings option to pip

I am not familiar with this option. So not sure how it would help me in my particular case.

Here's an example. Hopefully the naming is enough to indicate what it does in that case.

maxbachmann commented 2 months ago

How are the pure-python files supposed to work in that case, maybe there is a design using multi-python project like hatch + hatchling have in their repository.

I install the Python and C implementations side by side and import the fastest version available. You can see the implementation here: https://github.com/rapidfuzz/RapidFuzz/blob/8af875c1227ef0a034f46a1d6d46ab35c188d899/src/rapidfuzz/_utils.py#L109 and https://github.com/rapidfuzz/RapidFuzz/blob/main/src/rapidfuzz/fuzz.py

Essentially when the library is imported I try the following imports:

If the user specifically wants to get the C++ or Python variant he can override the behaviour by setting the RAPIDFUZZ_IMPLEMENTATION environment variable before importing the library.

This allows everyone to get as fast of an implementation as possible without losing the compatibility provided by a pure Python implementation. Most users will get the C++ implementation through binary wheels or their package manager anyway. So this really only affects more exotic targets.

The downsides of the implementation are:

Why would it fail? This is defined in FindPython and it's just a wrapper of add_library. If you have an example of that failure can you share, and I'll try to forward it to upstream CMake?

I just assumed that's what's registering it to be compiled as well. All I really wanted to say is that I don't know how I would mark this using an ALLOW_FALLBACK option so it doesn't fail on compilation failure. Really doesn't matter too much though since the fact that this requires cmake already rules this out

LecrisUT commented 2 months ago

I install the Python and C implementations side by side and import the fastest version available.

Interesting, it's different to my situation when I want to link to either system or python bundled library. In your case I would definitely split into 2 packages making the pure python an optional dependency but preferred in import workflow (just to give the user maximum control). Dealing with AVX2 preference would be tricky though, didn't think far enough around that option. But even without changing the order I think it would be a good design to split in 3 packages: interface, compiled, pure-python (or combine interface and pure-python if your fallback allows it). It would also make it clearer for the user what version they want to use. Don't know if there is an appropriate way to use markers?

packaging tools that create a single binary with Cpython and all dependencies injected like pyinstaller won't find what they have to include

Didn't know about pyinstaller, and indeed it seems fairly dangerous. That seems like a nightmare to support even in pure-python.

I just assumed that's what's registering it to be compiled as well

Ah, I thought you were concerned about configure stage failures. If it's build stage, you have a bit more control on scikit-build-core if you filter specific build targets.

henryiii commented 2 months ago

I think this can be supported with a few more override settings. We could provide:

Should we allow failed to take the condition of failure? I think the conditions would be "configure", "build", and "install". Though maybe it's better to just wait until someone asks for the ability to differentiate?

The cmake one would trigger if you were not on a platform with a known cmake wheel and there's no sufficient CMake present, or always if there's no system CMake. If this matches, then it affects if cmake is requested, as well.

This is what pure-python on non-supported CMake system would look like:

[[tool.scikit-build.overrides]]
if.no-system-cmake = "known-wheel"
wheel.cmake = false

Or if the build fails:

[[tool.scikit-build.overrides]]
if.failed = true
wheel.cmake = false

Custom message on failure:

[[tool.scikit-build.overrides]]
if.failed = true
error-message = "The build failed!"

I think this would add a lot of flexibility; for an unrelated example, you could fail on unsupported Pythons:

[[tool.scikit-build.overrides]]
if.python-version = ">=3.13"
error-message = "Python 3.13+ is not supported yet!"

I'm looking at a way to allow the cmake wheel to trigger the bootstrapping feature, and I think this would be a step in that direction too.

This is a very rough draft of what I'm thinking.

maxbachmann commented 2 months ago

I assume with this I could implement my handling as:

[tool.scikit-build]
wheel.cmake = true

[[tool.scikit-build.overrides]]
if.no-system-cmake = "known-wheel"
wheel.cmake = false

[[tool.scikit-build.overrides]]
if.failed = true
wheel.cmake = false

[[tool.scikit-build.overrides]]
if.any.env.CIBUILDWHEEL = "1"
if.any.env.CONDA_BUILD = "1"
if.any.env.PIWHEELS_BUILD = "1"
if.any.env.RAPIDFUZZ_BUILD_EXTENSION = "1"
wheel.cmake = true

# whether this is required would probably depend on
# 1) is the override above triggered both for the failure + non failure case
# 2) what does the failure handling do if the same config is set both for failure and non failure
[[tool.scikit-build.overrides]]
if.failed = true
if.any.env.CIBUILDWHEEL = "1"
if.any.env.CONDA_BUILD = "1"
if.any.env.PIWHEELS_BUILD = "1"
if.any.env.RAPIDFUZZ_BUILD_EXTENSION = "1"
error-message = "failed to build C++ Extension in a packaged build"
henryiii commented 2 months ago

I think we should make the error-message only appear if the build fails, and add a fail = true option. Otherwise, looks good, I think?

maxbachmann commented 2 months ago

Yes I think this would cover all of my needs and would be a lot more concise than my current implementation with scikit-build :+1:

I will start a branch porting over the existing set of features and extend it as things get added to scikit-build-core

henryiii commented 2 months ago

FYI,

if.any.env.CIBUILDWHEEL = true

Will do truthy/falsey values.

henryiii commented 2 months ago

Any ideas for better naming on if.no-system-cmake = "known-wheel"? I don't think that reads very well.

LecrisUT commented 2 months ago

No clue from my side. To me if.system-cmake reads like "did we use the cmake from system". Maybe if.cmake-vendor can be a more general form of that, although vendor sounds wrong to me here. And maybe if.have-cmake as a catch-all for if any cmake is found?

henryiii commented 2 months ago

To me if.system-cmake reads like "did we use the cmake from system".

That great, that's what it is supposed to be, except I'm not sure how to indicate the "and there's no known wheel for this platform" condition. if.no-system-cmake = "and-no-known-wheel"?

LecrisUT commented 2 months ago

To me if.(no-)system-cmake reads more like it would be a bool and would evaluate to (true)false when a wheel is used. if.have-cmake would be more explicit of no system cmake and no wheel.