pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`polars` and `polars-lts-cpu` as package dependencies #12880

Open e10v opened 7 months ago

e10v commented 7 months ago

Description

I'm planning to develop a new package that has polars as a dependency. There is also the polars-lts-cpu package for old processors. The issue is that the two cannot be installed at the same time because they ship the same module name. While a single module name is convenient for project repositories, it's not at all convenient for developing packages that depend on polars.

I've searched and found a similar issue on Stack Overflow. It seems the only option is to add both polars and polars-lts-cpu as optional dependencies. That means users will have a broken package if they install it without optional dependencies.

Could you reconsider your solution for old processors? Module name polars_lts_cpu would be a better choice from my point of view. Any user could just import it that way:

try:
    import polars_lts_cpu as pl
except ModuleNotFoundError:
    import polars as pl

And the dependency problem would be solved.

e10v commented 7 months ago

Actually, a different module name is useful not only for packages but for other projects too. Suppose I have a project repo with polars as a dependency. I can execute it on my modern laptop, as well as on a server with old Xeon CPUs (which work only with polars-lts-cpu). So I would have the same issue.

EtienneT commented 6 months ago

Makes total sense. I have a similar problem where I deploy to an older production server with a Xeon processor. My local computer is more modern and runs polars without any problem, but when deploying to production, I get an exception saying the processor is missing features.
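The missing-feature error can be anticipated before polars is imported. A minimal sketch, assuming a Linux host (where `/proc/cpuinfo` exposes a `flags` line) and assuming that AVX, AVX2, FMA, BMI1 and BMI2 are among the features the default build expects. That list is an assumption made for illustration; the Polars installation docs are the authoritative source.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: a subset of the x86 features the default polars wheel expects.
ASSUMED_REQUIRED_FLAGS = frozenset({"avx", "avx2", "fma", "bmi1", "bmi2"})


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text: str, required=ASSUMED_REQUIRED_FLAGS) -> set[str]:
    """Return the required flags that the CPU does not report."""
    return set(required) - parse_cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        # An empty list means every assumed flag is present.
        print("missing flags:", sorted(missing_flags(cpuinfo.read_text())))
```

A non-empty result on the production server would suggest that polars-lts-cpu is the build to install there.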

EtienneT commented 5 months ago

I ran into a new problem related to this. As soon as your own dependencies include a package that lists polars as one of its dependencies, this is going to cause problems.

I had the dagster-duckdb-polars package in my project dependencies. dagster-duckdb-polars has a direct dependency on polars[pyarrow].

To get around the problem of polars vs polars-lts-cpu in production, I had defined [project.optional-dependencies] in my pyproject.toml, and I was installing polars only in dev and polars-lts-cpu only in production.

But as soon as I included dagster-duckdb-polars, since it depends on polars[pyarrow], it would also install polars in production no matter which optional dependencies I told it to install, making my project crash in production. Not too sure what would be the best way to address that.
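One way to catch this situation early is a startup check that lists which Polars distributions are actually installed. The sketch below uses only `importlib.metadata` from the standard library; the helper name `installed_variants` is made up for this example, and it only reports what is installed, it does not fix the conflict.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version

# Distribution names published on PyPI that all provide the `polars` module.
KNOWN_VARIANTS = ("polars", "polars-lts-cpu", "polars-u64-idx")


def installed_variants(names=KNOWN_VARIANTS) -> dict[str, str]:
    """Map each installed distribution name to its version string."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Not installed in this environment; skip it.
            continue
    return found


if __name__ == "__main__":
    variants = installed_variants()
    if len(variants) > 1:
        # More than one variant means they overwrite each other's files.
        print("conflicting Polars builds installed:", variants)
```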

tim-stephenson commented 5 months ago

Also running into this issue. Listing polars as a dependency in a package seems to have been overlooked, given that the x86 binaries for newer and older CPUs are published under completely different package names. Potentially the answer from the polars team is that you shouldn't make Python packages which depend on polars, only using polars as an end-user. Or you should only compile from source when deploying your package onto an old Xeon server.

The odd thing is it seems other libraries (e.g. numpy) have solved this problem through runtime-dispatching. There is likely a reason I do not understand as to why this solution was not advanced.
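For readers unfamiliar with the term, runtime dispatching means shipping several implementations in one binary and choosing one after inspecting the CPU. The following is a toy Python sketch of the idea only; it is not how numpy or polars implement it, and `sum_avx2` is a stand-in for a kernel that a real build would compile with AVX2 enabled.

```python
from __future__ import annotations

from typing import Callable, Iterable

Kernel = Callable[[Iterable[float]], float]


def sum_baseline(values: Iterable[float]) -> float:
    """Portable implementation that runs on any CPU."""
    return float(sum(values))


def sum_avx2(values: Iterable[float]) -> float:
    """Stand-in for an AVX2-compiled kernel (same result in this sketch)."""
    return float(sum(values))


# Ordered from most to least specialized: (required CPU flags, kernel).
KERNELS: list[tuple[frozenset[str], Kernel]] = [
    (frozenset({"avx2"}), sum_avx2),
    (frozenset(), sum_baseline),
]


def select_kernel(cpu_flags: set[str]) -> Kernel:
    """Return the first kernel whose required flags are all available."""
    for required, kernel in KERNELS:
        if required <= cpu_flags:
            return kernel
    raise RuntimeError("no kernel matches this CPU")
```

The selection happens once, at import or first call, so a single package name works on both old and new CPUs at the cost of carrying every kernel in the binary.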

Another solution which is out of scope from polars (more of a python software foundation request) is adding cpu feature flags to PEP 508. I personally do not like this idea, as I find the solution of making multiple packages for different minimum cpu extensions to add a lot of complexity for a package which depends on polars.

My preferred solution to having a package support a range of cpu features would be to make different wheel files with different 'platform tags' for different levels of cpu extensions. PEP 491 and documentation on platform compatibility tags give a guide to the current state of how this works, and what would need to change to support this. Essentially the 'platform tags' would need to allow for more specificity than just cpu architecture + OS, also including flags for required cpu architecture extensions. The benefit of this solution over runtime dispatching is smaller binaries for end-users.
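To make the limitation concrete, here is a small sketch that splits a wheel filename into its tags, following the filename convention from the wheel specification (PEP 427, carried over into PEP 491). The filename in the usage note is illustrative, and the parser is deliberately simplified compared to what installers do.

```python
from __future__ import annotations

from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: str | None
    python: str
    abi: str
    platform: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split '{dist}-{version}(-{build})?-{python}-{abi}-{platform}.whl'."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, python, abi, platform = parts
    else:
        raise ValueError(f"unexpected number of components: {filename!r}")
    return WheelTags(dist, version, build, python, abi, platform)
```

For a name such as `polars-0.20.0-cp38-abi3-manylinux_2_17_x86_64.whl`, the platform tag is `manylinux_2_17_x86_64`: it carries the OS/libc baseline and the architecture, but nothing about extensions such as AVX2, which is the gap described above.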

One could see this problem (older CPUs which do not support the minimum CPU extensions) as one slowly going away, as the number of x86 CPUs without the minimum x86 extensions is diminishing by the day. However, I would guess this will likely be a continuous problem, as computing architectures continue to add extensions to achieve greater performance and software libraries aim to conditionally use such extensions based on their presence.

lmmx commented 3 months ago

> Potentially the answer from the polars team is that you shouldn't make python packages which depend on polars, only using polars as an end-user.

I agree that this is the current state of the "installation story" for Polars, but it's not made clear (and as I've seen noted elsewhere, to do so would mean that installing such a package would result in non-functional software, i.e. not "batteries included").

This is rather a headache for portability 😞

stinodego commented 3 months ago

Just commenting to let you know we are aware of this issue. However, there is not a clear, easy solution here. Therefore, this has low priority for us right now.

Package authors with a Polars dependency can choose to compile multiple versions of their package just like Polars does. We realize this is not great, but it's probably your best bet if you want to support old CPUs.

thomasaarholt commented 3 months ago

I think this is ideally fixed upstream as a PEP. I've encountered this before on other packages.

What do you think of proposing a default optional dependency so that the following holds true:

pip install my-package -> pip install 'my-package[cpu]'
pip install 'my-package[lts-cpu]' -> pip install 'my-package[lts-cpu]' and not [cpu]

[project]
name = "my-package"
description = "A package that depends on polars, but wants support for the regular and lts cpu versions"
default-optional-dependency = "cpu"

# No unconditional dependencies; `dependencies` is an array key under [project].
dependencies = []

[project.optional-dependencies]
cpu = ["polars"]
lts-cpu = ["polars-lts-cpu"]

noirbizarre commented 3 months ago

Hey 👋🏼

We had the same discussion on pdm.

I'm just sharing a possible way of handling this (copy-pasted from my post in that discussion).

IMHO, the proper solution for polars would be:

It can be something basic like:

try:
    # Prefer the LTS build when it is installed.
    import polars_lts_cpu.feature as feature
except ImportError:
    # Otherwise fall back to the default build.
    import polars.feature as feature

def public_function():
    return feature.do_something()

Note: I believe it's even possible to use importlib.metadata and entry points to avoid the try/except and to load the available support more dynamically at runtime.

This way, packages depending on polars just need to add polars as a dependency. End-user installations can specify polars, polars[lts-cpu] or polars[u64-idx]; it will always be resolved as polars but will install the expected extra CPU support and properly resolve it at runtime.
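The entry-point idea from the note could look roughly like the sketch below. The group name `polars.backends` and the backend names are hypothetical; nothing in Polars registers such entry points today. Only `importlib.metadata.entry_points` and `EntryPoint.load` are real standard-library APIs here.

```python
from __future__ import annotations

from importlib.metadata import entry_points

# Hypothetical entry-point group that implementation packages would register.
GROUP = "polars.backends"


def discover_backends(group: str = GROUP) -> dict:
    """Map entry-point name to EntryPoint for every backend in the group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict of lists.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


def load_backend(preferred, group: str = GROUP):
    """Load the first available backend in preference order, or return None."""
    available = discover_backends(group)
    for name in preferred:
        if name in available:
            return available[name].load()
    return None
```

A shim could then call something like `load_backend(["lts-cpu", "default"])` and re-export whatever it gets back, without hard-coding import names in a try/except chain.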

douglas-raillard-arm commented 2 months ago

Another (untested) option might be to keep CPU-specific versions of polars as "implementation" packages and make a single polars package that is only distributed as a source distribution in PyPI (not as a wheel since wheels pre-run setup.py and do not include it in the package). Its setup.py can dynamically detect CPU features and make itself depend on the correct internal implementation. The main polars package can then simply re-export the implementation package that was installed.

If for some reason a user requires a specific implementation, they can always depend on it directly and use the corresponding imports rather than using the generic polars shim.

Handling polars-u64-idx can also be done via e.g. polars[u64-idx], so that setup.py becomes the central decision point that can raise if the combination that is required does not exist.
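The decision logic such a setup.py would need might be sketched as follows. Everything here is hypothetical: the implementation package names, the flag list, and the assumption that no LTS build exists with 64-bit indexes. How the shim would learn that u64-idx was requested is left open.

```python
from __future__ import annotations

# Assumption: flags that distinguish a "modern" CPU from one needing the LTS build.
ASSUMED_MODERN_FLAGS = frozenset({"avx2", "fma"})

# Hypothetical names for the prebuilt implementation packages.
IMPL_DEFAULT = "polars-impl-default"
IMPL_LTS_CPU = "polars-impl-lts-cpu"
IMPL_U64_IDX = "polars-impl-u64-idx"


def choose_requirement(cpu_flags: set[str], want_u64_idx: bool = False) -> str:
    """Pick the implementation package the shim should depend on."""
    modern = ASSUMED_MODERN_FLAGS <= cpu_flags
    if want_u64_idx:
        if not modern:
            # Assumed: there is no LTS build with 64-bit indexes.
            raise RuntimeError("u64-idx is not available for CPUs without AVX2/FMA")
        return IMPL_U64_IDX
    return IMPL_DEFAULT if modern else IMPL_LTS_CPU


# In the shim's setup.py this would feed setuptools, for example:
#     setup(name="polars", install_requires=[choose_requirement(detected_flags)])
```

Because the choice is made when setup.py runs, it only works if the shim is installed from an sdist on the target machine, which is the constraint described above.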

tim-stephenson commented 2 months ago

> Another (untested) option might be to keep CPU-specific versions of polars as "implementation" packages and make a single polars package that is only distributed as a source distribution in PyPI (not as a wheel since wheels pre-run setup.py and do not include it in the package). Its setup.py can dynamically detect CPU features and make itself depend on the correct internal implementation. The main polars package can then simply re-export the implementation package that was installed.

> If for some reason a user requires a specific implementation, they can always depend on it directly and use the corresponding imports rather than using the generic polars shim.

> Handling polars-u64-idx can also be done via e.g. polars[u64-idx], so that setup.py becomes the central decision point that can raise if the combination that is required does not exist.

The industry standard is to provide binaries for any major Python package. The user expectation is that "pip install polars" just works on Mac, Linux, and Windows, for x86 and ARM.

Users can already do what you are describing (build binaries themselves, specific to their OS and CPU instruction set).

douglas-raillard-arm commented 2 months ago

The only part that would be shipped as sdist is the shim. The implementation packages would obviously come prebuilt in this scenario, there is no reason to let the user deal with that.

EDIT:

> Users can already do what you are describing (build binaries themselves specific to their os/cpu instruction set)

This is not what I'm describing. The proposition includes the following packages:

The polars top-level shim would not be shipped as a wheel, to preserve the execution of setup.py on the user's machine. That setup.py would detect which instructions the machine supports, pick one of the appropriate implementation packages and depend on it.

In that scenario, I think it's a better idea if each of the "implementation packages" uses a different import name (e.g. import polars_lts_cpu) and lets the shim package re-export it. The shim can have a priority order from most specialized to least specialized implementations, and/or an env var to let the end user pick one of the installed implementations. That would allow users who really need direct access to the implementation to get it, and would allow multiple implementations to be installed at once rather than overwriting files and leaving pip confused. As it stands, what you get from import polars after having run both pip install polars and pip install polars-lts-cpu depends on the state of the venv before running those commands, which is clearly not good.
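The priority order plus env-var override could be sketched like this. The import names `polars_default` and `polars_lts_cpu` and the variable `POLARS_IMPL` are hypothetical, chosen only for illustration. Note that catching `ImportError` covers "not installed"; it does not protect against importing a build the CPU cannot run.

```python
from __future__ import annotations

import importlib
import os

# Hypothetical import names, ordered from most to least specialized.
CANDIDATES = ("polars_default", "polars_lts_cpu")
# Hypothetical environment variable that forces a specific implementation.
ENV_VAR = "POLARS_IMPL"


def load_implementation(candidates=CANDIDATES, env=None):
    """Import the forced implementation, or the first candidate that is installed."""
    env = os.environ if env is None else env
    forced = env.get(ENV_VAR)
    names = (forced,) if forced else tuple(candidates)
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed; try the next candidate.
            continue
    raise ImportError(f"none of these implementations could be imported: {names}")
```

The shim's `__init__.py` would then re-export the public names of whatever module `load_implementation()` returns, while users who need a specific build could still import it directly.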