MarcoGorelli opened 2 months ago
I was thinking https://github.com/altair-viz/vega_datasets might be a good candidate for experimenting with this?
vega_datasets hasn't been updated in a while, and looking at the code, it seems tightly coupled to pandas.
However, we're using it for a lot of the examples, e.g. https://altair-viz.github.io/gallery/bar_faceted_stacked.html
Since altair has been narwhalified, and there are a few different file formats across 70 datasets, it could prove a good testing resource, if nothing else.
yup, definitely, thanks @dangotbanned !
I think it's a nice idea, and it would apply to cases where your default backend is, for instance, pandas, but you wanna use polars if it exists. But if the devs have control over what the dependency should be, then this doesn't change much.
For instance, if we change our dependency from pandas to polars, then this doesn't really add much, does it?
The use case I was thinking of was that a Polars user could use Fairlearn without needing to install pandas (nor probably PyArrow - maybe not in 2024, but I'd be surprised if it didn't become a pandas required dependency eventually)
So imagine you're a library that requires some sort of a dataframe backend. It would be very impractical to not have any default dataframe backend in your install_requires. That means the package maintainers need to decide on that dependency. Unless something like this at some point goes through: https://discuss.python.org/t/conditional-package-install-depending-on-other-packages-in-environment/4140/4
On the other hand, as the package maintainers we could make narwhals the dependency, but that would only work if narwhals had a default dependency if nothing was installed, which is not something narwhals should do IMO.
So basically, on a fresh env, pip install fairlearn should result in an environment which can run fairlearn.
So basically, on a fresh env, pip install fairlearn should result in an environment which can run fairlearn.
interesting, thanks... and it's kinda unfortunate that there's not a way to do "pip install minus", so that pip install fairlearn would by default come with some backend (say pandas) but people could choose to install it without it if necessary
🤔 not really sure if this could work at all in Fairlearn then? Unless your only reason for using it was a stable api across pandas versions 😉
So I think on the fairlearn side, it makes sense to add narwhals and polars as dependencies and remove pandas, and the benefit of this proposal would be to kinda never touch the polars interface, and if ever there's a better backend than polars installed, it would automatically be used.
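(For context, a rough sketch of what that could look like: the internal computation is written once against narwhals and runs on whichever backend the user's data already comes from. The function and column names below are made up for illustration.)

import narwhals as nw

def selection_rate_by_group(data, group_col: str, pred_col: str):
    # `data` may be a pandas, Polars, or PyArrow object; narwhals wraps it
    # and the same expression code runs on whichever backend it came from.
    df = nw.from_native(data, eager_only=True)
    result = df.group_by(group_col).agg(nw.col(pred_col).mean())
    return result.to_native()  # handed back in the caller's own backend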
cc @tamaraatanasoska
that could make sense, thanks for explaining
just out of interest, if you're choosing a backend, why not use ibis? presumably it's that they still require pyarrow for all backends? if so, a feature request to them to not require pyarrow for all backends would probably be a good idea 😉
Never really gave ibis a thought, and def don't want anything to do with pyarrow as a dependency 😁
@adrinjalali can you explain the rationale behind:
don't want anything to do with pyarrow as a dependency
@lostmygithubaccount pyarrow has had an interesting history, and my personal decision for not wanting to depend on it is multi-faceted.
I remember in 2018-2019 being quite excited about it. We had discussions about supporting it in scikit-learn, and at some point I was happy to even work on the library.
But then it became a massive piece of software which is a kind of dependency that you'd only want to add if you really have to. Otherwise rather keep the environments lightweight.
At least on pypi, the only dependency is numpy, but the package itself is 40MB. However, on conda-forge, this is what happens if you try to install pyarrow on a fresh env:
$ micromamba install pyarrow
conda-forge/noarch 16.4MB @ 4.2MB/s 4.0s
conda-forge/linux-64 38.0MB @ 5.2MB/s 7.5s
Transaction
Prefix: /home/adrin/micromamba/envs/delete
Updating specs:
- pyarrow
Package Version Build Channel Size
─────────────────────────────────────────────────────────────────────────────────────────────────
Install:
─────────────────────────────────────────────────────────────────────────────────────────────────
+ libstdcxx 14.1.0 hc0a3c3a_1 conda-forge 4MB
+ libutf8proc 2.8.0 h166bdaf_0 conda-forge Cached
+ c-ares 1.33.1 heb4867d_0 conda-forge 183kB
+ libssh2 1.11.0 h0841786_0 conda-forge Cached
+ keyutils 1.6.1 h166bdaf_0 conda-forge Cached
+ aws-c-common 0.9.28 hb9d3cd8_0 conda-forge 236kB
+ s2n 1.5.2 h7b32b05_0 conda-forge 352kB
+ libbrotlicommon 1.1.0 hb9d3cd8_2 conda-forge 69kB
+ python_abi 3.12 5_cp312 conda-forge 6kB
+ libiconv 1.17 hd590300_2 conda-forge Cached
+ libevent 2.1.12 hf998b51_1 conda-forge Cached
+ libev 4.33 hd590300_2 conda-forge Cached
+ libgfortran5 14.1.0 hc5f4f2c_1 conda-forge 1MB
+ libedit 3.1.20191231 he28a2e2_2 conda-forge Cached
+ libstdcxx-ng 14.1.0 h4852527_1 conda-forge 52kB
+ aws-checksums 0.1.18 h756ea98_11 conda-forge 50kB
+ aws-c-cal 0.7.4 hfd43aa1_1 conda-forge 48kB
+ aws-c-compression 0.2.19 h756ea98_1 conda-forge 19kB
+ aws-c-sdkutils 0.1.19 h756ea98_3 conda-forge 56kB
+ libbrotlienc 1.1.0 hb9d3cd8_2 conda-forge 282kB
+ libbrotlidec 1.1.0 hb9d3cd8_2 conda-forge 33kB
+ libgfortran 14.1.0 h69a702a_1 conda-forge 52kB
+ icu 75.1 he02047a_0 conda-forge 12MB
+ lz4-c 1.9.4 hcb278e6_0 conda-forge Cached
+ gflags 2.2.2 he1b5a44_1004 conda-forge Cached
+ libnghttp2 1.58.0 h47da74e_1 conda-forge Cached
+ krb5 1.21.3 h659f571_0 conda-forge 1MB
+ libthrift 0.20.0 h0e7cc3e_1 conda-forge 417kB
+ libabseil 20240116.2 cxx17_he02047a_1 conda-forge 1MB
+ libcrc32c 1.1.2 h9c3ff4c_0 conda-forge Cached
+ zstd 1.5.6 ha6fb4c9_0 conda-forge Cached
+ snappy 1.2.1 ha2e4443_0 conda-forge 42kB
+ aws-c-io 0.14.18 hc2627b9_9 conda-forge 159kB
+ libgfortran-ng 14.1.0 h69a702a_1 conda-forge 52kB
+ libxml2 2.12.7 he7c6b58_4 conda-forge 707kB
+ glog 0.7.1 hbabe93e_0 conda-forge 143kB
+ libre2-11 2023.09.01 h5a48ba9_2 conda-forge Cached
+ libprotobuf 4.25.3 h08a7969_0 conda-forge Cached
+ libcurl 8.10.0 hbbe4b11_0 conda-forge 425kB
+ aws-c-event-stream 0.4.3 h235a6dd_1 conda-forge 54kB
+ aws-c-http 0.8.9 h5e77a74_0 conda-forge 198kB
+ libopenblas 0.3.27 pthreads_hac2b453_1 conda-forge Cached
+ re2 2023.09.01 h7f4b329_2 conda-forge Cached
+ orc 2.0.2 h669347b_0 conda-forge 1MB
+ azure-core-cpp 1.13.0 h935415a_0 conda-forge 338kB
+ aws-c-mqtt 0.10.5 h0009854_0 conda-forge 194kB
+ aws-c-auth 0.7.30 hec5e740_0 conda-forge 107kB
+ libblas 3.9.0 23_linux64_openblas conda-forge Cached
+ libgrpc 1.62.2 h15f2491_0 conda-forge Cached
+ azure-identity-cpp 1.8.0 hd126650_2 conda-forge 200kB
+ azure-storage-common-cpp 12.7.0 h10ac4d7_1 conda-forge 143kB
+ aws-c-s3 0.6.5 hbaf354b_4 conda-forge 113kB
+ libcblas 3.9.0 23_linux64_openblas conda-forge Cached
+ liblapack 3.9.0 23_linux64_openblas conda-forge Cached
+ libgoogle-cloud 2.29.0 h435de7b_0 conda-forge 1MB
+ azure-storage-blobs-cpp 12.12.0 hd2e3451_0 conda-forge 523kB
+ aws-crt-cpp 0.28.2 h6c0439f_6 conda-forge 350kB
+ numpy 2.1.1 py312h58c1407_0 conda-forge 8MB
+ libgoogle-cloud-storage 2.29.0 h0121fbd_0 conda-forge 782kB
+ azure-storage-files-datalake-cpp 12.11.0 h325d260_1 conda-forge 274kB
+ aws-sdk-cpp 1.11.379 h5a9005d_9 conda-forge 3MB
+ libarrow 17.0.0 hc80a628_14_cpu conda-forge 9MB
+ libarrow-acero 17.0.0 h5888daf_14_cpu conda-forge 608kB
+ libparquet 17.0.0 h39682fd_14_cpu conda-forge 1MB
+ pyarrow-core 17.0.0 py312h9cafe31_1_cpu conda-forge 5MB
+ libarrow-dataset 17.0.0 h5888daf_14_cpu conda-forge 585kB
+ libarrow-substrait 17.0.0 hf54134d_14_cpu conda-forge 550kB
+ pyarrow 17.0.0 py312h9cebb41_1 conda-forge 26kB
Summary:
Install: 93 packages
Total download: 66MB
─────────────────────────────────────────────────────────────────────────────────────────────────
Confirm changes: [Y/n]
As a maintainer of a library which has nothing to do with cloud computing, why on earth would I want to have aws AND azure libraries as transitive dependencies? Even if I was doing cloud stuff, I'd probably be working with one of them, not both. That's an insane amount of bloatware installed when pulling pyarrow from conda-forge.
On top of that, there was the time when pyarrow simply gave up on pypi and others had to step in, due to C++ compat issues. I understand the challenges of pypi, but this doesn't give me confidence.
And the cherry on top is your employer firing pyarrow maintainers, including some of my friends, who have been working on the project for a while. Not only does that not make me want to have the lib as a dependency, it also doesn't give me confidence in the future of the project.
But then it became a massive piece of software which is a kind of dependency that you'd only want to add if you really have to. Otherwise rather keep the environments lightweight.
out of curiosity what's your bar for a lightweight Python environment? some # of MBs?
I don't personally use conda but it does seem like you can get similarly sized installations as PyPI: https://arrow.apache.org/docs/python/install.html#python-conda-differences
I know there are ongoing efforts to reduce the installation size of PyArrow further (and reduce dependencies)
@MarcoGorelli this looks fun :-) Do you think this is ready to accept a PR? May I give it a go during the sprint tomorrow?
Thanks @Cheukting ! I'm not 100% sure about this one, as I'd originally misunderstood the Fairlearn use case. Maybe we can punt on it for the time being
We'll open some issues later which we've reserved specially for the sprint though so there'll be plenty of interesting things to work on 😎
Ok, thanks @MarcoGorelli but in the future if this is needed I am happy to help too
Would it be feasible for from_dict to optionally accept an existing narwhals DataFrame as an argument, and then use this same backend for the returned DataFrame? Something like:
values_df = nw.from_dict({'a': [1, 2, 3], 'b': [4,5,6]}, like=other_df)
Or perhaps there is a better representation of a backend that could be used instead of the DataFrame itself.
The motivation would be to make it easy to have a function that accepts and returns a Narwhals DataFrame using the same backend, but where the resulting DataFrame isn't computed directly from the input DataFrame.
I can create a separate issue for this, but for my actual use case in VegaFusion, I'd actually want something like nw.from_arrow_capsule(cap, like=other_df). The flow is that I'd like to use Narwhals for basic column projection and schema inspection, and then use the Arrow PyCapsule API to pass the result to Rust. Then in some cases, the Rust logic will return a new Arrow result in PyCapsule form, and it would be great to be able to use Narwhals to wrap this result using the same backend as the input.
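(Roughly, a sketch of that flow: nw.from_arrow_capsule is the hypothetical piece being requested here, rust_fn is a stand-in for the Rust entry point, and the export step assumes the native frame implements __arrow_c_stream__, as Polars and PyArrow do.)

import narwhals as nw

def run_through_rust(data, columns, rust_fn):
    df = nw.from_native(data, eager_only=True)
    projected = df.select(columns)  # basic column projection / schema inspection via narwhals
    capsule = projected.to_native().__arrow_c_stream__()  # Arrow PyCapsule handed to Rust
    result_capsule = rust_fn(capsule)  # Rust may return a new Arrow result as a PyCapsule
    # hypothetical: wrap the Arrow result using the same backend as the input
    return nw.from_arrow_capsule(result_capsule, like=df)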
Hey @jonmmease
If I've understood the request, I think you can do this already:
In [13]: other_df = nw.from_native(pl.DataFrame({'a': [2, 3]}))
In [14]: values_df = nw.from_dict({'a': [1, 2, 3], 'b': [4,5,6]}, native_namespace=nw.get_native_namespace(other_df))
In [15]: values_df
Out[15]:
┌───────────────────────────────────────┐
| Narwhals DataFrame |
| Use `.to_native` to see native output |
└───────────────────────────────────────┘
In [16]: values_df.to_native()
Out[16]:
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 4 │
│ 2 ┆ 5 │
│ 3 ┆ 6 │
└─────┴─────┘
Ah, nice. I had missed the native_namespace bit. I saw this issue and assumed that from_dict was new, but now I understand that it's the default behavior. Sorry for the noise!
We could allow users to do something like:
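(illustrative sketch, assuming an eager constructor like from_dict with no backend argument)
values_df = nw.from_dict({'a': [1, 2, 3], 'b': [4, 5, 6]})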
without specifying which backend they want to use. Then, we use whatever they have installed, following some priority list.
If there's demand, we could allow users to customise the priority order (e.g. first try pandas, then Polars, then PyArrow...)
Use case: in Fairlearn, they use dataframes internally to do some calculations, but this is hidden from the user. The user shouldn't care which dataframe Fairlearn uses internally to do those calculations, so long as it's something they have installed
cc @adrinjalali in case you have comments / requests 🙏