narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/

feat: allow for creating new_series / from_dict without specifying a backend #876

Open MarcoGorelli opened 2 months ago

MarcoGorelli commented 2 months ago

We could allow users to do something like:

df = nw.from_dict({'a': [1, 2, 3], 'b': [4,5,6]})

without specifying which backend they want to use. Then, we'd use whatever they have installed, following some priority list.

If there's demand, we could allow users to customise the priority order (e.g. first try pandas, then Polars, then PyArrow...)
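
To make the idea concrete, here's a minimal sketch of how such backend resolution could work, assuming a hypothetical priority order of polars, then pandas, then pyarrow (resolve_backend is purely illustrative, not a narwhals API):

import importlib
import importlib.util

# Hypothetical default priority order - purely illustrative.
PRIORITY = ("polars", "pandas", "pyarrow")

def resolve_backend(priority=PRIORITY):
    # Return the first backend module from `priority` that is installed.
    for name in priority:
        if importlib.util.find_spec(name) is not None:
            return importlib.import_module(name)
    raise ModuleNotFoundError(f"No supported dataframe backend found; tried {priority}")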

Use case: in Fairlearn, they use dataframes internally to do some calculations, but this is hidden from the user. The user shouldn't care which dataframe Fairlearn uses internally to do those calculations, so long as it's something they have installed

cc @adrinjalali in case you have comments / requests 🙏

dangotbanned commented 2 months ago

I was thinking https://github.com/altair-viz/vega_datasets might be a good candidate for experimenting with this?

vega_datasets hasn't been updated in a while, and looking at the code, it seems tightly coupled to pandas. However, we're using it for a lot of the examples, e.g. https://altair-viz.github.io/gallery/bar_faceted_stacked.html

Since altair has been narwhalified, and there are a few different file formats across its 70 datasets, it could prove a good testing resource, if nothing else

MarcoGorelli commented 2 months ago

yup, definitely, thanks @dangotbanned !

adrinjalali commented 2 months ago

I think it's a nice idea, and it would apply to cases where your default backend is, for instance, pandas, but you want to use polars if it's installed. But if the devs have control over what the dependency should be, then this doesn't change much.

For instance, if we change our dependency from pandas to polars, then this doesn't really add much, does it?

MarcoGorelli commented 2 months ago

The use case I was thinking of was that a Polars user could use Fairlearn without needing to install pandas (nor, probably, PyArrow - maybe not in 2024, but I'd be surprised if PyArrow didn't eventually become a required dependency of pandas)

adrinjalali commented 2 months ago

So imagine you're a library that requires some sort of a dataframe backend. It would be very impractical to not have any default dataframe backend in your install_requires. That means the package maintainers need to decide on that dependency. Unless something like this at some point goes through: https://discuss.python.org/t/conditional-package-install-depending-on-other-packages-in-environment/4140/4

On the other hand, as the package maintainers we could make narwhals the dependency, but that would only work if narwhals itself fell back to a default backend when nothing was installed, which is not something narwhals should do IMO.

So basically, on a fresh env, pip install fairlearn should result in an environment which can run fairlearn.
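
To illustrate the packaging constraint, here's a hedged setup.py sketch (somelib and the extra name are hypothetical): the default backend has to live in install_requires, and extras can only ever add dependencies, never remove them:

from setuptools import setup

setup(
    name="somelib",  # hypothetical package
    install_requires=["narwhals", "pandas"],  # pandas as the default backend
    # An extra can add an alternative backend, but pip has no mechanism
    # to subtract pandas from the default install.
    extras_require={"polars": ["polars"]},
)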

MarcoGorelli commented 2 months ago

So basically, on a fresh env, pip install fairlearn should result in an environment which can run fairlearn.

interesting, thanks... and it's kinda unfortunate that there's no way to do "pip install minus", so that pip install fairlearn would come with some backend by default (say pandas), but people could choose to install it without that backend if necessary

🤔 not really sure this could work in Fairlearn then? Unless your only reason for using narwhals was a stable API across pandas versions 😉

adrinjalali commented 2 months ago

So I think on the fairlearn side, it makes sense to add narwhals and polars as dependencies and remove pandas. The benefit of this proposal would be that we'd never touch the polars interface directly, and if a better backend than polars were ever installed, it would automatically be used.

cc @tamaraatanasoska

MarcoGorelli commented 2 months ago

that could make sense, thanks for explaining

just out of interest, if you're choosing a backend, why not use ibis? presumably it's that they still require pyarrow for all backends? if so, a feature request asking them to drop that requirement would probably be a good idea 😉

adrinjalali commented 2 months ago

Never really gave ibis a thought, and def don't want anything to do with pyarrow as a dependency 😁

lostmygithubaccount commented 1 month ago

@adrinjalali can you explain the rationale behind:

don't want anything to do with pyarrow as a dependency

adrinjalali commented 1 month ago

@lostmygithubaccount pyarrow has had an interesting history, and my personal decision not to depend on it is multi-faceted.

I remember in 2018-2019 being quite excited about it. We had discussions about supporting it in scikit-learn, and at some point I was happy to even work on the library.

But then it became a massive piece of software, the kind of dependency you'd only want to add if you really have to; otherwise, I'd rather keep environments lightweight.

At least on PyPI, the only dependency is numpy, but the package itself is 40MB. However, on conda-forge, this is what happens if you try to install pyarrow in a fresh env:

$ micromamba install pyarrow
conda-forge/noarch                                  16.4MB @   4.2MB/s  4.0s
conda-forge/linux-64                                38.0MB @   5.2MB/s  7.5s

Transaction

  Prefix: /home/adrin/micromamba/envs/delete

  Updating specs:

   - pyarrow

  Package                                  Version  Build                Channel           Size
─────────────────────────────────────────────────────────────────────────────────────────────────
  Install:
─────────────────────────────────────────────────────────────────────────────────────────────────

  + libstdcxx                               14.1.0  hc0a3c3a_1           conda-forge        4MB
  + libutf8proc                              2.8.0  h166bdaf_0           conda-forge     Cached
  + c-ares                                  1.33.1  heb4867d_0           conda-forge      183kB
  + libssh2                                 1.11.0  h0841786_0           conda-forge     Cached
  + keyutils                                 1.6.1  h166bdaf_0           conda-forge     Cached
  + aws-c-common                            0.9.28  hb9d3cd8_0           conda-forge      236kB
  + s2n                                      1.5.2  h7b32b05_0           conda-forge      352kB
  + libbrotlicommon                          1.1.0  hb9d3cd8_2           conda-forge       69kB
  + python_abi                                3.12  5_cp312              conda-forge        6kB
  + libiconv                                  1.17  hd590300_2           conda-forge     Cached
  + libevent                                2.1.12  hf998b51_1           conda-forge     Cached
  + libev                                     4.33  hd590300_2           conda-forge     Cached
  + libgfortran5                            14.1.0  hc5f4f2c_1           conda-forge        1MB
  + libedit                           3.1.20191231  he28a2e2_2           conda-forge     Cached
  + libstdcxx-ng                            14.1.0  h4852527_1           conda-forge       52kB
  + aws-checksums                           0.1.18  h756ea98_11          conda-forge       50kB
  + aws-c-cal                                0.7.4  hfd43aa1_1           conda-forge       48kB
  + aws-c-compression                       0.2.19  h756ea98_1           conda-forge       19kB
  + aws-c-sdkutils                          0.1.19  h756ea98_3           conda-forge       56kB
  + libbrotlienc                             1.1.0  hb9d3cd8_2           conda-forge      282kB
  + libbrotlidec                             1.1.0  hb9d3cd8_2           conda-forge       33kB
  + libgfortran                             14.1.0  h69a702a_1           conda-forge       52kB
  + icu                                       75.1  he02047a_0           conda-forge       12MB
  + lz4-c                                    1.9.4  hcb278e6_0           conda-forge     Cached
  + gflags                                   2.2.2  he1b5a44_1004        conda-forge     Cached
  + libnghttp2                              1.58.0  h47da74e_1           conda-forge     Cached
  + krb5                                    1.21.3  h659f571_0           conda-forge        1MB
  + libthrift                               0.20.0  h0e7cc3e_1           conda-forge      417kB
  + libabseil                           20240116.2  cxx17_he02047a_1     conda-forge        1MB
  + libcrc32c                                1.1.2  h9c3ff4c_0           conda-forge     Cached
  + zstd                                     1.5.6  ha6fb4c9_0           conda-forge     Cached
  + snappy                                   1.2.1  ha2e4443_0           conda-forge       42kB
  + aws-c-io                               0.14.18  hc2627b9_9           conda-forge      159kB
  + libgfortran-ng                          14.1.0  h69a702a_1           conda-forge       52kB
  + libxml2                                 2.12.7  he7c6b58_4           conda-forge      707kB
  + glog                                     0.7.1  hbabe93e_0           conda-forge      143kB
  + libre2-11                           2023.09.01  h5a48ba9_2           conda-forge     Cached
  + libprotobuf                             4.25.3  h08a7969_0           conda-forge     Cached
  + libcurl                                 8.10.0  hbbe4b11_0           conda-forge      425kB
  + aws-c-event-stream                       0.4.3  h235a6dd_1           conda-forge       54kB
  + aws-c-http                               0.8.9  h5e77a74_0           conda-forge      198kB
  + libopenblas                             0.3.27  pthreads_hac2b453_1  conda-forge     Cached
  + re2                                 2023.09.01  h7f4b329_2           conda-forge     Cached
  + orc                                      2.0.2  h669347b_0           conda-forge        1MB
  + azure-core-cpp                          1.13.0  h935415a_0           conda-forge      338kB
  + aws-c-mqtt                              0.10.5  h0009854_0           conda-forge      194kB
  + aws-c-auth                              0.7.30  hec5e740_0           conda-forge      107kB
  + libblas                                  3.9.0  23_linux64_openblas  conda-forge     Cached
  + libgrpc                                 1.62.2  h15f2491_0           conda-forge     Cached
  + azure-identity-cpp                       1.8.0  hd126650_2           conda-forge      200kB
  + azure-storage-common-cpp                12.7.0  h10ac4d7_1           conda-forge      143kB
  + aws-c-s3                                 0.6.5  hbaf354b_4           conda-forge      113kB
  + libcblas                                 3.9.0  23_linux64_openblas  conda-forge     Cached
  + liblapack                                3.9.0  23_linux64_openblas  conda-forge     Cached
  + libgoogle-cloud                         2.29.0  h435de7b_0           conda-forge        1MB
  + azure-storage-blobs-cpp                12.12.0  hd2e3451_0           conda-forge      523kB
  + aws-crt-cpp                             0.28.2  h6c0439f_6           conda-forge      350kB
  + numpy                                    2.1.1  py312h58c1407_0      conda-forge        8MB
  + libgoogle-cloud-storage                 2.29.0  h0121fbd_0           conda-forge      782kB
  + azure-storage-files-datalake-cpp       12.11.0  h325d260_1           conda-forge      274kB
  + aws-sdk-cpp                           1.11.379  h5a9005d_9           conda-forge        3MB
  + libarrow                                17.0.0  hc80a628_14_cpu      conda-forge        9MB
  + libarrow-acero                          17.0.0  h5888daf_14_cpu      conda-forge      608kB
  + libparquet                              17.0.0  h39682fd_14_cpu      conda-forge        1MB
  + pyarrow-core                            17.0.0  py312h9cafe31_1_cpu  conda-forge        5MB
  + libarrow-dataset                        17.0.0  h5888daf_14_cpu      conda-forge      585kB
  + libarrow-substrait                      17.0.0  hf54134d_14_cpu      conda-forge      550kB
  + pyarrow                                 17.0.0  py312h9cebb41_1      conda-forge       26kB

  Summary:

  Install: 93 packages

  Total download: 66MB

─────────────────────────────────────────────────────────────────────────────────────────────────

Confirm changes: [Y/n] 

As a maintainer of a library which has nothing to do with cloud computing, why on earth would I want to have AWS AND Azure libraries as transitive dependencies? Even if I were doing cloud stuff, I'd probably be working with one of them, not both. That's an insane amount of bloatware installed when pulling pyarrow from conda-forge.

On top of that, there was the time when pyarrow simply gave up on PyPI and others had to step in, due to C++ compatibility issues. I understand the challenges of PyPI, but this doesn't give me confidence.

And the cherry on top is your employer firing pyarrow maintainers, including some of my friends, who had been working on the project for a while. Not only does that not make me want to have the lib as a dependency, it also doesn't give me confidence in the future of the project.

lostmygithubaccount commented 1 month ago

But then it became a massive piece of software, the kind of dependency you'd only want to add if you really have to; otherwise, I'd rather keep environments lightweight.

out of curiosity, what's your bar for a lightweight Python environment? some number of MBs?

I don't personally use conda, but it does seem like you can get similarly sized installations as with PyPI: https://arrow.apache.org/docs/python/install.html#python-conda-differences

I know there are ongoing efforts to reduce the installation size of PyArrow further (and reduce dependencies)

Cheukting commented 1 month ago

@MarcoGorelli this looks fun :-) Do you think this is ready to accept a PR? May I give it a go during the sprint tomorrow?

MarcoGorelli commented 1 month ago

Thanks @Cheukting ! I'm not 100% sure about this one, as I'd originally misunderstood the Fairlearn use case. Maybe we can punt on it for the time being

We'll open some issues later which we've reserved specially for the sprint though so there'll be plenty of interesting things to work on 😎

Cheukting commented 1 month ago

Ok, thanks @MarcoGorelli but in the future if this is needed I am happy to help too

jonmmease commented 3 weeks ago

Would it be feasible for from_dict to optionally accept an existing narwhals DataFrame as an argument, and then use that same backend for the returned DataFrame? Something like:

values_df = nw.from_dict({'a': [1, 2, 3], 'b': [4,5,6]}, like=other_df)

Or perhaps there is a better representation of a backend that could be used instead of the DataFrame itself.

The motivation would be to make it easy to have a function that accepts and returns a Narwhals DataFrame using the same backend, but where the resulting DataFrame isn't computed directly from the input DataFrame.

I can create a separate issue for this, but for my actual use case in VegaFusion, I'd actually want something like nw.from_arrow_capsule(cap, like=other_df). The flow is that I'd like to use Narwhals for basic column projection and schema inspection, and then use the Arrow PyCapsule API to pass the result to Rust. Then, in some cases, the Rust logic will return a new Arrow result in PyCapsule form, and it would be great to be able to use Narwhals to wrap that result using the same backend as the input.
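
A rough sketch of that flow, purely illustrative: from_arrow_capsule and some_rust_function are hypothetical, and __arrow_c_stream__ support depends on the native backend:

import narwhals as nw

def roundtrip(df: nw.DataFrame) -> nw.DataFrame:
    # Column projection / schema inspection via narwhals
    selected = df.select("a", "b")
    # Hand the result to Rust via the Arrow PyCapsule interface
    # (assumes the native backend implements __arrow_c_stream__)
    capsule = selected.to_native().__arrow_c_stream__()
    result_capsule = some_rust_function(capsule)  # hypothetical Rust binding
    # Wrap the Arrow result using the same backend as the input
    return nw.from_arrow_capsule(result_capsule, like=df)  # hypothetical API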

MarcoGorelli commented 3 weeks ago

Hey @jonmmease

If I've understood the request, I think you can do this already:

In [11]: import polars as pl

In [12]: import narwhals as nw

In [13]: other_df = nw.from_native(pl.DataFrame({'a': [2, 3]}))

In [14]: values_df = nw.from_dict({'a': [1, 2, 3], 'b': [4,5,6]}, native_namespace=nw.get_native_namespace(other_df))

In [15]: values_df
Out[15]:
┌───────────────────────────────────────┐
| Narwhals DataFrame                    |
| Use `.to_native` to see native output |
└───────────────────────────────────────┘

In [16]: values_df.to_native()
Out[16]:
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘
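
As a usage note, this pattern generalises to a small helper; a hedged sketch (from_dict_like is a made-up name, not a narwhals API):

import narwhals as nw

def from_dict_like(data: dict, like: nw.DataFrame) -> nw.DataFrame:
    # Build a new narwhals DataFrame from `data` on the same backend as `like`.
    return nw.from_dict(data, native_namespace=nw.get_native_namespace(like))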
jonmmease commented 3 weeks ago

Ah, nice. I had missed the native_namespace bit. I saw this issue and assumed that from_dict was new, but now I understand that it's already the existing behavior. Sorry for the noise!