matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Add narwhals materializer for dataframe agnosticism. #189

Open matthewwardrop opened 3 months ago

matthewwardrop commented 3 months ago

This patch adds initial support for narwhals. It's... patchy.

You can try it out using:

import narwhals as nw
import pandas as pd
from formulaic import model_matrix

df = pd.DataFrame({"a": [1,2,None], "b": list("abc"), "c": pd.Categorical(list("abc"))})
model_matrix("a + b + c", df, materializer='narwhals', na_action="ignore")

import polars
model_matrix("a + b + c", polars.from_pandas(df), materializer='narwhals', na_action="ignore")

Note: The polars backend panics when na_action is anything other than "ignore".
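For readers unfamiliar with the option: formulaic's na_action accepts "drop", "raise", and "ignore". The following is only a rough pure-Python sketch of what those three choices mean for rows with missing values. The helper apply_na_action and its dict-of-lists input are hypothetical illustrations, not formulaic's or narwhals' actual implementation.

```python
from typing import Any, Dict, List


def apply_na_action(
    columns: Dict[str, List[Any]], na_action: str = "drop"
) -> Dict[str, List[Any]]:
    """Hypothetical helper: handle rows containing None according to `na_action`."""
    n_rows = len(next(iter(columns.values()))) if columns else 0

    # Indices of rows in which at least one column holds a missing value.
    null_rows = {
        i
        for values in columns.values()
        for i, value in enumerate(values)
        if value is None
    }

    if na_action == "ignore":
        # Leave missing values in place; downstream code must cope with them.
        return columns
    if na_action == "raise":
        if null_rows:
            raise ValueError(f"missing values found in rows {sorted(null_rows)}")
        return columns
    if na_action == "drop":
        # Keep only the rows that are complete across every column.
        keep = [i for i in range(n_rows) if i not in null_rows]
        return {name: [values[i] for i in keep] for name, values in columns.items()}
    raise ValueError(f"unknown na_action: {na_action!r}")


data = {"a": [1, 2, None], "b": ["a", "b", "c"]}
complete = apply_na_action(data, "drop")
```

With the data above, "drop" removes the third row, whereas "ignore" passes the None through unchanged, which is the only mode the polars path currently survives.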

There are a lot of hacks here, including fallbacks to pandas objects in places. We also still want sparse materialisation to work, and I don't think other backends support sparse data well enough to replace scipy sparse matrices.
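To illustrate why sparse output matters for categorical terms, here is a minimal sketch of one-hot encoding a column into coordinate (row, column, value) triplets. The function one_hot_coo is a hypothetical helper written for this example. It uses a full encoding with no reference level dropped, and it is not how formulaic materialises sparse columns.

```python
from typing import List, Sequence, Tuple


def one_hot_coo(
    values: Sequence[str],
) -> Tuple[List[str], List[int], List[int], List[float]]:
    """Hypothetical helper: one-hot encode `values` as COO-style triplets."""
    # Sorted unique levels define the column order of the encoded matrix.
    levels = sorted(set(values))
    level_index = {level: j for j, level in enumerate(levels)}

    # Each row has exactly one non-zero entry, in the column of its level.
    rows = list(range(len(values)))
    cols = [level_index[value] for value in values]
    data = [1.0] * len(values)
    return levels, rows, cols, data


levels, rows, cols, data = one_hot_coo(list("abca"))
```

Only len(values) entries are stored regardless of how many levels there are. If scipy is available, the triplets could be handed to scipy.sparse.coo_matrix((data, (rows, cols)), shape=(len(rows), len(levels))).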

matthewwardrop commented 3 months ago

As discussed in #187, here's an initial PR for your review, @MarcoGorelli. It's definitely not ready to replace the default backend, but it isn't too far off either.

matthewwardrop commented 3 months ago

@MarcoGorelli Sorry, just in case it wasn't entirely clear from our conversation in #187 and my tagging you here, I was hoping you could take a look and suggest how to fill the gaps and/or improve things. No worries if you do not have time for now.

MarcoGorelli commented 3 months ago

Thanks for the ping! Yup, definitely taking a look, just had some holiday recently 🌴

MarcoGorelli commented 2 months ago

hey - quick update, I've got most of it working in a branch, but am a bit busy with some conferences now - will get back to it soon. realistically I hope to have something review-ready in October 🤞 this has been quite interesting to work on!