py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/
MIT License

narwhals = Pandas + polars + ... #533

Open juanitorduz opened 3 months ago

juanitorduz commented 3 months ago

Use https://github.com/narwhals-dev/narwhals to support pandas and polars!

This seems to be a very cool alternative to support various backends. See for example https://github.com/koaning/scikit-lego/pull/671

MarcoGorelli commented 3 months ago

Hey, just wanted to stop by and say - thanks for your interest! Feel free to book some time on https://calendly.com/marcogorelli if you'd like to chat about how Narwhals could help PyFixest

s3alfisc commented 3 months ago

Hi both (@MarcoGorelli and @juanitorduz) - I've now thought about it for 15 minutes and I think narwhals might be a great solution for PyFixest! Thanks for offering to chat, @MarcoGorelli, I'll book an appointment =)

Just some background on pyfixest and how it works with DataFrames: most of the data manipulation happens via the formulaic library, which requires a pd.DataFrame as input. A typical flow looks like this:

%load_ext autoreload
%autoreload 2

import polars as pl
import pandas as pd
import pyfixest as pf

from formulaic import model_matrix
import narwhals as nw

data = pl.DataFrame(pf.get_data())

def feols(data):

    if isinstance(data, pl.DataFrame):
        data = data.to_pandas()

    # model_matrix requires a pandas DataFrame and returns a pandas DataFrame
    Y, X = model_matrix("Y ~ X1", data = data, output = "pandas")

    # some more pandas manipulations
    Y.dropna(inplace = True)
    X.dropna(inplace = True)

    return Y.to_numpy(), X.to_numpy()

With narwhals, it could look like this:

def feols_nw(data, use_polars = False):

    # wrap the native (pandas or polars) DataFrame in a narwhals DataFrame
    data = nw.from_native(data)

    # model_matrix requires a pandas DataFrame and returns a pandas DataFrame
    Y, X = model_matrix("Y ~ X1", data = data.to_pandas(), output = "pandas")

    if use_polars:
        # another copy? potentially costly? 
        Y = nw.from_native(Y)
        X = nw.from_native(X)
        # narwhals DataFrames expose drop_nulls instead of pandas' dropna
        Y = Y.drop_nulls()
        X = X.drop_nulls()
    else:
        # some more pandas manipulations
        Y.dropna(inplace = True)
        X.dropna(inplace = True)

    return Y.to_numpy(), X.to_numpy()

MarcoGorelli commented 3 months ago

Hey! Thanks for your explanation - if formulaic specifically requires pandas input/output, then that might be a good candidate for Narwhalification :) I'll take a look, thanks!

    # another copy? potentially costly?
    Y = nw.from_native(Y)

Just to clarify, from_native just wraps a dataframe in a narwhals.DataFrame. It's a virtually free operation: it only takes a few microseconds and doesn't make any copies. Narwhals only translates syntax.

juanitorduz commented 3 months ago

Naive question: It seems formulaic supports pyarrow.Table. Could this be a shortcut for Polars integration? https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.to_arrow.html

MarcoGorelli commented 3 months ago

totally!