py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/
MIT License
176 stars 35 forks source link

`_narwhals_to_pandas` #700

Closed vincentarelbundock closed 1 week ago

vincentarelbundock commented 1 week ago

This is proof of concept to partially solve #533. It is a minimal change that swaps the existing _polars_to_pandas() by a _narwhals_to_pandas().

Let me know if you want me to continue with this approach, and what you would expect in a merge-ready PR.

Here’s an example where pyfixest ingests a DuckDB table for both fitting and prediction:

import pyfixest as pf
import duckdb

data = pf.get_data()
data = duckdb.query("SELECT * FROM data")

type(data)

    <class 'duckdb.duckdb.DuckDBPyRelation'>
fit = pf.feols("Y ~ X1 | f1 + f2", data=data)

fit.predict(newdata = data)[:10]
[ 2.20341554         nan         nan  3.23931955 -1.44371374 -1.29643938
 -1.86119629 -1.34576434  1.38234328 -2.24958648]
fit.summary()
Estimation:  OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.919 |        0.065 |   -14.057 |      0.000 | -1.053 |  -0.786 |
---
RMSE: 1.441 R2: 0.609 R2 Within: 0.2 
None
s3alfisc commented 1 week ago

pre-commit.ci autofix

s3alfisc commented 1 week ago

This looks really cool! I love that narwhals handles not only polars but also duckdb. Thank you =)

I actually think this is 95% ready to be merged. Maybe we could add a small API test here that verifies that pf.feols() returns identical results, irrespective of the input data frame type. Additionally, would we have to change the type description for the input data frame type for the pf.feols and pf.fepois functions? It currently uses a custom DataFrameType and asks the linter not to check - I would think that narwhals might support a type hint natively?

s3alfisc commented 1 week ago

Took a quick look at the narwhals docs and the relevant type hint seems to be narwhals.typing.IntoDataFrame:

from __future__ import annotations

import narwhals as nw
from narwhals.typing import IntoDataFrame

def func(df_native: IntoDataFrame) -> tuple[int, int]:
    df = nw.from_native(df_native, eager_only=True)
    return df.shape
s3alfisc commented 1 week ago

All the test errors are struggles or rpy2, which happens sometimes. I'll take a look at these @vincentarelbundock =)

vincentarelbundock commented 1 week ago

Excellent! I changed the type hint and added a small test.

Don't know what your policy is for changelog and version number while in dev. happy to make a change if you indicate or let you do it yourself in my branch before/after merge.

s3alfisc commented 1 week ago

Don't know what your policy is for changelog and version number while in dev. happy to make a change if you indicate or let you do it yourself in my branch before/after merge.

Usually we don't bump versions when merging from dev to main, which I suppose is bad practice? Changelog are automatically handled by a release bot, which automatically updates the github release notes.

Do you have a recommendation on how to potentially improve these processes?

s3alfisc commented 1 week ago

pre-commit.ci autofix

vincentarelbundock commented 1 week ago

Usually we don't bump versions when merging from dev to main, which I suppose is bad practice?

IIUC, semantic versioning allows labels like 1.0.0-dev, 1.0.0-dev.1, 1.0.0-alpha. I think that's useful for developers when they need to check if they are running pypi or local install. In my own packages, I'm not super strict about this, but I do find the dev label useful.

Changelog are automatically handled by a release bot, which automatically updates the github release notes.

Github release are really cool, but they feel a bit developer-centric to me; users have to wade through a lot of internal housekeeping stuff and commit messages before getting to the good stuff.

I like to write my own NEWS file and treat it as a user-facing document, emphasizing breaking changes, new features, and qualitative description of bug reports. But that obviously requires some discipline, and if we don't have that, then the NEWS quickly becomes useless.

Not a big deal either way.

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pyfixest/estimation/feols_.py 50.00% 1 Missing :warning:
Flag Coverage Δ
core-tests 77.82% <92.30%> (+0.14%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
pyfixest/estimation/FixestMulti_.py 80.20% <100.00%> (ø)
pyfixest/estimation/estimation.py 96.96% <100.00%> (+1.09%) :arrow_up:
pyfixest/estimation/vcov_utils.py 51.42% <100.00%> (ø)
pyfixest/utils/dev_utils.py 84.21% <100.00%> (+1.49%) :arrow_up:
pyfixest/estimation/feols_.py 83.24% <50.00%> (+0.25%) :arrow_up:
s3alfisc commented 1 week ago

@all-contributors please add @vincentarelbundock for code @all-contributors please add @MarcoGorelli for review

allcontributors[bot] commented 1 week ago

@s3alfisc

I've put up a pull request to add @vincentarelbundock! :tada:

s3alfisc commented 1 week ago

@all-contributors please add @MarcoGorelli for review

allcontributors[bot] commented 1 week ago

@s3alfisc

I've put up a pull request to add @MarcoGorelli! :tada: