py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/pyfixest.html
MIT License
118 stars 27 forks source link

Interface with numpy arrays #447

Open apoorvalal opened 1 month ago

apoorvalal commented 1 month ago

Would be good to have a fixest analogue of sm.OLS that lets one use raw matrices and sidestep the need to construct data frames. This would not do the current (admittedly quite user-friendly but not very OOP) thing of fitting under the hood when initialized with pf.feols but would instead have modular .fit(), .summary() etc methods.

Is this currently possible and not documented?

s3alfisc commented 1 month ago

To some degree this is already currently possible, but not documented as a) I am not directly testing the API and b) because it has some quirks and is far from user friendly.

But yes, in principle, there is a Feols() class that only works on numpy arrays. In the current design, you would have to do the "demeaning" outside of Feols, but this could easily be changed. The Feiv and Fepois classes inherit from Feols.

This is how it works under the hood at the moment:

In a first step, we load all required functions and classes and create some data:

import numpy as np
from pyfixest.estimation.feols_ import Feols
from pyfixest.estimation.demean_ import demean
from pyfixest.estimation.feols_ import _get_vcov_type

N = 1000
X = np.random.normal(0, 1, N*2).reshape(N, 2)
f = np.random.choice(range(100), N, True).reshape(N, 1)
Y = X @  np.array([1,1]).reshape((2,1)) + f + np.random.normal(0, 1, N).reshape(N, 1)

# currently required inputs to Feols, should all be optional
weights = np.ones(N)
coefnames = ['x1', 'x2']
collin_tol = 1e-10
weights_name = None
weights_type = "fweights"

Now we need to demean Y and X:

# weights needs to be 1d for demean ... 
Y_d, _ = demean(Y, f, weights)
X_d, _ = demean(X, f, weights)

# weights needs to be 2d for Feols ... 
weights = weights.reshape((N,1))

We can then pass the demeaned data (and a lot of additional arguments) to the Feols class:

Fit = Feols(Y = Y_d, X = X_d, weights = weights, coefnames = coefnames, collin_tol = 1e-10, weights_name = weights_name, weights_type = weights_type)
Fit
# <pyfixest.estimation.feols_.Feols at 0x1af06c7ae90>

With this at hand, we can fit the model via the get_fit() method:

Fit.get_fit()
Fit._beta_hat
# array([0.9968034 , 1.01047411])

We can compute the vcov matrix and get CIs:

vcov_type = _get_vcov_type(vcov = "iid", fval = "f1")
fit.vcov(vcov = vcov_type)
fit.get_inference()

and can finally call .tidy():

fit.tidy()

I think I'd have to implement the following changes to really promote users to rely on Feols:

All of this sounds certainly doable. Given that you've basically implemented the Gelbach-D already, I think I might be able to tackle this soon =)