matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Retain Column Names for sparse model matrices #153

Closed s3alfisc closed 9 months ago

s3alfisc commented 9 months ago

Hi @matthewwardrop, this is (maybe) a feature request =). Is it currently possible to obtain column names for sparse model matrices? I could not find anything in the docs, and sparse formulaic.model_matrix.ModelMatrix instances do not seem to have a columns attribute.

E.g. if I create the following matrix, is there any way to get the column names that output = "pandas" would create?

from formulaic import model_matrix
import pandas as pd
import numpy as np

a = np.random.choice(["a", "b"], 10, replace=True)
df = pd.DataFrame({"a": a})

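# Sparse output: the result has no .columns attribute.
mm = model_matrix("a", df, output="sparse")

# Pandas output: the column names are available directly.
mm_pd = model_matrix("a", df, output="pandas")
mm_pd.columns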
# Index(['Intercept', 'a[T.b]'], dtype='object')

My personal use case in which this becomes relevant is high-dimensional fixed effects regression of the type $Y = X\beta + D\alpha + u$, where in a first step one estimates $\beta$ on demeaned $Y$ and $X$ (via the Frisch-Waugh theorem) and then obtains an estimate of the fixed effects $\hat{\alpha}$ by solving the sparse system $(Y - X\hat{\beta}) = D\alpha$.
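The two-step procedure described above can be sketched roughly as follows. This is only an illustration on simulated data with a single fixed effect, not code from formulaic: the variable names, the simulated data, and the choice of `scipy.sparse.linalg.lsqr` as the sparse solver are all assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
n, n_groups = 200, 5

# Simulated data (hypothetical): one fixed effect with n_groups levels,
# assigned round-robin so that every level has the same number of rows.
group = np.arange(n) % n_groups
alpha_true = rng.normal(size=n_groups)
beta_true = np.array([1.5, -2.0])
X = rng.normal(size=(n, 2)) + alpha_true[group][:, None]  # X correlated with the fixed effects
Y = X @ beta_true + alpha_true[group] + 0.01 * rng.normal(size=n)

# D: sparse dummy matrix with one column per fixed-effect level.
D = sparse.csr_matrix((np.ones(n), (np.arange(n), group)), shape=(n, n_groups))
counts = np.bincount(group, minlength=n_groups)


def demean(v):
    """Subtract the group mean from each row (one-way within transformation)."""
    group_sums = D.T @ v
    means = group_sums / (counts[:, None] if v.ndim == 2 else counts)
    return v - D @ means


# Step 1: estimate beta on demeaned Y and X (Frisch-Waugh).
beta_hat = np.linalg.lstsq(demean(X), demean(Y), rcond=None)[0]

# Step 2: recover the fixed effects by solving (Y - X beta_hat) = D alpha
# as a sparse least-squares problem.
alpha_hat = lsqr(D, Y - X @ beta_hat)[0]
```

The entries of `alpha_hat` are ordered like the columns of `D`, which is why the column names of a sparse model matrix matter here: without them the estimated fixed effects cannot be mapped back to their levels.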

Finally, once again - thanks for a wonderful package! 😄

Best, Alex

matthewwardrop commented 9 months ago

Hi Alex!

Thanks for your kind words!

The column names (and much more) are available on the model spec; e.g. mm.model_spec.column_names. You can read more about this here: https://matthewwardrop.github.io/formulaic/guides/model_specs/ .

Let me know if you need more guidance, and perhaps where you looked in the documentation, so I know where to add a few more cross-links.

s3alfisc commented 9 months ago

Sorry for taking a while to respond - I have to admit that I was a little embarrassed that I overlooked the part on 'model_spec' in the docs. Thanks for pointing me to it so quickly!

I mostly looked through "QuickStart" and "how it works" and only read halfway through the "model specs" section. Sometimes I code late in the evenings, I guess that might explain why I overlooked the relevant section? 😀

To help late-night coders, maybe you could consider adding a link to the model specs section at the end of "QuickStart"? Then I would certainly have found it 😅

Another potential way to help users navigate the docs could be to make them searchable via a search bar (though I am not sure how much effort that is).

matthewwardrop commented 9 months ago

Thanks for the suggestions! And yes, I forgot to re-enable search! It's a one-line patch!

matthewwardrop commented 9 months ago

I've updated the docs with these quick fixes. Thanks again!