matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

How to include structural zeros? #152

Open windisch opened 1 year ago

windisch commented 1 year ago

What's the preferred way to model structural zeros in a Formula?

Assume the following toy example: I have a $3\times 2$ contingency table that looks like this:

     e  f
a    1  0
b    2  3
c    4  0

given as a pandas dataframe as follows:

import pandas as pd

# One row per cell of the contingency table; n is the cell count
df = pd.DataFrame(
    data={
        'F1': ['a', 'a', 'b', 'b', 'c', 'c'],
        'F2': ['e', 'f', 'e', 'f', 'e', 'f'],
        'n': [1, 0, 2, 3, 4, 0],
    })

The combinations $(a, f)$ and $(c, f)$ are structural zeros (i.e., it's impossible to have non-zero values in these cells). Now, assume I want to fit the model n ~ C(F1):C(F2) on that data as follows:

from formulaic import Formula

y, X = Formula('n ~ C(F1):C(F2)').get_model_matrix(df, ensure_full_rank=False)

Then the corresponding variables C(F1)[T.a]:C(F2)[T.f] and C(F1)[T.c]:C(F2)[T.f] appear as columns of X. Is there a way to exclude these parameters directly in the formula? Or is there another concept in formulaic for dealing with this type of constraint?

matthewwardrop commented 1 year ago

Hi @windisch ,

Apologies for the delay in my response. Life has been pretty hectic of late.

At present, there is no way to handle this in Formulaic (short of deleting these columns after the model matrix is created). Is there precedent for supporting this kind of transformation in other formula implementations? (This isn't a prerequisite for including it in Formulaic, but it does help to think through how others have solved this issue.)
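For reference, the workaround of deleting the columns afterwards could look roughly like the sketch below. To keep it runnable with pandas alone, `X` here is a hand-built stand-in for the model matrix rather than the output of formulaic. Its column names follow the pattern quoted in the question; the four names not quoted there are extrapolated and may differ from what formulaic actually emits. The list `structural_zero_columns` is a hypothetical name introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "F1": ["a", "a", "b", "b", "c", "c"],
        "F2": ["e", "f", "e", "f", "e", "f"],
        "n": [1, 0, 2, 3, 4, 0],
    }
)

# Stand-in for the model matrix X (assumption: one 0/1 indicator column per
# (F1, F2) cell, named after the pattern quoted in the question).
X = pd.DataFrame(
    {
        f"C(F1)[T.{i}]:C(F2)[T.{j}]": ((df["F1"] == i) & (df["F2"] == j)).astype(int)
        for i in ["a", "b", "c"]
        for j in ["e", "f"]
    }
)

# Hypothetical list of the columns that belong to structural zeros.
structural_zero_columns = [
    "C(F1)[T.a]:C(F2)[T.f]",
    "C(F1)[T.c]:C(F2)[T.f]",
]

# Delete those columns by name once the matrix has been built.
X_reduced = X.drop(columns=structural_zero_columns)
print(list(X_reduced.columns))
```

The rows for $(a, f)$ and $(c, f)$ are still present in `X_reduced` but are now all zero, so depending on the model you may want to drop those rows (and the matching entries of `y`) as well.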

If we were to add support for this, I think the easiest approach would be to generate the matrix as is, and then remove any columns that are identically zero. This does mean that some unnecessary work is done, which is a little inelegant... but I'm not sure it makes sense to pass around richer metadata than this. Of course, that means it could just as easily be done outside formulaic too.
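A minimal sketch of that idea, done outside formulaic, is shown below. One caveat it tries to make explicit: in the toy data from the question, the rows for $(a, f)$ and $(c, f)$ are present with `n = 0`, so their indicator columns contain a 1 and are not identically zero. The sketch therefore assumes the structural-zero rows are removed from the data first. The helper `drop_all_zero_columns` and the set `structural_zeros` are hypothetical names, and the matrix is a pandas-only stand-in built with `pd.get_dummies` rather than formulaic's own output.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "F1": ["a", "a", "b", "b", "c", "c"],
        "F2": ["e", "f", "e", "f", "e", "f"],
        "n": [1, 0, 2, 3, 4, 0],
    }
)

# Assumption: the structural zeros are known up front.
structural_zeros = {("a", "f"), ("c", "f")}


def drop_all_zero_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Return X without the columns whose entries are all zero."""
    nonzero = (X != 0).any(axis=0).to_numpy()
    return X.loc[:, nonzero]


# Remove the rows that correspond to structural zeros.
keep = [pair not in structural_zeros for pair in zip(df["F1"], df["F2"])]
observed = df[keep]

# Stand-in for the interaction model matrix: one indicator column per cell,
# including cells that no longer occur in the data.
all_cells = [f"{i}:{j}" for i in ["a", "b", "c"] for j in ["e", "f"]]
cells = pd.Categorical(observed["F1"] + ":" + observed["F2"], categories=all_cells)
X = pd.get_dummies(cells, dtype=int)

# The columns for a:f and c:f are now identically zero and get removed.
X_reduced = drop_all_zero_columns(X)
print(list(X_reduced.columns))
```

The same helper should in principle apply to the `X` returned by `get_model_matrix` when it is pandas-backed, though that is not verified here.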

In an ideal world, what would you like to see done?