matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Formulaic not raising an exception when required fields are missing in the dataset #157

Closed hguturu closed 9 months ago

hguturu commented 9 months ago

I am trying to make a design matrix from a master matrix of parameters.

import pandas as pd
import formulaic

all_phenotypes = pd.DataFrame({"(AltGrp)": [1, 0, 0, 1, 0, 1], "BinGrp": [0, 0, 0, 1, 1, 1], "ContGrp": [1, 2, 3, 4, 5, 6]})

design = formulaic.model_matrix(["(AltGrp) + BinGrp"], all_phenotypes)

yields

   BinGrp
0       0
1       0
2       0
3       1
4       1
5       1

I assume this is due to the () in (AltGrp). Are there other special characters that should be avoided in column names? This fails silently, so I want to avoid passing in the wrong matrix in the future.

matthewwardrop commented 9 months ago

Hi @hguturu,

Parentheses in formulae have special meaning (they are grouping operators that control the order of operations). You can refer to the formula grammar docs for more info. You'll also find there how to quote special characters that should be included in field names; for example:

In [12]: all_phenotypes = pd.DataFrame({"(AltGrp)": [1, 0, 0, 1, 0, 1], "BinGrp": [0, 0, 0, 1, 1, 1], "ContGrp": [1, 2, 3, 4, 5, 6]})
    ...:
    ...: design = formulaic.model_matrix(["`(AltGrp)` + BinGrp"], all_phenotypes)

In [13]: design
Out[13]:
   (AltGrp)  BinGrp
0         1       0
1         0       0
2         0       0
3         1       1
4         0       1
5         1       1
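
If formulas are assembled programmatically from column names, the backtick quoting shown above could be applied automatically. The following is only a minimal sketch: `quote_field` and `build_formula` are hypothetical helpers, not part of formulaic, and the rule "quote anything that is not a plain Python identifier" is an assumption that may be stricter than the formula grammar requires.

```python
import keyword


def quote_field(name: str) -> str:
    """Wrap a column name in backticks unless it is a plain identifier.

    Hypothetical helper (not part of formulaic). Assumes that quoting more
    names than strictly necessary is harmless.
    """
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    if "`" in name:
        # Escaping a literal backtick is out of scope for this sketch.
        raise ValueError(f"Cannot quote field name containing a backtick: {name!r}")
    return f"`{name}`"


def build_formula(fields) -> str:
    """Join field names into an additive formula string, quoting as needed."""
    return " + ".join(quote_field(field) for field in fields)


formula = build_formula(["(AltGrp)", "BinGrp"])
print(formula)  # `(AltGrp)` + BinGrp
```

The resulting string can then be passed to `formulaic.model_matrix` in place of the hand-written formula.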

However, there is a bug here: in your original formula, AltGrp is not found in the data set, yet no exception is raised. This is a regression, so I'll make sure it gets fixed.
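
In the meantime, a defensive check before building the design matrix can catch this kind of silent omission. This is a rough sketch rather than anything formulaic provides: `missing_columns` and `require_columns` are hypothetical names, and the caller is assumed to know which field names the formula is expected to reference.

```python
def missing_columns(required, available):
    """Return the required field names that are absent from the data set."""
    available = set(available)
    return [name for name in required if name not in available]


def require_columns(required, available):
    """Raise KeyError if any required field name is absent from the data set."""
    missing = missing_columns(required, available)
    if missing:
        raise KeyError(f"Fields missing from data set: {missing}")


# Column names from the example data set in this thread.
columns = ["(AltGrp)", "BinGrp", "ContGrp"]

# The quoted formula refers to the column literally named "(AltGrp)".
require_columns(["(AltGrp)", "BinGrp"], columns)

# The unquoted formula refers to "AltGrp", which does not exist.
try:
    require_columns(["AltGrp", "BinGrp"], columns)
except KeyError as exc:
    print(exc)
```

With a pandas DataFrame, `all_phenotypes.columns` can be passed as `available`.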

matthewwardrop commented 9 months ago

Ah... I see you opened a separate issue about this anyway (#159). Closing this one in favour of that.