matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Formulaic struggles with NAs and `poly()` syntax #150

Closed by s3alfisc 9 months ago

s3alfisc commented 10 months ago

Hi @matthewwardrop,

I think I've encountered a bug: model_matrix struggles when a covariate contains an np.nan value and poly() is used in the formula.

Here are two (hopefully) reproducible examples:

In the first example, there is an np.nan in a covariate that is not wrapped in poly(), which raises an AttributeError on a NumPy array. In the second example, there is an np.nan in the covariate wrapped in poly(), which causes every row of the polynomial expansion to be populated with NaN (instead of the rows with missing values being dropped).

When there are no NA values, poly() works as expected.

Beyond this, I wanted to say that formulaic is a fantastic package, and I thoroughly enjoy building on top of it! =)

```python
import pandas as pd # installed version: 1.5.3
import numpy as np # installed version: 1.22.4
from formulaic import model_matrix # installed version: 0.6.0

N = 10
rng = np.random.default_rng(10)
Y = rng.normal(0, 1, N)
X1 = rng.normal(0, 1, N)
X2 = rng.normal(0, 1, N)
X1[0] = np.nan

df = pd.DataFrame({'Y':Y, 'X1':X1, 'X2':X2})

model_matrix("Y ~ X1 + poly(X2, 2)", data=df)
# AttributeError: 'numpy.ndarray' object has no attribute 'drop'

df = df.dropna()
df.loc[1, "X2"] = np.nan
model_matrix("Y ~ X1 + poly(X2, 2)", data=df)
#.lhs:
#              Y
#    1 -0.725025
#    2 -0.781805
#    3  0.266976
#    4 -0.248581
#    5  0.126483
#    6  0.843043
#    7  0.857937
#    8  0.475184
#    9 -0.450769
#.rhs:
#       Intercept        X1  poly(X2, 2)[1]  poly(X2, 2)[2]
#    1        1.0 -0.814814             NaN             NaN
#    2        1.0 -0.343855             NaN             NaN
#    3        1.0 -0.051380             NaN             NaN
#    4        1.0 -0.972274             NaN             NaN
#    5        1.0 -1.134488             NaN             NaN
#    6        1.0  0.305705             NaN             NaN
#    7        1.0 -1.851685             NaN             NaN
#    8        1.0 -0.177054             NaN             NaN
#    9        1.0  0.425826             NaN             NaN
```
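
A note on the second output: one plausible explanation (an assumption, not a confirmed diagnosis of formulaic's internals) is that the polynomial transform centers the column using statistics computed over all rows, so a single NaN propagates into every row. The NumPy-only sketch below illustrates that propagation and the listwise deletion the report expects instead; the array `x` and the variable names are made up for illustration.

```python
import numpy as np

# Hypothetical covariate with one missing observation.
x = np.array([0.5, np.nan, -1.2, 0.3, 2.0])

# Naive centering: the mean of an array containing NaN is NaN,
# so subtracting it turns every entry into NaN.
centered_naive = x - x.mean()

# Listwise deletion first: only complete observations enter the mean.
mask = ~np.isnan(x)
centered_clean = x[mask] - x[mask].mean()

print(centered_naive)
print(centered_clean)
```

The same reasoning applies to any column-wide statistic (norms, QR factors) used to build an orthogonal basis, which is why missing rows generally need to be removed before such a transform is fitted.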
matthewwardrop commented 10 months ago

Thanks so much for taking the time to report this, @s3alfisc! I'll try to get it fixed in the next patch release!

matthewwardrop commented 9 months ago

Hi @s3alfisc! This should now all work properly in 0.6.5. Let me know if you run into any further troubles!
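
For anyone who cannot upgrade to 0.6.5 yet, a possible workaround is to drop incomplete rows with pandas before building the model matrix. This is only a sketch under assumptions: the helper `complete_cases` is hypothetical (it is not part of formulaic), and the data mirrors the reproduction from the report.

```python
import numpy as np
import pandas as pd


def complete_cases(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Hypothetical helper: keep only rows with no missing value in `columns`."""
    return df.dropna(subset=columns)


# Same shape of data as in the original report, with NaNs in both covariates.
rng = np.random.default_rng(10)
df = pd.DataFrame({
    "Y": rng.normal(0, 1, 10),
    "X1": rng.normal(0, 1, 10),
    "X2": rng.normal(0, 1, 10),
})
df.loc[0, "X1"] = np.nan
df.loc[1, "X2"] = np.nan

# Rows 0 and 1 are removed; the remaining rows keep their original index.
clean = complete_cases(df, ["Y", "X1", "X2"])
print(clean.index.tolist())
```

The cleaned frame can then be passed on as in the report, e.g. `model_matrix("Y ~ X1 + poly(X2, 2)", data=clean)`, so that no NaN reaches poly().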

s3alfisc commented 9 months ago

Awesome! I'll let you know in case I spot anything else :)