pydata / patsy

Describing statistical models in Python using symbolic formulas

Question: Programmatically creating splines and applying knots to new data #121

Open BrianMiner opened 6 years ago

BrianMiner commented 6 years ago

I have found that I can create a spline on training data and then apply it to test data like this:

from patsy import dmatrix, build_design_matrices

# create
x1 = dmatrix("cr(x, df=3) - 1", {"x": TRAIN_DATA.VARIABLE.values})

# apply
xx1 = build_design_matrices([x1.design_info], {"x": TEST_DATA.VARIABLE.values})

This works, but of course it requires manually creating variables or programmatically building formula strings.

Is there any way to do something like patsy.cr(x, df=5)

and grab the knots to apply to new data using the same function cr()?

thequackdaddy commented 6 years ago

I'm not really an expert, so there's likely an oversight here.

First, do you need to know the knots for some reason? If not, I think the canonical way would be to do something like...

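import numpy as np
import patsy

# Build the design matrix; the knots are chosen from x and stored in dm.design_info
x = np.arange(100)
dm = patsy.dmatrix('cr(x, df=5)', {'x': x})

# Apply the same design (same knots) to new data
new_data = np.arange(25, 75)
patsy.dmatrix(dm.design_info, {'x': new_data})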
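For what it's worth, here is a small self-contained sketch of the same idea using build_design_matrices, as in the original question. It only illustrates that the knots learned from the training data are reused: rows built from new data should match the corresponding rows of the training matrix. The `- 1` (no intercept) and the variable names are just choices made for this example.

```python
import numpy as np
import patsy

# Training data: the spline knots are chosen from these values.
x = np.arange(100)
dm = patsy.dmatrix("cr(x, df=5) - 1", {"x": x})

# New data: reuse the stored design_info so the same knots are applied.
new_data = np.arange(25, 75)
(new_dm,) = patsy.build_design_matrices([dm.design_info], {"x": new_data})

train_rows = np.asarray(dm)[25:75]
new_rows = np.asarray(new_dm)

# Because new_data repeats training values 25..74, the rows should agree.
same_basis = np.allclose(train_rows, new_rows)
print(new_rows.shape, same_basis)
```

If the knots were recomputed from new_data instead, the two sets of rows would generally differ, which is why the design_info is passed along rather than the formula string.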

If you really want to know what the knots were, you could probably dig through the dm.design_info object and find them.
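A rough sketch of what that digging might look like is below. It leans on undocumented internals, so treat every name here as an assumption that may change between patsy versions: that design_info.factor_infos maps each factor to an object with a state dict, that the state dict has a "transforms" entry holding the stateful transform objects, and that the cr object keeps its knots in a private _all_knots attribute.

```python
import numpy as np
import patsy

x = np.arange(100)
dm = patsy.dmatrix("cr(x, df=5)", {"x": x})

# Assumption: factor_infos / state["transforms"] / _all_knots are internal
# details of patsy, not a documented API.
found_knots = {}
for factor, info in dm.design_info.factor_infos.items():
    transforms = info.state.get("transforms", {})
    for obj in transforms.values():
        if hasattr(obj, "_all_knots"):
            found_knots[factor.name()] = np.asarray(obj._all_knots)

for name, knots in found_knots.items():
    print(name, knots)
```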

However, it may be a little easier to pull the CR class out of the cr stateful transform function.

cr = patsy.cr.__patsy_stateful_transform__()
cr.memorize_chunk(x, df=5)
cr.memorize_finish()

cr._all_knots

You could also apply to the new data using...

cr.transform(new_data)
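Putting those pieces together, a minimal sketch might look like the following. The helper name fit_cr is made up for this example, and _all_knots is a private attribute, so this relies on implementation details rather than a documented interface.

```python
import numpy as np
import patsy


def fit_cr(x, **kwargs):
    """Hypothetical helper: fit a cr spline on x and return the transform object."""
    spline = patsy.cr.__patsy_stateful_transform__()
    spline.memorize_chunk(x, **kwargs)  # feed the training data
    spline.memorize_finish()            # compute and store the knots
    return spline


x = np.arange(100)
new_data = np.arange(25, 75)

spline = fit_cr(x, df=5)

# Private attribute holding the knots chosen from the training data.
knots = np.asarray(spline._all_knots)

# Evaluate the spline basis on new data using the memorized knots.
basis_new = np.asarray(spline.transform(new_data))
print(knots)
print(basis_new.shape)
```

This avoids formula strings entirely, at the cost of depending on how the stateful transform happens to be implemented.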