jameschapman19 / cca_zoo

Canonical Correlation Analysis Zoo: A collection of Regularized, Deep Learning based, Kernel, and Probabilistic methods in a scikit-learn style framework
https://cca-zoo.readthedocs.io/en/latest/
MIT License

predict() method for models? #182

Open dmeliza opened 1 year ago

dmeliza commented 1 year ago

The scikit-learn implementations of PLS and CCA have predict() methods that are very useful for cross-validation and forecasting. Is it possible to add these to cca-zoo models where appropriate?

jameschapman19 commented 1 year ago

Pushed a version of this to main

jameschapman19 commented 1 year ago

Works slightly differently to scikit-learn: you pass views (with missing views optionally given as None) and it reconstructs all of the views from the learnt latent dimensions.
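
Roughly like this, as a hypothetical sketch (the import path matches cca_zoo.linear mentioned later in this thread, but the latent_dimensions keyword and the exact return type of predict() are assumptions, not verified API):

import numpy as np
from cca_zoo.linear import CCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Y = rng.standard_normal((100, 8))

# Fit on a list of views, scikit-learn style. The keyword name is assumed.
model = CCA(latent_dimensions=2)
model.fit([X, Y])

# Pass None for a view you want reconstructed; predict() is described as
# returning reconstructions of all views from the learnt latent space.
X_hat, Y_hat = model.predict([X, None])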

dmeliza commented 1 year ago

Thanks! I'll check it out.

dmeliza commented 12 months ago

This works well with my data, but only if the view data are whitened first. I'm not enough of an expert in these methods to say why this might be, but it looks like the methods for generating predictions are quite different in cca-zoo compared to sklearn's PLSRegression.

jameschapman19 commented 12 months ago

If you come back to me in a week and a half, I think I will be able to come up with a more detailed response and a fix.

Basically, your observation is exactly what I would expect, and a colleague of mine has been thinking about this in some depth recently.

We learn weights W_x which transform X W_x = Z_x and W_y which transform Y W_y = Z_y. Going from data to latent space is usually known as a backward problem.

For prediction (or 'generation') we need a forward problem.

For PLS it turns out the forward problem is X = Z W_x^T and Y = Z W_y^T.

But for CCA the forward problem is actually X = Z W_x^T \Sigma_X and Y = Z W_y^T \Sigma_Y, where \Sigma_X and \Sigma_Y are the covariance matrices of X and Y.

The predict function I wrote up quickly for you uses the PLS forward problem (because that's what scikit-learn appears to do).

But notice that if \Sigma_X is the identity then the two forward problems coincide. \Sigma_X is the identity exactly when your data is whitened, and that's why you are seeing what you are seeing.

Based on the above you might be able to implement a CCA prediction function without my help, and if you do get a chance feel free to send a PR :) otherwise I'll do it when I get a moment.
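
For concreteness, a minimal sketch of that CCA forward problem (the helper name is made up and \Sigma_X is estimated empirically from the training view; this is not the library's implementation):

import numpy as np

def cca_reconstruct_view(Z, W, X_train):
    # CCA forward model: X ~ Z W^T Sigma_X, with Sigma_X estimated as the
    # empirical covariance of the training data for that view.
    Sigma = np.cov(X_train, rowvar=False)
    return Z @ W.T @ Sigma

# With whitened data, Sigma is (approximately) the identity, so this
# collapses to the PLS forward model X ~ Z W^T.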

dmeliza commented 12 months ago

I've been digging through the code and looking at the weights, scores, and loadings with my data, and I'm starting to think prediction may be broken for some models in scikit-learn.

To set the context, Y is 58000 by 40 and X is 58000 by 1500. sklearn's PLSRegression works reasonably well with about 10 components; sklearn.cross_decomposition.PLSCanonical, cca_zoo.linear.PLS, and cca_zoo.linear.CCA all produce horrible in-sample predictions unless I whiten the inputs. However, whitening totally destroys out-of-sample performance, so it's not an option.

For PLSRegression (i.e. PLS2), prediction works great for unwhitened data. The class computes a "rotation matrix" P_x that gives Z_x = X P_x. It uses P_x = W_x (Γ^T W_x)^{-1} rather than just W_x as in your example above, where Γ is the matrix of X loadings. Then the prediction is Y = X P_x Δ^T, where Δ is the matrix of Y loadings. This works because Z_y ≈ Z_x α with α = 1: if I fit a line through the X and Y scores it has an intercept of 0 and a slope of 1.
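
That identity is easy to check numerically (a sketch on synthetic data; pre-centring the inputs and passing scale=False sidestep sklearn's internal centring and scaling so the intercept drops out):

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 15))
Y = X[:, :4] @ rng.standard_normal((4, 3)) + 0.1 * rng.standard_normal((200, 3))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

pls = PLSRegression(n_components=4, scale=False).fit(X, Y)

# Prediction by hand: Y_hat = X P_x Delta^T, with P_x = pls.x_rotations_
# and Delta = pls.y_loadings_.
manual = X @ pls.x_rotations_ @ pls.y_loadings_.T
print(np.allclose(manual, pls.predict(X)))  # expect True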

For PLSCanonical, which I think is the same flavor of PLS as cca_zoo.linear.PLS, α is not equal to 1, and it's different for each of the components. So the predictions from the different components are not being scaled appropriately, and the overall predictions look like garbage, because the first component accounts for the lion's share of the variance. I am guessing that this α plays the same role as your \Sigma_X in your post above?

The reason I think there's an error in sklearn is that, according to the User Guide, this factor α needs to be inferred from the data, but I don't see anywhere in the code where it does this. This is my very naive way of trying to fix it:

import numpy as np
from sklearn.linear_model import LinearRegression

# Regress the Y scores on the X scores and keep only the diagonal of the
# coefficient matrix: one scale factor alpha_k per component.
fm = LinearRegression()
fm.fit(model._x_scores, model._y_scores)
alpha = np.diag(np.diag(fm.coef_))

# Insert alpha between the rotations and the Y loadings when predicting.
pred = X_test_scaled @ model.x_rotations_ @ alpha @ model.y_loadings_.T

It seems to work, although I'm sure there's a better way to get α than multiple regression. I haven't tried it yet with CCA. If you have a more sophisticated solution I'm happy to write up a PR, and I can submit an issue to sklearn as well.
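
For what it's worth, one simpler way to get the same diagonal α is a per-component least-squares slope (a sketch assuming the same private _x_scores/_y_scores attributes as above; because PLS X scores are mutually orthogonal, this should match the diagonal of the full regression):

import numpy as np

# alpha_k = <z_x_k, z_y_k> / ||z_x_k||^2 for each component k: the ordinary
# least-squares slope of the Y scores on the X scores, component-wise.
Zx, Zy = model._x_scores, model._y_scores
alpha = np.diag((Zx * Zy).sum(axis=0) / (Zx ** 2).sum(axis=0))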