hlorenzo / py_ddspls

Multi (& Mono) Data-Driven Sparse PLS
MIT License

question about ts attribute #3

Open TalWac opened 2 years ago

TalWac commented 2 years ago

Dear phscha, a new version of py_ddspls has been released (version 1.1.1), taking into account deflation and Ridge regularization. Thanks to this new version, I can show you these figures:

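[Figure, two panels: "RMSEP for R=1, R=2 and R=3, Cross-Validation" (top) and "Norms of components for R=3, Model" (bottom), both plotted against \lambda]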

On the first figure you can see the RMSEP (kfolds=10) for R=1 (blue), R=2 (red) and R=3 (green). The errors are similar for the 3 solutions when \lambda>~0.67, while the error is much higher for R=1 as soon as \lambda<~0.67, and also higher for R=2 as soon as \lambda<~0.45.

On the second figure are given the norms of the 1st (black), 2nd (green) and 3rd (purple) components for the model with R=3. You can see that \lambda~0.67 is where the 2nd component vanishes, while \lambda~0.45 is where the 3rd component vanishes.

According to that analysis, it would be difficult to find two uncorrelated components that describe the Y response better than a model with a single component, unless you accept variables whose correlation with the response is below ~0.67. The single-component model fitted at \lambda~0.67 gives:

Variables selected on component 1 are number [23 24 29 34 45 46 48 49 53 54 55 56]
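
To make the link between \lambda and variable selection concrete, here is a minimal NumPy sketch. It assumes that \lambda acts as a soft threshold on the empirical correlations between the columns of X and the response, which is how the sentence above describes it; it is not the py_ddspls implementation, and `soft_threshold`, `selected_variables` and the toy data are hypothetical names introduced for illustration only.

```python
import numpy as np

def soft_threshold(values, lambd):
    """Shrink values toward zero by lambd; entries with |value| <= lambd become 0."""
    return np.sign(values) * np.maximum(np.abs(values) - lambd, 0.0)

def selected_variables(X, y, lambd):
    """Return 1-based indices of X columns whose |correlation| with y exceeds lambd."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    yc = (y - y.mean()) / y.std()               # standardize the response
    corr = Xc.T @ yc / len(yc)                  # empirical correlations, shape (p,)
    kept = soft_threshold(corr, lambd)
    return np.flatnonzero(kept) + 1

# Toy data (assumption): only the first column drives the response.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy[:, 0] + 0.1 * rng.normal(size=200)
print(selected_variables(X_toy, y_toy, lambd=0.67))
```

With a large \lambda only strongly correlated variables survive; lowering \lambda lets weaker ones in, which is why extra components only appear below some threshold.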

Thank you for your help.

PS: This issue has allowed me to fix a bug in the R package. Thank you for this too.

PPS: Here is the code used (with ddspls version >= 1.1.1):

import py_ddspls as dd
import pickle
from matplotlib import pyplot
import numpy as np
# Open data
X = pickle.load(open("data_ddslps_x.obj","rb"))
Y = pickle.load(open("data_ddslps_y.obj","rb"))
# 10-folds cross-validation on 7 cpus
R = 2
n_lambd = 20
kfolds = 10
lambd_min = 0
cv_R_1 = dd.model.perf_ddspls(X,Y,R=1,kfolds=kfolds,lambd_min=lambd_min,lambd_max=0.98,
                                      n_lambd=n_lambd,NCORES=7,mode="reg",
                                      deflat=True,mu=0.001)
cv_R_2 = dd.model.perf_ddspls(X,Y,R=2,kfolds=kfolds,lambd_min=lambd_min,lambd_max=0.98,
                                      n_lambd=n_lambd,NCORES=7,mode="reg",
                                      deflat=True,mu=0.001)
cv_R_3 = dd.model.perf_ddspls(X,Y,R=3,kfolds=kfolds,lambd_min=lambd_min,lambd_max=0.98,
                                      n_lambd=n_lambd,NCORES=7,mode="reg",
                                      deflat=True,mu=0.001)
x = cv_R_1[:,0]
vars_t = np.zeros((len(x),3))
for i in range(len(x)):
    mod=dd.model.ddspls(X,Y,lambd=cv_R_1[i,1],R=3,mode='reg',deflat=True,mu=0.001)
    vars_t[i,0] = np.linalg.norm(mod.model.ts[0])
    vars_t[i,1] = np.linalg.norm(mod.model.ts[1])
    vars_t[i,2] = np.linalg.norm(mod.model.ts[2])
# Variance annulations
annul_var_T2 = cv_R_1[np.where(vars_t[:,1]==0)[0][0],1]
annul_var_T3 = cv_R_1[np.where(vars_t[:,2]==0)[0][0],1]
# Plot
fig, axs = pyplot.subplots(nrows=2, ncols=1, constrained_layout=True)
fig.set_size_inches(5, 5)
axs[0].vlines(annul_var_T2 ,0.2,1,colors="gray")
axs[0].vlines(annul_var_T3 ,0.2,1,colors="gray")
axs[0].plot(cv_R_1[:,1],cv_R_1[:,2],label='No deflation, R=1',marker="X")
axs[0].plot(cv_R_2[:,1],cv_R_2[:,2],c='r',  label='Deflation, R=2',marker='+')
axs[0].plot(cv_R_3[:,1],cv_R_3[:,2],c='green',  label='Deflation, R=3',marker='$o$')
axs[0].legend(loc='best');
axs[0].set_xlabel(r"$\lambda$")
axs[0].set_title("RMSEP for R=1, R=2 and R=3\n Cross-Validation")
axs[1].hlines(0,np.min(cv_R_1[:,1]),np.max(cv_R_1[:,1]),colors="gray")
axs[1].vlines(annul_var_T2 ,-5,120,colors="gray")
axs[1].vlines(annul_var_T3 ,-5,120,colors="gray")
axs[1].plot(cv_R_1[:,1],vars_t[:,0],label='First comp',marker='X',c="black")
axs[1].plot(cv_R_2[:,1],vars_t[:,1],label='Second comp',c="g",marker='+')
axs[1].plot(cv_R_3[:,1],vars_t[:,2],label='Third comp',c="purple",marker='1')
axs[1].set_title("Norms of components for R=3\n Model")
axs[1].legend(loc='best');
axs[1].set_xlabel(r"$\lambda$")
# Model
mod=dd.model.ddspls(X,Y,lambd=annul_var_T2,R=1,mode='reg',deflat=True,mu=0.001)
u = mod.model.u[0]
var_select_comp_1 = np.where(u[:,0]!=0)[0]
print("Variables selected on component 1 are number "+str(var_select_comp_1+1))
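
A note on the two lookups under `# Variance annulations`: they test `== 0` exactly and index `[0][0]`, which raises an IndexError if a component never vanishes on the chosen \lambda grid. A more defensive lookup could look like the sketch below; `first_vanishing_lambda` and `tol` are hypothetical names, not part of py_ddspls.

```python
import numpy as np

def first_vanishing_lambda(lambdas, norms, tol=1e-12):
    """Return the first lambda whose norm is numerically zero, or None if there is none."""
    lambdas = np.asarray(lambdas, dtype=float)
    norms = np.asarray(norms, dtype=float)
    hits = np.flatnonzero(norms <= tol)          # grid positions where the norm vanishes
    return float(lambdas[hits[0]]) if hits.size else None

# Toy grid (assumption): the norm reaches zero at the third grid point.
toy_lambdas = [0.0, 0.3, 0.6, 0.9]
toy_norms = [5.0, 2.0, 0.0, 0.0]
print(first_vanishing_lambda(toy_lambdas, toy_norms))
print(first_vanishing_lambda(toy_lambdas, [5.0, 2.0, 1.0, 0.5]))
```

In the script above it would be called as `first_vanishing_lambda(cv_R_1[:, 1], vars_t[:, 1])` for the 2nd component, with a check for `None` before plotting.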

_Originally posted by @hlorenzo in https://github.com/hlorenzo/py_ddspls/issues/1#issuecomment-600247211_

TalWac commented 2 years ago

Dear developer,

Thank you for the package and the example code!

I wanted to ask what the ts attribute means. In the code it is documented as:

ts dict: length R. Each element is a nXK matrix : the scores per axis per block

As I understand it, R is the number of components, n is the number of data points/observations, and K is the number of blocks in the X matrix. You used the norm of ts in the code above (mod.model.ts[0]) and I do not understand what it represents.

Thank you for your help.
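
One possible reading of that docstring, not confirmed in this thread: for component r, `ts[r]` is an n x K matrix whose k-th column holds the scores of block k, that is, block k multiplied by its weight vector for that component. Under that assumption, `np.linalg.norm(ts[r])` (the Frobenius norm) is zero exactly when every block has an all-zero weight vector, meaning no variable was selected on component r. The sketch below only illustrates this with made-up data; `component_scores`, the block sizes and the weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Two hypothetical blocks (K = 2) with 4 and 3 variables.
blocks = [rng.normal(size=(n, 4)), rng.normal(size=(n, 3))]

def component_scores(blocks, weights):
    """Stack the per-block scores X_k @ u_k as the columns of an n x K matrix."""
    return np.column_stack([Xk @ uk for Xk, uk in zip(blocks, weights)])

# Component with some selected variables: non-zero weights.
active = [np.array([0.8, 0.0, 0.6, 0.0]), np.array([0.0, 1.0, 0.0])]
# Component where the threshold removed every variable: all-zero weights.
empty = [np.zeros(4), np.zeros(3)]

t_active = component_scores(blocks, active)
t_empty = component_scores(blocks, empty)

print(t_active.shape)             # one row per observation, one column per block
print(np.linalg.norm(t_active))   # positive: the component carries information
print(np.linalg.norm(t_empty))    # zero: the component has vanished
```

Read this way, the second panel of the figure tracks, for each \lambda, whether the 1st, 2nd and 3rd components still contain any selected variable.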