Closed phscha closed 4 years ago
Dear phscha,
Thank you for your interest in this work. It is indeed a young project.
You said Xnorm is an 888x62 dense matrix and Ynorm an 88x1 dense matrix.
Did you mean that Xnorm and Ynorm do not have the same number of individuals (which this software expects in the rows)?
Also, the current version only accepts dictionaries for the covariates. Please retry with something like
X_dict = {0:Xnorm}
mod=dd.model.ddspls(X_dict,Y,...)
I hope this will help, feel free to ask again!
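To make the suggestion above concrete, here is a minimal sketch of wrapping a single covariate matrix in a dictionary and checking that it has as many rows as the response before fitting. The helper name `as_block_dict` and the random data are hypothetical; only the `{0: Xnorm}` form comes from the suggestion above.

```python
import numpy as np


def as_block_dict(X, Y):
    """Wrap one covariate matrix in a dict and check row counts (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        # Make the response a column matrix: one row per individual
        Y = Y.reshape(-1, 1)
    if X.shape[0] != Y.shape[0]:
        raise ValueError(
            f"X has {X.shape[0]} rows but Y has {Y.shape[0]}; individuals must match"
        )
    return {0: X}, Y


# Placeholder data with the shapes discussed in this thread
rng = np.random.default_rng(0)
X_dict, Y_mat = as_block_dict(rng.normal(size=(888, 62)), rng.normal(size=888))
print(X_dict[0].shape, Y_mat.shape)
```

With an 88-row response and an 888-row covariate matrix, this helper raises a `ValueError` before any model code is reached.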
Dear Hadrien, thanks a lot for the fast response! X and Y have the same number of rows (888). There was a typo in my first message. I tried your suggestion with X_dict = {0:Xnorm}. The result is the same error message as in the original post.
Are you sure that Xnorm and Ynorm are numpy arrays? Could you send me just an extract of your tables, say 10 rows and a few columns? Best,
Please find the pickled pair (X_dict, Y_mat) attached.
I do not know the .obj format. Could you give me a script to open it? Best, Hadrien
data_ddspls.zip https://github.com/hlorenzo/py_ddspls/files/4320795/data_ddspls.zip
Sorry, my bad. I've pickled python variables; you can load them using
import pickle
X = pickle.load(open('data_ddslps_x.obj','rb'))
Y = pickle.load(open('data_ddslps_y.obj','rb'))
...
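For readers unfamiliar with pickle, the sketch below shows the same round trip with context managers, so that files are closed promptly. The file name `example_pair.obj` and the toy arrays are placeholders, not the attached data. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-ins for the (X_dict, Y_mat) pair
X_dict = {0: np.arange(12, dtype=float).reshape(4, 3)}
Y_mat = np.arange(4, dtype=float).reshape(4, 1)

path = os.path.join(tempfile.mkdtemp(), "example_pair.obj")

# Write the pair to disk
with open(path, "wb") as fh:
    pickle.dump((X_dict, Y_mat), fh)

# Read it back
with open(path, "rb") as fh:
    X_loaded, Y_loaded = pickle.load(fh)

print(type(X_loaded), Y_loaded.shape)
```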
Thank you for the script. I ran it with Python 3.7 in Spyder, and it seemed to work well in my case, see below. Which Python version do you use?
import py_ddspls as dd
import sklearn.metrics as sklm
import pickle
from matplotlib import pyplot
import numpy as np
X = pickle.load(open("data_ddslps_x.obj","rb"))
Y = pickle.load(open("data_ddslps_y.obj","rb"))
mod=dd.model.ddspls(X,Y,lambd=0.7,R=2,mode='reg',verbose=True)
Y_est_reg = mod.predict(X)
err = sklm.mean_squared_error(Y,Y_est_reg)
pyplot.scatter(Y, Y_est_reg, c = 'red',marker='.')
pyplot.title('MSE='+str(err))
perf_model_reg = dd.model.perf_ddspls(X,Y,R=2,kfolds=10,lambd_min=0.6,n_lambd=20,NCORES=4,mode="reg")
pyplot.plot(perf_model_reg[:,1],perf_model_reg[:,2], linestyle = 'solid')
pyplot.title(r'10-fold cross-validation error against $\lambda$')
pyplot.xlabel(r'$\lambda$')
pyplot.ylabel('RMSE')
pyplot.show()
And what are those data ;) ?
The data are some econometric indices; the 2 components of the model are supposed to be supply and demand. I use python 3.6, also on Spyder, with numpy 1.18.1
Can you try upgrading to Python 3.7? I am not at all sure this is the solution, but it might be. Very informative data!
Same error with py 3.7 :( Minimal script:
import py_ddspls as dd
import pickle
X = pickle.load(open("data_ddslps_x.obj","rb"))
Y = pickle.load(open("data_ddslps_y.obj","rb"))
mod=dd.model.ddspls(X,Y,lambd=0.7,R=2,mode='reg',verbose=True)
Actually, you tried to build 2 components, but the Python version of the package does not handle deflation and here q=1 (the number of response variables). Maybe you can try with R=1.
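The error can be reproduced without the package. The sketch below assumes, based only on the traceback line `svd_k = {"v":v_k_res[:,range(R_w)]}`, that the components come from a singular value decomposition of the cross-product between Y and X. With q=1 that matrix has a single row, so there is only one right singular vector, and asking for two columns fails. The data here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, R = 888, 62, 1, 2
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))

# Cross-product between response and covariates: shape (q, p) = (1, 62)
M = Y.T @ X
_, s, vt = np.linalg.svd(M, full_matrices=False)
v = vt.T  # shape (p, min(q, p)) = (62, 1): a single right singular vector

try:
    v[:, range(R)]  # asks for columns 0 and 1, but only column 0 exists
except IndexError as exc:
    message = str(exc)
    print(message)
```

Without deflation, the number of components available this way is bounded by min(q, p), which is 1 for a single response.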
It works with R=1, but with only one component PLS is of little use for my application... Thank you anyway for your time and help!
It would be of interest to me to know why. Do you have a minute to explain it to me?
The target variable is related to a commodity price. I'd like to model it as a sum of two components: supply and demand. Each of them is a linear combination of a bunch of input variables. Many of those variables are irrelevant, that's why sparse PLS. Essentially, I'm not using PLS to predict anything, but to decompose a multi-dimensional signal into a two-dimensional one
This is interesting. I tried it with the R version of the package.
It built two components where component
I do not know if this is meaningful for you, but you can have a look.
In Python, I saved the dataset as a CSV file:
data = np.concatenate((Y,X[0]),axis=1)
import pandas as pd
pd.DataFrame(data).to_csv("data_issue_1.csv")
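One detail worth checking in this export: by default `DataFrame.to_csv` also writes the row index as a first, unnamed column, so a reader that takes column 1 as the response may pick up the index instead. A minimal sketch with toy data, where passing `index=False` keeps the response in the first column:

```python
import io

import numpy as np
import pandas as pd

# Toy response and covariates, concatenated as in the export above
Y = np.array([[0.5], [1.5]])
X0 = np.array([[1.0, 2.0], [3.0, 4.0]])
data = np.concatenate((Y, X0), axis=1)

with_index = io.StringIO()
pd.DataFrame(data).to_csv(with_index)  # default: index=True

without_index = io.StringIO()
pd.DataFrame(data).to_csv(without_index, index=False)

header_default = with_index.getvalue().splitlines()[0]
header_no_index = without_index.getvalue().splitlines()[0]
print(repr(header_default))   # header starts with an empty field for the index
print(repr(header_no_index))  # header lists only the data columns
```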
In R, I opened the dataset and started the cross-validation. The parameter deflat=T activates deflation. mu is a Ridge-like parameter, which can be set to 0 due to the structure of the data.
install.packages("ddsPLS")
library(ddsPLS)
data <- read.csv("data_issue_1.csv")
X <- scale(data[,-1])
Y <- scale(data[,1,drop=F])
colnames(X) <- 1:ncol(X)
cv <- perf_mddsPLS(X,Y,deflat = T,R = 2,L0 = 1:20,NCORES = 7,mu=0)
mod <- mddsPLS(X,Y,R = 2,L0=2,mu = 0,deflat = T)
print(mod$var_selected[[1]][,1:2])
   Weights_comp_1 Weights_comp_2
11     -0.7270015              0
21     -0.6866359              0
1       0.0000000              1
Maybe it is interesting for you.
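For context, deflation is what allows a second component when there is a single response. The sketch below uses a textbook PLS-style deflation on random data: after extracting a score vector t from X, the part of X explained by t is removed before looking for the next direction. This is an illustration under that assumption, not necessarily the exact scheme used by ddsPLS or py_ddspls.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 6
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = rng.normal(size=(n, 1))
y -= y.mean(axis=0)


def first_direction(X, y):
    """Unit weight vector proportional to X^T y (the only one available when q=1)."""
    w = X.T @ y
    return w / np.linalg.norm(w)


# Component 1
w1 = first_direction(X, y)
t1 = X @ w1

# Deflation: project X onto the orthogonal complement of t1
X_deflated = X - t1 @ (t1.T @ X) / (t1.T @ t1)

# Component 2 is computed on the deflated matrix
w2 = first_direction(X_deflated, y)
t2 = X_deflated @ w2

# The two score vectors are orthogonal by construction
print(np.allclose(t1.T @ t2, 0.0))
```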
Thank you very much! I'm looking forward to your Python package having the same functionality as the one in R.
Dear phscha, a new version of py_ddspls has been released (version 1.1.1), taking into account deflation and Ridge regularization. Thanks to this new version, I can show you these figures:
On the first figure you can see the RMSEP (kfolds=10) for R=1 (blue), R=2 (red) and R=3 (green). The errors are similar for \lambda>~0.67 for the 3 solutions, while the error is much higher for R=1 as soon as \lambda<~0.67, and also higher for R=2 as soon as \lambda<~0.45.
On the second figure are given the norms of the 1st (black), 2nd (green) and 3rd (purple) components for the model with R=3. You can see that \lambda~0.67 corresponds to the vanishing of the 2nd component, and \lambda~0.45 to the vanishing of the 3rd component.
According to that analysis, it would be difficult to find 2 uncorrelated components describing the response Y better than a model with a single component, unless you accept variables whose correlation with the response is below ~0.67. This corresponds to:
Variables selected on component 1 are number [23 24 29 34 45 46 48 49 53 54 55 56]
Thank you for your help.
PS: This issue allowed me to fix a bug in the R package. Thank you for this too.
PPS: Here is the code used (with version ddspls >1.1.1):
import py_ddspls as dd
import pickle
from matplotlib import pyplot
import numpy as np
# Open data
X = pickle.load(open("data_ddslps_x.obj","rb"))
Y = pickle.load(open("data_ddslps_y.obj","rb"))
# 10-folds cross-validation on 7 cpus
R = 2
n_lambd = 20
kfolds = 10
lambd_min = 0
cv_R_1 = dd.model.perf_ddspls(X,Y,R=1,kfolds=kfolds,lambd_min=lambd_min,lambd_max=0.98,
                              n_lambd=n_lambd,NCORES=7,mode="reg",
                              deflat=True,mu=0.001)
cv_R_2 = dd.model.perf_ddspls(X,Y,R=2,kfolds=kfolds,lambd_min=lambd_min,lambd_max=0.98,
                              n_lambd=n_lambd,NCORES=7,mode="reg",
                              deflat=True,mu=0.001)
cv_R_3 = dd.model.perf_ddspls(X,Y,R=3,kfolds=kfolds,lambd_min=lambd_min,lambd_max=0.98,
                              n_lambd=n_lambd,NCORES=7,mode="reg",
                              deflat=True,mu=0.001)
x = cv_R_1[:,0]
vars_t = np.zeros((len(x),3))
for i in range(len(x)):
    mod=dd.model.ddspls(X,Y,lambd=cv_R_1[i,1],R=3,mode='reg',deflat=True,mu=0.001)
    vars_t[i,0] = np.linalg.norm(mod.model.ts[0])
    vars_t[i,1] = np.linalg.norm(mod.model.ts[1])
    vars_t[i,2] = np.linalg.norm(mod.model.ts[2])
# Values of lambda at which the 2nd and 3rd component norms first reach zero
annul_var_T2 = cv_R_1[np.where(vars_t[:,1]==0)[0][0],1]
annul_var_T3 = cv_R_1[np.where(vars_t[:,2]==0)[0][0],1]
# Plot
fig, axs = pyplot.subplots(nrows=2, ncols=1, constrained_layout=True)
fig.set_size_inches(5, 5)
axs[0].vlines(annul_var_T2 ,0.2,1,colors="gray")
axs[0].vlines(annul_var_T3 ,0.2,1,colors="gray")
axs[0].plot(cv_R_1[:,1],cv_R_1[:,2],label='No deflation, R=1',marker="X")
axs[0].plot(cv_R_2[:,1],cv_R_2[:,2],c='r', label='Deflation, R=2',marker='+')
axs[0].plot(cv_R_3[:,1],cv_R_3[:,2],c='green', label='Deflation, R=3',marker='$o$')
axs[0].legend(loc='best')
axs[0].set_xlabel(r"$\lambda$")
axs[0].set_title("RMSEP for R=1, R=2 and R=3\n Cross-Validation")
axs[1].hlines(0,np.min(cv_R_1[:,1]),np.max(cv_R_1[:,1]),colors="gray")
axs[1].vlines(annul_var_T2 ,-5,120,colors="gray")
axs[1].vlines(annul_var_T3 ,-5,120,colors="gray")
axs[1].plot(cv_R_1[:,1],vars_t[:,0],label='First comp',marker='X',c="black")
axs[1].plot(cv_R_2[:,1],vars_t[:,1],label='Second comp',c="g",marker='+')
axs[1].plot(cv_R_3[:,1],vars_t[:,2],label='Third comp',c="purple",marker='1')
axs[1].set_title("Norms of components for R=3\n Model")
axs[1].legend(loc='best')
axs[1].set_xlabel(r"$\lambda$")
# Model
mod=dd.model.ddspls(X,Y,lambd=annul_var_T2,R=1,mode='reg',deflat=True,mu=0.001)
u = mod.model.u[0]
var_select_comp_1 = np.where(u[:,0]!=0)[0]
print("Variables selected on component 1 are number "+str(var_select_comp_1+1))
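The step locating where the 2nd and 3rd components vanish relies on an exact comparison with zero and on at least one grid point satisfying it. A slightly more defensive sketch is given below, using a tolerance and returning None when the component never vanishes on the grid. The helper name `first_vanishing_lambda`, the tolerance and the toy numbers are assumptions for illustration.

```python
import numpy as np


def first_vanishing_lambda(lambdas, norms, tol=1e-12):
    """Return the first lambda whose component norm is (numerically) zero, else None."""
    lambdas = np.asarray(lambdas, dtype=float)
    norms = np.asarray(norms, dtype=float)
    hits = np.flatnonzero(norms <= tol)
    return float(lambdas[hits[0]]) if hits.size else None


# Toy grid and norms: the second component vanishes, the first never does
lambdas = np.linspace(0.0, 0.98, 8)
norms_comp_2 = np.array([9.0, 7.5, 6.0, 3.0, 1.0, 0.0, 0.0, 0.0])
norms_comp_1 = np.array([30.0, 28.0, 25.0, 20.0, 15.0, 12.0, 8.0, 5.0])

print(first_vanishing_lambda(lambdas, norms_comp_2))
print(first_vanishing_lambda(lambdas, norms_comp_1))
```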
Thank you so much!
Dear phscha,
I just modified the package to properly handle deflation (please upgrade to version 1.1.1), and I edited my previous comment accordingly.
In the end I found it strange that R=2 did worse than R=1 for all \lambda.
As you can see, this is now fixed, or so I hope. I have also added R=3 to generalize the idea.
Once again, do not hesitate to give any advice.
Thank you again,
Thanks a bunch! Where did you publish the new release? On pypi, I can only see 1.0.9991 (October 2018).
I cannot currently work with PyPI from my computer, so it is only on GitHub: https://github.com/hlorenzo/py_ddspls/releases/tag/v_111
I think you can install the new version within a terminal through:
pip install --upgrade https://github.com/hlorenzo/py_ddspls/archive/v_111.tar.gz
I hope this helps you,
I found a way to update the PyPI repository. It is now available at https://pypi.org/project/py-ddspls/ and you can install it with:
pip install py-ddspls
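After installing, it can help to confirm which version is actually active in the environment. A minimal sketch using the standard library (Python 3.8 or later); the distribution name `py-ddspls` is taken from the PyPI URL above, and the helper name is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None
print(installed_version("py-ddspls"))
```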
I will release a new version this week, though I do not expect many differences. If you notice any problem, let me know.
See you!
Thank you for submitting this promising public package!
I'm trying to use it with the following command: mod=dd.model.ddspls(Xnorm,Ynorm,lambd=0.7,R=2,mode='reg',verbose=True)
where Xnorm is a 888x62 dense matrix and Ynorm a 88x1 dense matrix, both normalized to zero mean and unit variance. I get the following output, which does not depend on the values of lambd and R:
Traceback (most recent call last):
  File "", line 1, in
    mod=dd.model.ddspls(Xnorm.values,Ynorm.to_frame().values,lambd=0.7,R=2,mode='reg',verbose=True)
  File "C:\Users\Philipp\Anaconda3\lib\site-packages\py_ddspls\model.py", line 250, in __init__
    self.getModel(model)
  File "C:\Users\Philipp\Anaconda3\lib\site-packages\py_ddspls\model.py", line 278, in getModel
    mod = MddsPLS_core(Xs_w,Y,lambd=lambd,R=R,mode=mode,verbose=verbose)
  File "C:\Users\Philipp\Anaconda3\lib\site-packages\py_ddspls\model.py", line 160, in MddsPLS_core
    svd_k = {"v":v_k_res[:,range(R_w)]}
IndexError: index 1 is out of bounds for axis 1 with size 1