hlorenzo / py_ddspls

Multi (& Mono) Data-Driven Sparse PLS
MIT License
Getting a cryptic error when trying to use ddspls #1

Closed phscha closed 4 years ago

phscha commented 4 years ago

Thank you for submitting this promising public package!

I'm trying to use it with the following command:

mod=dd.model.ddspls(Xnorm,Ynorm,lambd=0.7,R=2,mode='reg',verbose=True)

where Xnorm is a 888x62 dense matrix and Ynorm a 88x1 dense matrix, both normalized to zero mean and unit variance. I get the following output, which does not depend on the values of lambd and R:

Traceback (most recent call last):

  File "", line 1, in
    mod=dd.model.ddspls(Xnorm.values,Ynorm.to_frame().values,lambd=0.7,R=2,mode='reg',verbose=True)

  File "C:\Users\Philipp\Anaconda3\lib\site-packages\py_ddspls\model.py", line 250, in __init__
    self.getModel(model)

  File "C:\Users\Philipp\Anaconda3\lib\site-packages\py_ddspls\model.py", line 278, in getModel
    mod = MddsPLS_core(Xs_w,Y,lambd=lambd,R=R,mode=mode,verbose=verbose)

  File "C:\Users\Philipp\Anaconda3\lib\site-packages\py_ddspls\model.py", line 160, in MddsPLS_core
    svd_k = {"v":v_k_res[:,range(R_w)]}

IndexError: index 1 is out of bounds for axis 1 with size 1

hlorenzo commented 4 years ago

Dear phscha, thank you for your interest in this work. It is indeed a young project. You said "Xnorm is a 888x62 dense matrix and Ynorm a 88x1 dense matrix". Did you mean that Xnorm and Ynorm do not have the same number of individuals (which are supposed to be in the rows in this software)? Also, the current version only accepts dictionaries for the covariates. Please retry with something like

X_dict = {0:Xnorm}
mod=dd.model.ddspls(X_dict,Y,...)

I hope this will help, feel free to ask again!

phscha commented 4 years ago

Dear Hadrien, thanks a lot for the fast response! X and Y have the same number of rows (888). There was a typo in my first message. I tried your suggestion with X_dict = {0:Xnorm}. The result is the same error message as in the original post.

hlorenzo commented 4 years ago

Are you sure that Xnorm and Ynorm are numpy arrays? Could you send me just an extract of your tables, say 10 rows and a few columns? Best,
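The checks asked for so far (covariates passed as a dictionary, numpy arrays, same number of rows in X and Y) can be sketched as a small pre-flight function. This is only an illustration: `check_inputs` is a hypothetical helper, not part of py_ddspls, and the zero-filled arrays merely stand in for the real data with the shapes quoted in this thread.

```python
import numpy as np

def check_inputs(X_dict, Y):
    """Hypothetical helper (not part of py_ddspls): validate input shapes."""
    if not isinstance(X_dict, dict):
        raise TypeError("X must be a dict of 2-D arrays, e.g. {0: Xnorm}")
    # np.asarray also converts pandas objects to plain numpy arrays
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y.reshape(-1, 1)  # treat a vector as a single response column
    blocks = {}
    for key, block in X_dict.items():
        block = np.asarray(block, dtype=float)
        if block.ndim != 2:
            raise ValueError(f"block {key!r} must be 2-D, got shape {block.shape}")
        if block.shape[0] != Y.shape[0]:
            raise ValueError(
                f"block {key!r} has {block.shape[0]} rows but Y has {Y.shape[0]}"
            )
        blocks[key] = block
    return blocks, Y

# Placeholder data with the shapes mentioned in this thread
blocks, Y_checked = check_inputs({0: np.zeros((888, 62))}, np.zeros(888))
print(blocks[0].shape, Y_checked.shape)
```

Running such a check before fitting would separate a shape or type problem from a problem inside the package itself.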

phscha commented 4 years ago

Please find the pickled pair (X_dict, Y_mat) attached.

data_ddspls.zip

hlorenzo commented 4 years ago

I do not know the .obj format. Could you give me a script to open it? Best, Hadrien

On Wed, 11 Mar 2020 at 21:43, phscha notifications@github.com wrote:

Please find the pickled pair (X_dict, Y_mat) attached.

data_ddspls.zip https://github.com/hlorenzo/py_ddspls/files/4320795/data_ddspls.zip

phscha commented 4 years ago

Sorry, my bad. I've pickled python variables; you can load them using

import pickle
X = pickle.load(open('data_ddslps_x.obj','rb'))
Y = pickle.load(open('data_ddslps_y.obj','rb'))
...

hlorenzo commented 4 years ago

Thank you for the script. I ran it on Python 3.7 using Spyder. It seemed to work well in my case, see below. Which Python version do you use?

import py_ddspls as dd
import sklearn.metrics as sklm
import pickle
from matplotlib import pyplot
import numpy as np

X = pickle.load(open("data_ddslps_x.obj","rb"))
Y = pickle.load(open("data_ddslps_y.obj","rb"))
mod=dd.model.ddspls(X,Y,lambd=0.7,R=2,mode='reg',verbose=True)
Y_est_reg = mod.predict(X)
err = sklm.mean_squared_error(Y,Y_est_reg)
pyplot.scatter(Y, Y_est_reg, c = 'red',marker='.')
pyplot.title('MSE='+str(err))

[Figure py_issue_1_1: scatter plot produced by the script above]

perf_model_reg = dd.model.perf_ddspls(X,Y,R=2,kfolds=10,lambd_min=0.6,n_lambd=20,NCORES=4,mode="reg")
pyplot.plot(perf_model_reg[:,1],perf_model_reg[:,2], linestyle = 'solid')
# Raw strings keep the backslash of \lambda from being read as an escape sequence
pyplot.title(r'10-folds Cross-validation error against $\lambda$')
pyplot.xlabel(r'$\lambda$')
pyplot.ylabel('RMSE')
pyplot.show()

[Figure py_issue_1_2: cross-validation curve produced by the script above]

hlorenzo commented 4 years ago

And what are those data ;) ?

phscha commented 4 years ago

The data are some econometric indices; the 2 components of the model are supposed to be supply and demand. I use python 3.6, also on Spyder, with numpy 1.18.1

hlorenzo commented 4 years ago

Can you try upgrading to 3.7? I am really not sure this is the solution, but it could be. Very informative data.

On Fri, 13 Mar 2020 at 07:18, phscha notifications@github.com wrote:

The data are some econometric indices; the 2 components of the model are supposed to be supply and demand. I use python 3.6, also on Spyder, with numpy 1.18.1

phscha commented 4 years ago

Same error with py 3.7 :( Minimal script:

import py_ddspls as dd
import pickle

X = pickle.load(open("data_ddslps_x.obj","rb"))
Y = pickle.load(open("data_ddslps_y.obj","rb"))
mod=dd.model.ddspls(X,Y,lambd=0.7,R=2,mode='reg',verbose=True)

hlorenzo commented 4 years ago

Actually, you tried to build 2 components, but the Python version of the package does not handle deflation, and here q=1 (q being the number of response variables). Maybe you can try with R=1.
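One plausible way to see why R=2 fails when q=1: the traceback above stops at a line that selects R columns of the right singular vectors, and a matrix with a single response column has only one such vector. The sketch below is not the code of py_ddspls. It uses random data with the shapes from this thread and assumes the decomposed matrix is the p x q cross-product between covariates and response.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 888, 62, 1          # shapes reported in this thread
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))

# Cross-product matrix between covariates and response: shape (p, q)
M = X.T @ Y
u, s, vh = np.linalg.svd(M, full_matrices=False)
v = vh.T                      # right singular vectors, shape (q, min(p, q)) = (1, 1)

R = 2
try:
    v[:, range(R)]            # asks for 2 columns, only 1 exists
    error_message = None
except IndexError as exc:
    error_message = str(exc)
print(v.shape, error_message)
```

Under that assumption, no more than q components are available from a single decomposition, which matches the advice to use R=1 until deflation is supported.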

phscha commented 4 years ago

It works with R=1, but with only one component PLS is of little use for my application... Thank you anyway for your time and help!

hlorenzo commented 4 years ago

I would be interested to know why. Do you have a minute to explain it to me?

phscha commented 4 years ago

The target variable is related to a commodity price. I'd like to model it as a sum of two components: supply and demand. Each of them is a linear combination of a bunch of input variables. Many of those variables are irrelevant, that's why sparse PLS. Essentially, I'm not using PLS to predict anything, but to decompose a multi-dimensional signal into a two-dimensional one

hlorenzo commented 4 years ago

This is interesting, I tried it on the R version of the package. It built two components where component

I do not know if this is meaningful for you, but you can have a look at it.

In Python, I saved the dataset as CSV:

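import pandas as pd
# Y goes in the first column, followed by the columns of the single X block
data = np.concatenate((Y,X[0]),axis=1)
# index=False keeps the row index out of the CSV, so column 1 is Y as the R script expects
pd.DataFrame(data).to_csv("data_issue_1.csv", index=False)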

In R, I opened the dataset and ran cross-validation. The parameter deflat=T activates deflation. mu is a Ridge-like parameter, which can be set to 0 given the structure of the data.

install.packages("ddsPLS")
library(ddsPLS)
data <- read.csv("data_issue_1.csv")
X <- scale(data[,-1])
Y <- scale(data[,1,drop=F])
colnames(X) <- 1:ncol(X)
cv <- perf_mddsPLS(X,Y,deflat = T,R = 2,L0 = 1:20,NCORES = 7,mu=0)

[Figure ddspls_eco_issue_1]

mod <- mddsPLS(X,Y,R = 2,L0=2,mu = 0,deflat = T)
print(mod$var_selected[[1]][,1:2])

   Weights_comp_1 Weights_comp_2
11     -0.7270015              0
21     -0.6866359              0
1       0.0000000              1

Maybe this is interesting for you.
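For readers unfamiliar with deflation, here is a minimal sketch of the textbook single-response PLS scheme, in which each new component is computed after removing from X what the previous component explains. It is a generic illustration on random data, not the algorithm implemented in ddsPLS or py_ddspls, and `pls1_components` is a hypothetical name.

```python
import numpy as np

def pls1_components(X, y, R):
    """Extract R score vectors for a single response using X-deflation."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    scores = []
    for _ in range(R):
        w = X.T @ y                     # direction most covariant with y
        w /= np.linalg.norm(w)
        t = X @ w                       # component (score vector)
        p = X.T @ t / (t @ t)           # loading of X on t
        X = X - np.outer(t, p)          # deflation: remove what t explains
        scores.append(t)
    return np.column_stack(scores)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)
T = pls1_components(X, y, R=2)
# Off-diagonal entries close to zero: the two components are uncorrelated
print(np.round(T.T @ T, 6))
```

Because the deflated X is orthogonal to the first score vector, the second component is uncorrelated with the first, which is what makes a second component possible even with a single response.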

phscha commented 4 years ago

Thank you very much! I'm looking forward to your Python package having the same functionality as the one in R.

hlorenzo commented 4 years ago

Dear phscha, a new version of py_ddspls has been released (version 1.1.1), taking into account deflation and Ridge regularization. Thanks to this new version, I can show you these figures:

[Figure "save": RMSEP and component norms against \lambda]

On the first figure you can see the RMSEP (kfolds=10) for R=1 (blue), R=2 (red) and R=3 (green). The errors are similar for \lambda>~0.67 for the 3 solutions, while the error is much higher for R=1 as soon as \lambda<~0.67, and also higher for R=2 as soon as \lambda<~0.45. The second figure gives the norms of the 1st (black), 2nd (green) and 3rd (purple) components for the model with R=3. You can see that \lambda~0.67 corresponds to the point where the 2nd component vanishes, and \lambda~0.45 to the point where the 3rd component vanishes.

According to that analysis, it would be difficult to find two uncorrelated components describing the Y response better than a single-component model, unless you accept variables whose correlation with the response is below ~0.67. This corresponds to:

Variables selected on component 1 are number [23 24 29 34 45 46 48 49 53 54 55 56]
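Reading \lambda as a cut-off on the correlation with the response, as the comment above does, the selection step can be illustrated with a soft-thresholding sketch. This is an assumption-laden toy example on simulated data, not the py_ddspls implementation: `soft_threshold` is a hypothetical helper and the three covariates are invented for the illustration.

```python
import numpy as np

def soft_threshold(values, lambd):
    """Shrink each entry towards zero by lambd; entries with |value| <= lambd become 0."""
    return np.sign(values) * np.maximum(np.abs(values) - lambd, 0.0)

rng = np.random.default_rng(0)
n = 500
y = rng.standard_normal(n)
# Three toy covariates: strongly, weakly and not related to y
X = np.column_stack([
    y + 0.3 * rng.standard_normal(n),
    y + 2.0 * rng.standard_normal(n),
    rng.standard_normal(n),
])

# Empirical correlation of each column of X with y
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
corr = Xs.T @ ys / n

thresholded = soft_threshold(corr, lambd=0.67)
selected = np.flatnonzero(thresholded)   # indices of covariates that survive
print(np.round(corr, 2), selected)
```

Lowering lambd in this sketch lets weaker covariates through, which mirrors the trade-off described above between sparsity and the number of components that can be built.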

Thank you for your help.

PS: This issue has allowed me to fix a bug in the R package. Thank you for this too.

PPS: Here is the code used (with version ddspls >1.1.1):

import py_ddspls as dd
import pickle
from matplotlib import pyplot
import numpy as np
# Open data
X = pickle.load(open("data_ddslps_x.obj","rb"))
Y = pickle.load(open("data_ddslps_y.obj","rb"))
# 10-folds cross-validation on 7 cpus
R = 2
n_lambd = 20
kfolds = 10
lambd_min = 0
cv_R_1 = dd.model.perf_ddspls(X,Y,R=1,kfolds=kfolds,lambd_min=lambd_min,lambd_max=0.98,
                                      n_lambd=n_lambd,NCORES=7,mode="reg",
                                      deflat=True,mu=0.001)
cv_R_2 = dd.model.perf_ddspls(X,Y,R=2,kfolds=kfolds,lambd_min=lambd_min,lambd_max=0.98,
                                      n_lambd=n_lambd,NCORES=7,mode="reg",
                                      deflat=True,mu=0.001)
cv_R_3 = dd.model.perf_ddspls(X,Y,R=3,kfolds=kfolds,lambd_min=lambd_min,lambd_max=0.98,
                                      n_lambd=n_lambd,NCORES=7,mode="reg",
                                      deflat=True,mu=0.001)
x = cv_R_1[:,0]
vars_t = np.zeros((len(x),3))
for i in range(len(x)):
    mod=dd.model.ddspls(X,Y,lambd=cv_R_1[i,1],R=3,mode='reg',deflat=True,mu=0.001)
    vars_t[i,0] = np.linalg.norm(mod.model.ts[0])
    vars_t[i,1] = np.linalg.norm(mod.model.ts[1])
    vars_t[i,2] = np.linalg.norm(mod.model.ts[2])
# Values of lambda at which the 2nd and 3rd components vanish
annul_var_T2 = cv_R_1[np.where(vars_t[:,1]==0)[0][0],1]
annul_var_T3 = cv_R_1[np.where(vars_t[:,2]==0)[0][0],1]
# Plot
fig, axs = pyplot.subplots(nrows=2, ncols=1, constrained_layout=True)
fig.set_size_inches(5, 5)
axs[0].vlines(annul_var_T2 ,0.2,1,colors="gray")
axs[0].vlines(annul_var_T3 ,0.2,1,colors="gray")
axs[0].plot(cv_R_1[:,1],cv_R_1[:,2],label='No deflation, R=1',marker="X")
axs[0].plot(cv_R_2[:,1],cv_R_2[:,2],c='r',  label='Deflation, R=2',marker='+')
axs[0].plot(cv_R_3[:,1],cv_R_3[:,2],c='green',  label='Deflation, R=3',marker='$o$')
axs[0].legend(loc='best');
axs[0].set_xlabel(r"$\lambda$")
axs[0].set_title("RMSEP for R=1, R=2 and R=3\n Cross-Validation")
axs[1].hlines(0,np.min(cv_R_1[:,1]),np.max(cv_R_1[:,1]),colors="gray")
axs[1].vlines(annul_var_T2 ,-5,120,colors="gray")
axs[1].vlines(annul_var_T3 ,-5,120,colors="gray")
axs[1].plot(cv_R_1[:,1],vars_t[:,0],label='First comp',marker='X',c="black")
axs[1].plot(cv_R_2[:,1],vars_t[:,1],label='Second comp',c="g",marker='+')
axs[1].plot(cv_R_3[:,1],vars_t[:,2],label='Third comp',c="purple",marker='1')
axs[1].set_title("Norms of components for R=3\n Model")
axs[1].legend(loc='best');
axs[1].set_xlabel(r"$\lambda$")
# Model
mod=dd.model.ddspls(X,Y,lambd=annul_var_T2,R=1,mode='reg',deflat=True,mu=0.001)
u = mod.model.u[0]
var_select_comp_1 = np.where(u[:,0]!=0)[0]
print("Variables selected on component 1 are number "+str(var_select_comp_1+1))

phscha commented 4 years ago

Thank you so much!

hlorenzo commented 4 years ago

Dear phscha, I just modified the version of the package (please upgrade to 1.1.1) to properly handle deflation, and I edited my previous comment. I eventually found it strange that R=2 did worse than R=1 for all \lambda. As you can see, this is now in order, or so I hope. I have also added R=3 to generalize the idea. Once again, do not hesitate to give any advice. Thank you again,

phscha commented 4 years ago

Thanks a bunch! Where did you publish the new release? On pypi, I can only see 1.0.9991 (October 2018).

hlorenzo commented 4 years ago

I currently cannot work on PyPI from my computer, so it is only on GitHub: https://github.com/hlorenzo/py_ddspls/releases/tag/v_111

I think you can install the new version from a terminal with:
pip install --upgrade https://github.com/hlorenzo/py_ddspls/archive/v_111.tar.gz

I hope this helps you,

hlorenzo commented 4 years ago

I found a way to update the PyPI repository. You can now get it at https://pypi.org/project/py-ddspls/ and install it with:
pip install py-ddspls

I will release a new version this week, though I do not expect many differences. If you run into any problem, let me know.

See you!
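After upgrading, it can help to confirm which version pip actually installed, given the earlier confusion between 1.0.9991 on PyPI and 1.1.1 on GitHub. The sketch below uses the standard-library `importlib.metadata` module, which requires Python 3.8 or later (newer than the interpreters mentioned in this thread). `installed_version` is a hypothetical helper name, and the distribution name `py-ddspls` is taken from the pip command above.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Prints the installed version string, or None if the package is not installed
print(installed_version("py-ddspls"))
```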