lasso-net / lassonet

Feature selection in neural networks
MIT License

lassonet as unsupervised feature selection algorithm (LassoNetAutoEncoder does not exist !) #25

Closed B-Seif closed 2 years ago

B-Seif commented 2 years ago

Hi, I would like to use LassoNet as an unsupervised feature selection algorithm, but I can't find an example that shows how to do this in a simple way. The only script that shows a reconstruction example is minst_ae.py, but it doesn't work (I get the error: LassoNetAutoEncoder does not exist!).

My use case: I have an input matrix without labels, and I want a new reduced matrix that keeps only the 30% most important features.

louisabraham commented 2 years ago

Have you looked at mnist_reconstruction.py? https://github.com/lasso-net/lassonet/blob/master/examples/mnist_reconstruction.py

B-Seif commented 2 years ago

It corresponds to what I'm looking for, but unfortunately I can't get the results because I hit this error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [374], in <cell line: 1>()
----> 1 path = model.path(X_train, X_train)

File ~/anaconda3/envs/ieee_vit/lib/python3.9/site-packages/lassonet/interfaces.py:468, in BaseLassoNet.path(self, X, y, X_val, y_val, lambda_seq, lambda_max, return_state_dicts, callback)
    465     is_dense = False
    466     if current_lambda / lambda_start < 2:
    467         warnings.warn(
--> 468             f"lambda_start={self.lambda_start:.3f} "
    469             "might be too large.\n"
    470             f"Features start to disappear at {current_lambda=:.3f}."
    471         )
    473 hist.append(last)
    474 if callback is not None:

ValueError: Unknown format code 'f' for object of type 'str'

Do you have an idea? I'm using Python 3.9.12, and the code that generated the error is:

model = LassoNetRegressor(M=30, n_iters=(300,500), path_multiplier=1.05)
path = model.path(X_train, X_train)

Thanks
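For context on the ValueError itself: the traceback shows a float format spec (:.3f) being applied to self.lambda_start, which at that point is presumably still a string such as "auto" (the later warning in this thread says "selected automatically"). A minimal reproduction of that Python behavior:

```python
# Formatting a string with a float format code raises exactly
# the error seen in the traceback above.
try:
    f"{'auto':.3f}"
except ValueError as e:
    print(e)  # Unknown format code 'f' for object of type 'str'
```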

louisabraham commented 2 years ago

My last fix for #18 was not working properly. Can you try again after updating lassonet to the latest version?

B-Seif commented 2 years ago

How can I get the latest version of lassonet? Maybe with pip install lassonet? I did that and the error persists. Do you have an idea how to overcome this problem?

louisabraham commented 2 years ago

pip install -U lassonet

-U is for upgrade :)
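To confirm which release actually ended up installed after the upgrade, one can query package metadata with the standard library (nothing lassonet-specific is assumed here):

```python
from importlib.metadata import PackageNotFoundError, version

# Report the lassonet version visible to this interpreter;
# PackageNotFoundError means it is not installed here at all.
try:
    print("lassonet", version("lassonet"))
except PackageNotFoundError:
    print("lassonet is not installed in this environment")
```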

B-Seif commented 2 years ago

I still have the same error :(

louisabraham commented 2 years ago

Can you paste the error? I literally changed the code https://github.com/lasso-net/lassonet/commit/d76a32641421e01952529dead88836dfd5ae58bd

B-Seif commented 2 years ago

OK, I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [419], in <cell line: 2>()
      1 model = LassoNetRegressor(M=30, n_iters=(300,500), path_multiplier=1.05)
----> 2 path = model.path(X_train, X_train)

File ~/anaconda3/envs/ieee_vit/lib/python3.9/site-packages/lassonet/interfaces.py:468, in BaseLassoNet.path(self, X, y, X_val, y_val, lambda_seq, lambda_max, return_state_dicts, callback)
    465     is_dense = False
    466     if current_lambda / lambda_start < 2:
    467         warnings.warn(
--> 468             f"lambda_start={self.lambda_start:.3f} "
    469             "might be too large.\n"
    470             f"Features start to disappear at {current_lambda=:.3f}."
    471         )
    473 hist.append(last)
    474 if callback is not None:

ValueError: Unknown format code 'f' for object of type 'str'

louisabraham commented 2 years ago

You clearly don't have the last version. Maybe uninstall lassonet and reinstall?

louisabraham commented 2 years ago

Maybe pip install "lassonet>=0.0.12" ?

B-Seif commented 2 years ago

I have the last version 0.0.12

louisabraham commented 2 years ago

Maybe you have a problem with your envs. I checked and the latest version looks like this
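When several environments are in play (both an anaconda env and a ~/.local path appear in the tracebacks in this thread), a small stdlib-only check shows which interpreter is running and which copy of the package it would import:

```python
import sys
from importlib.util import find_spec

# Which interpreter is running, and which copy of lassonet
# (if any) it would import.
spec = find_spec("lassonet")
print("interpreter:", sys.executable)
print("lassonet:", spec.origin if spec else "not installed here")
```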

(screenshot of interfaces.py in the latest version)

louisabraham commented 2 years ago

For the record, here is a minimal example for auto encoders:

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

from lassonet import LassoNetRegressor

X, _ = fetch_california_housing(return_X_y=True)
X = StandardScaler().fit_transform(X)

model = LassoNetRegressor(verbose=2)
path = model.path(X, X)
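To get back to the original use case (keep only ~30% of the features), one could walk the returned path and stop at the first point where the number of selected features falls below the target. The snippet below simulates the path with a toy stand-in class; the real path entries expose a boolean selected mask, as used later in this thread:

```python
import numpy as np

# Toy stand-in for a LassoNet path step; real path entries carry
# a boolean `selected` mask over the input features.
class Step:
    def __init__(self, mask):
        self.selected = np.array(mask, dtype=bool)

X = np.random.rand(5, 10)
path = [
    Step([1] * 10),           # all features still selected
    Step([1] * 7 + [0] * 3),  # regularization starts pruning
    Step([1] * 3 + [0] * 7),
    Step([1] + [0] * 9),
]

target = int(0.3 * X.shape[1])  # keep ~30% of the features
mask = next(p.selected for p in path if p.selected.sum() <= target)
X_reduced = X[:, mask]
print(X_reduced.shape)  # (5, 3)
```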

B-Seif commented 2 years ago

Hi, it works for me. I have another question regarding the use of LassoNet in the supervised paradigm: could we use or adapt this algorithm in the multilabel setting?

louisabraham commented 2 years ago

Sure. You would need to inherit the base lassonet class and change the loss function as well as some casting functions.

Look at interfaces.py and how we implemented classifiers.
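As a rough illustration (not the library's actual API, and the real hooks in interfaces.py may differ), the main change for a hypothetical multilabel subclass would be swapping the criterion for an element-wise binary cross-entropy over the label matrix, since each sample can have several active labels at once. A dependency-free sketch of that loss:

```python
import math

def bce_with_logits(logits, targets):
    """Mean element-wise binary cross-entropy on raw scores: the
    kind of criterion a hypothetical multilabel LassoNet subclass
    would minimize instead of softmax cross-entropy."""
    total = n = 0
    for row_z, row_y in zip(logits, targets):
        for z, y in zip(row_z, row_y):
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid per label
            total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            n += 1
    return total / n

# One sample, three labels, two of them active at once.
print(round(bce_with_logits([[2.0, -1.0, 0.5]], [[1.0, 0.0, 1.0]]), 4))  # 0.3048
```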

louisabraham commented 2 years ago

I would be happy to review a PR implementing Multilabel classification.

B-Seif commented 2 years ago

Hi, I ran LassoNet to reconstruct some relatively large data (17000 x 8000). It runs for hours, and afterwards my terminal shows that the process was killed. I also often get this warning message:

.local/lib/python3.9/site-packages/lassonet/interfaces.py:467: UserWarning: lambda_start=429496.730 (selected automatically) might be too large.
Features start to disappear at current_lambda=429496.730.
  warnings.warn(
Killed

Do you have an idea what explains this? Is there a hyperparameter that influences it? Below is my code:

M = np.random.uniform(0, 10_000)
path_multiplier = np.random.uniform(1.01, 1.5)
hidden1 = np.random.randint(10, 100)
hidden2 = np.random.randint(100, 200)
start_time = time.time()
model = LassoNetRegressor(M=M, path_multiplier=path_multiplier, hidden_dims=(hidden1, hidden2))
path = model.path(X_train, X_train)
tmp = time.time() - start_time

louisabraham commented 2 years ago

You should normalize your features first.
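For reference, "normalize" here means standardizing each column to zero mean and unit variance; a plain-NumPy equivalent of sklearn's StandardScaler (as used elsewhere in this thread):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1e4, size=(100, 20))  # wildly scaled features

# Standardize each column to mean 0, std 1 (what
# sklearn.preprocessing.StandardScaler does).
X = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X.mean(axis=0), 0), np.allclose(X.std(axis=0), 1))  # True True
```

Unscaled columns inflate the automatically selected lambda_start, which matches the huge lambda_start=429496.730 in the warning above.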

B-Seif commented 2 years ago

The data is already normalized!

B-Seif commented 2 years ago

Can I share the code and data with you so you can take a look?

louisabraham commented 2 years ago

Sure, my email is on my personal page!

B-Seif commented 2 years ago

Ok, thanks

louisabraham commented 2 years ago

import pandas as pd
from lassonet import LassoNetRegressor
from lassonet import plot_path

from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# read data
X_train = pd.read_csv("eurlex-ev-fold1-train.arff.csv").iloc[:, : 8993 - 3993].values
X_train = StandardScaler().fit_transform(X_train)
# reconstruction
model = LassoNetRegressor(
    lambda_start=5e-1,
    path_multiplier=1.05,
    n_iters=(20, 5),
    verbose=2,
)
path = model.path(X_train, X_train)
plot_path(model, path, X_train, X_train)
plt.show()

(Figure_1: plot_path output)

louisabraham commented 2 years ago

I ran lassonet successfully using those parameters. I plotted the path on the training dataset, so the performance is best with all the features.

In [12]: [(p.selected.sum().item(), p.val_loss)for p in path]
Out[12]: 
[(5000, 0.7403443455696106),
 (5000, 0.7388657331466675),
 (5000, 0.7372983694076538),
 (5000, 0.7356559634208679),
 (5000, 0.7339461445808411),
 (5000, 0.732173502445221),
 (5000, 0.7303400635719299),
 (5000, 0.728447437286377),
 (5000, 0.7264959216117859),
 (5000, 0.724486231803894),
 (5000, 0.7224189639091492),
 (5000, 0.7202948331832886),
 (5000, 0.7181147336959839),
 (5000, 0.715880274772644),
 (5000, 0.7135931849479675),
 (5000, 0.7112559676170349),
 (5000, 0.7088716626167297),
 (5000, 0.7064440846443176),
 (5000, 0.7039777636528015),
 (5000, 0.7014783620834351),
 (5000, 0.6989524960517883),
 (5000, 0.6964077949523926),
 (5000, 0.6938537359237671),
 (5000, 0.6913006901741028),
 (5000, 0.6887612342834473),
 (5000, 0.6862493753433228),
 (5000, 0.6837817430496216),
 (5000, 0.6813769936561584),
 (5000, 0.6790563464164734),
 (5000, 0.6768444776535034),
 (5000, 0.6747689843177795),
 (5000, 0.6728613376617432),
 (5000, 0.6711570024490356),
 (5000, 0.6696965098381042),
 (5000, 0.6685250997543335),
 (5000, 0.6676939725875854),
 (5000, 0.6672609448432922),
 (5000, 0.6672911047935486),
 (5000, 0.6678575277328491),
 (5000, 0.6690409779548645),
 (5000, 0.6709340214729309),
 (5000, 0.6736408472061157),
 (5000, 0.6772762537002563),
 (5000, 0.6819731593132019),
 (5000, 0.687872588634491),
 (5000, 0.6951332092285156),
 (5000, 0.7039183974266052),
 (5000, 0.714412271976471),
 (5000, 0.7267968654632568),
 (5000, 0.7412514090538025),
 (5000, 0.7578516602516174),
 (5000, 0.7759532332420349),
 (5000, 0.7900420427322388),
 (5000, 0.7914730906486511),
 (5000, 0.7910680770874023),
 (5000, 0.7905250787734985),
 (5000, 0.7899476289749146),
 (5000, 0.7894152402877808),
 (5000, 0.7889501452445984),
 (5000, 0.7885493636131287),
 (5000, 0.7882020473480225),
 (5000, 0.7878993153572083),
 (5000, 0.7876302003860474),
 (5000, 0.7873923182487488),
 (5000, 0.7871778607368469),
 (5000, 0.7869775295257568),
 (5000, 0.7867907285690308),
 (5000, 0.7866113185882568),
 (5000, 0.7864405512809753),
 (5000, 0.7862793803215027),
 (5000, 0.7861262559890747),
 (5000, 0.785978376865387),
 (5000, 0.7858325839042664),
 (5000, 0.785689651966095),
 (5000, 0.7855460047721863),
 (5000, 0.7854057550430298),
 (5000, 0.7852645516395569),
 (5000, 0.7851291298866272),
 (5000, 0.7849962115287781),
 (5000, 0.7848662734031677),
 (5000, 0.7847418189048767),
 (5000, 0.7846243381500244),
 (5000, 0.7845121622085571),
 (5000, 0.7844087481498718),
 (5000, 0.7843157052993774),
 (5000, 0.7842352986335754),
 (5000, 0.7841715812683105),
 (5000, 0.7841252684593201),
 (5000, 0.7840957045555115),
 (5000, 0.7840867042541504),
 (5000, 0.7841042876243591),
 (4999, 0.7841560244560242),
 (4998, 0.7842445373535156),
 (4988, 0.7843688130378723),
 (4967, 0.7845369577407837),
 (4916, 0.7847509384155273),
 (4835, 0.7850145101547241),
 (4691, 0.7853205800056458),
 (4458, 0.7856589555740356),
 (4089, 0.7860158681869507),
 (3566, 0.7863755822181702),
 (2730, 0.7866988182067871),
 (1757, 0.7869724631309509),
 (817, 0.7871779203414919),
 (266, 0.7872892022132872),
 (53, 0.7873258590698242),
 (11, 0.7873356938362122),
 (1, 0.7873363494873047),
 (0, 0.7873362302780151)]

louisabraham commented 2 years ago

Maybe you could run the code above but call plot_path on a different fold. That way you will know whether the model was simply overfitting. The validation loss (computed on a random subset of the data) indicates that the performance is poor.
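A sketch of the held-out-fold idea, using plain NumPy for the split; the lassonet calls that would follow are only sketched in the note below, not verified against the library:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))

# Hold out 20% of the rows; fit the reconstruction path on X_tr
# and evaluate/plot on X_va so the losses reflect unseen data.
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
X_tr, X_va = X[idx[:cut]], X[idx[cut:]]
print(X_tr.shape, X_va.shape)  # (800, 50) (200, 50)
```

With lassonet this would look roughly like path = model.path(X_tr, X_tr, X_val=X_va, y_val=X_va) followed by plot_path(model, path, X_va, X_va), where the X_val/y_val parameter names are taken from the path signature visible in the tracebacks earlier in this thread.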