mvlearn / mvlearn

Python package for multi-view machine learning
https://mvlearn.github.io/
MIT License

Code never considers the unlabelled pool size I keep on getting the same confusion matrix even for different unlabelled pool size #299

Open neel6762 opened 2 years ago

neel6762 commented 2 years ago

Expected Behavior

Tell us what should happen

Actual Behavior

Tell us what happens instead

Template Code

Paste the template code (ideally a minimal example) that causes the issue

Full Traceback

Paste the full traceback in case there is an exception

Your Environment

gavinmischler commented 2 years ago

Could you post a code snippet so we can reproduce your issue? This is about CTClassifier, right? In the CTClassifier code, it does make use of the unlabeled_pool_size input, but it may just be that with your specific data it always converges to the same classifications no matter what this is set as.

neel6762 commented 2 years ago

Thank you. It might be that case, but even when I check the `predict_proba` scores individually, they are actually high. I might be implementing it the wrong way; could you please verify? In addition, there is very limited documentation on how to use the unlabelled data in this scenario. This is how I am trying it, based on the original paper:

```python
# split indices: 20% unlabelled, 60% train, 20% test
unlabelled_data_ind = round(len(df) * 0.20)
train_ind = unlabelled_data_ind + round(len(df) * 0.60)
test_ind = train_ind + round(len(df) * 0.20)

unlabelled = df.iloc[:unlabelled_data_ind]
train = df.iloc[unlabelled_data_ind:train_ind]
test = df.iloc[train_ind:]

# assigning the data to train, test and unlabelled sets
X_train = train.drop('target', axis=1)
y_train = train['target']

X_test = test.drop('target', axis=1)
y_test = test['target']

X_unlabeled = unlabelled.drop('target', axis=1)
y_unlabeled = unlabelled['target']
y_unlabeled.loc[:] = None  # mark the unlabelled targets as missing

X_train = pd.concat([X_train, X_unlabeled], axis=0)
y_train = pd.concat([y_train, y_unlabeled], axis=0)

# split the 27 features into two views
n_features = 27
X_train1 = X_train.iloc[:, :n_features // 2]
X_train2 = X_train.iloc[:, n_features // 2:]
X_test1 = X_test.iloc[:, :n_features // 2]
X_test2 = X_test.iloc[:, n_features // 2:]

estimator1 = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features='sqrt')
estimator2 = SVC(kernel='linear', probability=True)
ctc = CTClassifier(estimator1, estimator2, random_state=1, unlabeled_pool_size=30)
ctc = ctc.fit([X_train1, X_train2], y_train)
preds = ctc.predict([X_test1, X_test2])
predict_proba_values = ctc.predict_proba([X_test1, X_test2])

print("f1-score:", f1_score(y_test, preds))
print("accuracy:", accuracy_score(y_test, preds))
print("Confusion Matrix")
print(confusion_matrix(y_test, preds))
```

Plotting the ROC curve:

```python
ctc_fpr, ctc_tpr = plot_roc(y_test, predict_proba_values)
```

The resulting confusion matrix:

```
[[42, 18],
 [ 0,  0]]
```
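One detail worth noting: the second row of `confusion_matrix(y_test, preds)` counts true positive-class samples, so an all-zero second row suggests `y_test` contains no positive examples at all. A sequential `df.iloc` split can produce exactly this whenever the dataframe happens to be ordered by the target. The following sketch (toy data, not the issue author's dataframe) reproduces the problem and shows a stratified split as a fix:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# A toy dataframe sorted by target, mimicking a sequential split gone wrong
df = pd.DataFrame({"a": np.arange(100), "target": [0] * 60 + [1] * 40})

# Sequential split: the last 20% is all class 1 here
# (in the issue, the equivalent slice was all class 0)
seq_test = df.iloc[round(len(df) * 0.80):]
print(seq_test["target"].value_counts().to_dict())  # only one class

# Stratified split keeps both classes in every subset
train, test = train_test_split(df, test_size=0.20,
                               stratify=df["target"], random_state=1)
print(test["target"].value_counts().to_dict())  # both classes present
```

Shuffling or stratifying the train/test split would at least make the confusion matrix meaningful, independent of any co-training behavior.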

Help would be good

RavishaSharma commented 2 years ago

Based on my understanding, for the semi-supervised scenario the parameters to `ctc.fit()` should be: 1) the two data views on which the two classifiers train, and 2) `y`, the combination of labelled and unlabelled target values. These parameters were set accordingly. The other calls (`ctc.predict` and `ctc.predict_proba`) were also passed their parameters correctly, in accordance with the original paper on semi-supervised co-training. I am not sure why the results are incorrect. Any help or guidance would be appreciated, as there is very limited information available on using this in a semi-supervised setting.

gavinmischler commented 2 years ago

It's a little hard to debug without your actual data, but a couple things: