mabelc / SSC

The ssc R package

Method to retrieve predictions from SSC #4

Open jllavin77 opened 2 years ago

jllavin77 commented 2 years ago

Dear developers,

I was looking for a semi-supervised ML method in R and found your excellent package. I tried your example code, adapted it to my input data, and after some reformatting it apparently works well. My problem is how to access the prediction results for each row of my input table. This may sound naive, but I can't find the code to access the classification assigned to each of the "unlabeled" rows in my table by any of the methods in your vignette's example code. I can access the summary of how many samples were assigned to each class, but I'd like to access each row's individual class/label prediction (in data-frame format, for instance). I hope I explained myself clearly enough. Thanks in advance, and congratulations on your nice work.

mabelc commented 2 years ago

Thanks for your interest! Please, use the predict method and supply the instances that were unlabeled during the training. That way you are using the transductive capabilities of the model because those instances were also seen during the training. Hope I helped. If you still have questions don't hesitate to ask.
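In code, that suggestion looks roughly like this (a minimal sketch; `m`, `xtrain`, and `tra.na.idx` are assumed to be the fitted model, the training matrix, and the indices of the rows whose labels were set to `NA` before training, as in the vignette example):

```r
# Predict the rows that were unlabeled during training (transductive use)
unlabeled <- xtrain[tra.na.idx, ]
pred.label <- predict(m, unlabeled)
pred.label  # one predicted class per unlabeled row
```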

jllavin77 commented 2 years ago

My question is more about having a function that returns that information in table format; using predict doesn't seem to provide it. You suggest using predict on my unlabeled data, but which model should I use for that prediction? Could you provide a code example? Is it something similar to this snippet?

```r
###################### REDUCED CODE ######################
m <- selfTraining(x = xtrain, y = ytrain, learner = knn3, learner.pars = list(k = 1))
pred <- predict(m, xitest, interval = "confidence")
summary(pred)
```

Once I run this prediction, how do I get the data I'm really looking for? This way I end up with a summary of the predictions but no clue which label corresponds to which row. Do you see what I mean?

mabelc commented 2 years ago

I think I understand what you are looking for. Could you please try this code? If it does not solve your problem, please keep asking!

```r
library(ssc)
library(caret)  # provides knn3

# Load the Iris data set
data(iris)
x <- iris[, -5]  # instances without classes
x <- as.matrix(x)
y <- iris$Species

# Prepare data: use 50% of instances for training
set.seed(1)
tra.idx <- sample(x = length(y), size = ceiling(length(y) * 0.5))
xtrain <- x[tra.idx, ]  # training instances
ytrain <- y[tra.idx]    # classes of training instances

# Use 70% of training instances as the unlabeled set
tra.na.idx <- sample(x = length(tra.idx), size = ceiling(length(tra.idx) * 0.7))
ytrain[tra.na.idx] <- NA  # remove class information of unlabeled instances

# Train self-training with knn3 as the base classifier
m <- selfTraining(x = xtrain, y = ytrain, learner = knn3, learner.pars = list(k = 1))

# Transductive test: "transductive" because we predict the instances
# that were unlabeled during the training
xttest <- xtrain[tra.na.idx, ]
pred.label <- predict(m, xttest)

# Build a matrix with the unlabeled training data plus the labels
# predicted by selfTraining-knn3
xttest <- cbind(xttest, pred.label)
xttest
```
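One caveat worth noting here (not from the thread, just how base R behaves): `cbind()` on a numeric matrix coerces the factor `pred.label` to its underlying integer codes, so the species names are lost. Since the original request was for a data frame, a small variant that keeps the labels readable (reusing `xtrain`, `tra.na.idx`, and `pred.label` from the code above):

```r
# A data.frame keeps pred.label as a factor with readable class names,
# whereas cbind() on a matrix coerces it to integer codes
result <- data.frame(xtrain[tra.na.idx, ], predicted = pred.label)
head(result)
```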

jllavin77 commented 2 years ago

Dear @mabelc,

Thank you very much for your piece of code. It works, and was exactly what I was asking for.

Just one more question: I have read the selfTraining function documentation and cannot figure out how to change the learner parameter from knn3 to random forest, SVM, or any other classifier. Is there a list of the available classifiers explained somewhere?

Thanks in advance for your kind help.

mabelc commented 2 years ago

Hi,

In this paper, https://cran.r-project.org/web/packages/ssc/vignettes/ssc.pdf, you can find many examples with different learners. I have modified the previous example to use an SVM as the learner. Basically, you can use learners from the R ecosystem; the generic functions provided will help you with that. In the example I am using the generic version of selfTraining, named selfTrainingG.

```r
library('ssc')
library('e1071')

# Load the Iris data set
data(iris)
x <- iris[, -5]  # instances without classes
x <- as.matrix(x)
y <- iris$Species

# Prepare data: use 50% of instances for training
set.seed(1)
tra.idx <- sample(x = length(y), size = ceiling(length(y) * 0.5))
xtrain <- x[tra.idx, ]  # training instances
ytrain <- y[tra.idx]    # classes of training instances

# Use 70% of training instances as the unlabeled set
tra.na.idx <- sample(x = length(tra.idx), size = ceiling(length(tra.idx) * 0.7))
ytrain[tra.na.idx] <- NA  # remove class information of unlabeled instances

# Wrapper functions to train an SVM
gen.learner <- function(indexes, cls)
  e1071::svm(x = xtrain[indexes, ], y = cls,
             type = 'C-classification', probability = TRUE)

gen.pred <- function(model, indexes) {
  p <- predict(model, xtrain[indexes, ], probability = TRUE)
  attr(p, "probabilities")
}

# Train generic self-training with SVM as the base classifier
m <- selfTrainingG(y = ytrain, gen.learner, gen.pred)

# Transductive test: "transductive" because we predict the instances
# that were unlabeled during the training
xttest <- xtrain[tra.na.idx, ]
pred.label <- predict(m$model, xttest)

# Build a matrix with the unlabeled training data plus the labels
# predicted by selfTraining-SVM
xttest <- cbind(xttest, pred.label)
xttest
```