suiji / Arborist

Scalable decision tree training and inference.

yPred from predict is not the class label #41

Closed gse-cc-git closed 6 years ago

gse-cc-git commented 6 years ago

Hello, I am wondering whether this is the intended result, or whether I have misunderstood what yPred really is. I was caught out using yPred as the predicted class label, which seems not to be the case. A minimal reproducible example (MRE) follows:

# From package help:
library(Rborist)
# Classification example:
data(iris)
# Generic invocation:
rb <- Rborist(iris[,-5], iris[,5])
pred <- predict(rb, iris[,-5], ctgCensus = "prob")
yPred <- pred$yPred

yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
[55] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
[109] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Training uses column 5, Species, as the target variable.

levels(iris[,5])
[1] "setosa"     "versicolor" "virginica"
yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[56] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
[111] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Prediction returns numerics. From that, I understand that 1 corresponds to level setosa, 2 to level versicolor and 3 to virginica.
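The mapping just described can be sketched in base R. This is only an illustration, assuming the integer codes follow the order of the training factor's levels, as the output above suggests; `codes` is a hypothetical stand-in for `pred$yPred`, not actual package output.

```r
data(iris)

# Hypothetical stand-in for pred$yPred (assumption: codes index the training levels)
codes <- c(1L, 2L, 3L, 2L)

# Index the training response's levels by the integer codes
pred_labels <- levels(iris$Species)[codes]
print(pred_labels)
```

The same indexing would apply to the full `yPred` vector, provided the levels come from the response actually used for training.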

What if I encode the levels as numerics?

library(dplyr)  # provides %>%, mutate() and filter()
iris_mod <- iris %>%
  mutate(species_num = as.factor(as.numeric(Species)))

rb_with <- Rborist(iris_mod[,-c(5,6)], iris_mod$species_num)

rb_with$validation$yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[67] 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3
[133] 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

pred_with <- predict(rb_with, iris_mod[,-c(5,6)])
pred_with$yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[56] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
[111] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# Missing level 2 (versicolor)
iris_mod2 <- iris %>%
  filter(!Species %in% 'versicolor') %>%
  mutate(species_num = as.factor(as.numeric(Species)))

levels(iris_mod2$species_num)
[1] "1" "3"

Class label "2" is missing from the training dataset and thus cannot be predicted.

rb_without <- Rborist(iris_mod2[,-c(5,6)], iris_mod2$species_num, ctgCensus = "prob")

# Level "2" in yPred ?
rb_without$validation$yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
[67] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

# Labels are Ok
rb_without$validation$confusion
   1  3
1 50  0
3  1 49

# Labels are Ok
head(rb_without$validation$prob)
            1            3
[1,] 1.000000 0.0000000000
[2,] 1.000000 0.0000000000
[3,] 1.000000 0.0000000000
[4,] 1.000000 0.0000000000
[5,] 1.000000 0.0000000000
[6,] 0.999098 0.0009020076

pred_without <- predict(rb_without, iris[,-5], ctgCensus = "prob")

# Levels "2" in yPred ?
pred_without$yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[56] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
[111] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

# Labels are Ok
head(pred_without$census)
       1 3
[1,] 500 0
[2,] 500 0
[3,] 500 0
[4,] 500 0
[5,] 500 0
[6,] 500 0

# Labels are Ok
head(pred_without$prob)
[1,] 0.9999556 4.444507e-05
[2,] 0.9999556 4.444507e-05
[3,] 0.9999556 4.444507e-05
[4,] 0.9999556 4.444507e-05
[5,] 0.9999556 4.444507e-05
[6,] 0.9983056 1.694357e-03

But that's ok, I can get the predicted class label returned using

colnames(pred_without$prob)[apply(pred_without$prob, 1, which.max)]
  [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
 [34] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3"
[67] "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "1"
[100] "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3"
[133] "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3"
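As a possible alternative to the `apply(..., which.max)` call above, base R's `max.col` computes the row-wise argmax directly. The sketch below uses a small hypothetical matrix, `prob`, with the same column names as `pred_without$prob`; the values are made up for illustration.

```r
# Hypothetical probability matrix; columns are named after the class labels
prob <- matrix(c(0.9, 0.1,
                 0.2, 0.8,
                 0.6, 0.4),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("1", "3")))

# max.col gives, for each row, the index of the column holding the maximum;
# ties.method = "first" breaks ties the same way which.max does
pred_label <- colnames(prob)[max.col(prob, ties.method = "first")]
print(pred_label)
```

Either form recovers labels from the column names, so it does not depend on how `yPred` is encoded.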

Thanks.

suiji commented 6 years ago

yPred has integer type, not numeric. The census and confusion matrices are decorated with the class labels, although not the inferred response. As you note, the inferred factor levels employ the same mapping as those used to train. The level-to-string mapping is available from the trained object, and should be applied when attempting to reconcile separately-trained cases. Perhaps we should consider offering the decorations automatically. FWIW, when performing inference with differing _predictor_ factor levels, appropriate adjustments are made internally, with warnings issued when appropriate. Closing this, but please feel free to reopen if there is more to discuss.
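One way to reconcile separately-trained cases might look like the sketch below. It is an illustration under stated assumptions, not the package's own mechanism: the level vectors are written out by hand rather than read from any particular field of the trained object, and `decode` is a hypothetical helper.

```r
# Levels seen by two hypothetical, separately trained models
levels_full   <- c("1", "2", "3")  # trained on all three classes
levels_subset <- c("1", "3")       # trained without class "2"

# Hypothetical integer predictions from each model
codes_full   <- c(1L, 2L, 3L)
codes_subset <- c(1L, 2L, 2L)

# Decode codes with the model's own levels, then express them on a shared level set
decode <- function(codes, lv, all_levels) {
  factor(lv[codes], levels = all_levels)
}

pred_full   <- decode(codes_full, levels_full, levels_full)
pred_subset <- decode(codes_subset, levels_subset, levels_full)

print(as.character(pred_subset))
```

Decoding each vector with its own model's levels first avoids reading code 2 from the subset model as class "2", which is the confusion in the example above.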

suiji commented 6 years ago

Inferred and trained response now include the level decorations. Thank you for the suggestion.