paulvanderlaken / ppsr

R implementation of Predictive Power Score
GNU General Public License v3.0

F1 is incorrect when cv_folds > 1 #35

Open SamGG opened 3 years ago

SamGG commented 3 years ago

From my point of view, the factor transformation handles the levels of y and yhat independently, which is incorrect: the two factors can end up with different level sets. Could you check the F1 calculation and my commit https://github.com/paulvanderlaken/ppsr/commit/37c96920883138fcb60cf9ca3afe1f3c7ee469f2?
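To illustrate the kind of mismatch described above, here is a minimal base R sketch. It is not the ppsr code: the vectors `y` and `yhat` are made-up data, and the comparison of integer codes is only one assumed way the problem could surface when a fold's predictions do not contain every class.

```r
# Made-up fold: the model never predicts class "b".
y    <- c("a", "b", "c", "c")
yhat <- c("a", "c", "c", "c")

# Independent conversion: each vector gets its own level set,
# so the same label can map to a different integer code.
y_ind    <- factor(y)      # levels: "a" "b" "c"
yhat_ind <- factor(yhat)   # levels: "a" "c"
acc_ind  <- mean(as.integer(y_ind) == as.integer(yhat_ind))

# Shared conversion: both vectors use the union of observed labels.
lvls        <- sort(unique(c(y, yhat)))
y_shared    <- factor(y, levels = lvls)
yhat_shared <- factor(yhat, levels = lvls)
acc_shared  <- mean(as.integer(y_shared) == as.integer(yhat_shared))

# With shared levels the confusion matrix is square and aligned.
cm <- table(truth = y_shared, predicted = yhat_shared)

acc_ind     # 0.5, misaligned codes
acc_shared  # 0.75, matches mean(y == yhat)
```

With independent levels, "c" in `yhat` gets code 2, which is the code of "b" in `y`, so any metric built on those codes is computed against the wrong classes.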

I think there should be a test case for the F1 calculation.
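A test along these lines could look like the sketch below. The helper `f1_per_class()` is hypothetical and written only for this example; it is not a ppsr function, and the averaging ppsr applies across classes is not assumed here. The expected values were worked out by hand from the true positive, false positive, and false negative counts.

```r
# Hypothetical helper: per-class F1 computed from labels, so the result
# does not depend on how the levels of y and yhat were set up.
f1_per_class <- function(y, yhat) {
  y    <- as.character(y)
  yhat <- as.character(yhat)
  lvls <- sort(unique(c(y, yhat)))
  sapply(lvls, function(k) {
    tp <- sum(y == k & yhat == k)
    fp <- sum(y != k & yhat == k)
    fn <- sum(y == k & yhat != k)
    # F1 = 2TP / (2TP + FP + FN); the denominator is never zero
    # because k occurs in y or yhat.
    2 * tp / (2 * tp + fp + fn)
  })
}

# Made-up data in which class "c" is never predicted.
y    <- c("a", "a", "b", "b", "c")
yhat <- c("a", "b", "b", "b", "a")

# Hand-computed expectations: a = 2/4, b = 4/5, c = 0/1.
expected <- c(a = 0.5, b = 0.8, c = 0)
stopifnot(isTRUE(all.equal(f1_per_class(y, yhat), expected)))

# The result should not change when the inputs are factors
# with different level sets.
stopifnot(isTRUE(all.equal(f1_per_class(factor(y), factor(yhat)), expected)))
```

The second check is the one that targets this issue: it passes factors whose level sets differ and expects the same scores as for the plain character vectors.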

The Titanic dataset is interesting for tracking various combinations of variable types. I have no time to work on it now, but it might be included in the package as a demo file. I think there might be a problem with the TicketID variable, as it has many levels, but I haven't checked how this is handled in the Python code.

Best.

df = read.csv("https://raw.githubusercontent.com/8080labs/ppscore/master/examples/titanic.csv")
dim(df)
head(df)

# Preparation of the Titanic dataset
# - Selecting a subset of columns
# - Renaming the column names to be more clear
# - Changing some data types

df = df[,c("Survived", "Pclass", "Sex", "Age", "Ticket", "Fare", "Embarked")]
colnames(df) = c("Survived", "Class", "Sex", "Age", "TicketID", "TicketPrice", "Port")

sapply(df, class)

df = within(df, {
  Survived = factor(Survived)
  Class = factor(Class)
  Sex = factor(Sex)
  Port = factor(Port)
})
sapply(df, class)
sapply(df, table)
sapply(df, function(x) length(unique(x)))

paulvanderlaken commented 3 years ago

Thanks for pointing this out, Samuel! Your help is greatly appreciated. I'll use your commit in the next update of the package. Please do continue to share and suggest any other improvements you spot!

SamGG commented 3 years ago

Thanks for your feedback. I don't have time right now; I only wanted to try the PPS idea on a dataset. I like this approach and I will come back to it later.