mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0
141 stars 25 forks

Assertion on 'primary_key' failed: Contains duplicated values #646

Open wangbaili opened 2 years ago

wangbaili commented 2 years ago

Hello!

Thank you again for the R implementation of mlr3.

I want to apply the encode and scale PipeOps to a survival model (deepsurv), but I am running into trouble. This is my code:

library(readxl)
library(mlr3)
library(mlr3benchmark)
library(mlr3cluster)
library(mlr3data)
library(mlr3filters)
library(mlr3fselect)
library(mlr3learners)
library(mlr3measures)
library(mlr3pipelines)
library(mlr3proba)
library(mlr3tuningspaces)
library(mlr3viz)
library(mlr3extralearners)
library(tableone)

es5<- read_excel("es5.xlsx")
es5[,7:19]<-lapply(es5[,7:19],function(x)as.factor(as.character(x)))
task5<-TaskSurv$new("task5",es5, time = "time5", event = "status5")
resampling5 <- rsmp("bootstrap", ratio=0.7,repeats=3)

coder=po("encode", method = "treatment", affect_columns = selector_type("factor"))
scaler=po("scale",affect_columns = selector_type("numeric"))
learner_po = po("learner", lrn("surv.deepsurv",
  early_stopping = FALSE, optimizer = "adam",
  dropout = 0.13866, learning_rate = 0.3871, alpha = 0.160,
  num_nodes = rep(169L, 8)))

graph=coder%>>%scaler%>>%learner_po

deepsurv5ln<- as_learner(graph)
design <- benchmark_grid(task5, deepsurv5ln, resampling5)
bm <- benchmark(design)

When I run the benchmark, I get this error:

Error in as_data_backend.data.frame(data, primary_key = row_ids) : 
  Assertion on 'primary_key' failed: Contains duplicated values, position 2.
This happened PipeOp encode's $train()

I don't understand this error.

Thanks again; I look forward to your suggestions.

mb706 commented 2 years ago

Hi, could you provide us with the column names of your es5 dataset?

What would help even more would be a minimal reproducible code example that we can actually run, i.e. including all the data that is being used.

mb706 commented 2 years ago

My assumption is that the "encode" PipeOp creates a column that is named ..row_id, which confuses mlr3 since it is in some way a reserved column name.

wangbaili commented 2 years ago

Sorry for the late reply; I had to deal with some health problems. The data is confidential, but I get the same encode problem with this data: aa1.xlsx. All columns in this data are factors except the event ("status") and time ("time") columns. Thanks again; I look forward to your suggestions.

wangbaili commented 2 years ago

This is the code:

aa <- read_excel("C:/Users/LENOVO/Desktop/aa/aa1.xlsx")
names(aa)

aa[,3:13]<-lapply(aa[,3:13],function(x)as.factor(as.character(x)))
taskwork<-TaskSurv$new("taskwork",aa, time = "time", event = "status")
learners <- lrns(paste0("surv.", c("coxtime", "deephit", "deepsurv", "loghaz", "pchazard")),
  frac = 0.3, early_stopping = TRUE, epochs = 10, optimizer = "adam")
create_pipeops <- function(learner) {
  po("encode",method = "treatment") %>>% po("learner", learner)
}
learners <- lapply(learners, create_pipeops)

resampling <- rsmp("bootstrap", ratio=0.6,repeats=10)
design <- benchmark_grid(taskwork,learners , resampling)
bm <- benchmark(design)

wangbaili commented 2 years ago

This is the error:

Error in as_data_backend.data.frame(data, primary_key = row_ids) : 
  Assertion on 'primary_key' failed: Contains duplicated values, position 2.
This happened PipeOp encode's $train()

mb706 commented 2 years ago

Thanks! Apparently the problem is that bootstrapping uses some rows repeatedly, which somehow breaks with mlr3's assumption that row_ids are unique values.

Minimal example:

library("mlr3")
library("mlr3pipelines")
options(mlr3.debug=TRUE)
resample(tsk("iris"), po("pca") %>>% lrn("classif.featureless"), rsmp("bootstrap"))
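
The repeated rows can be seen directly in the training sets that the bootstrap resampling produces. The following is a minimal sketch, assuming only that mlr3 is installed; the seed and `repeats = 1` are arbitrary choices for illustration and are not taken from the report above.

```r
library("mlr3")

task = tsk("iris")
# One bootstrap replicate is enough to show the effect (arbitrary choice)
resampling = rsmp("bootstrap", repeats = 1)
set.seed(1)
resampling$instantiate(task)

train_ids = resampling$train_set(1)
# Bootstrapping samples with replacement, so some row ids occur more than once
has_duplicates = anyDuplicated(train_ids) > 0
print(has_duplicates)
```

If `has_duplicates` is `TRUE`, the training set contains repeated row ids, which is what the `primary_key` assertion complains about.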

I will try to take care of this soon, until then a workaround would be to use a different resampling method (e.g. rsmp("cv") instead of rsmp("bootstrap")).
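
A sketch of that workaround, reusing the minimal example with cross-validation in place of bootstrapping. The number of folds and the choice of measure are assumptions made for illustration only.

```r
library("mlr3")
library("mlr3pipelines")

# Same graph as in the minimal example, wrapped as a learner
graph_learner = as_learner(po("pca") %>>% lrn("classif.featureless"))

# Cross-validation assigns each row to exactly one test fold and never
# repeats a row within a training set, so row ids stay unique
rr = resample(tsk("iris"), graph_learner, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))
```

The same substitution should apply to the survival pipelines above by passing `rsmp("cv")` to `benchmark_grid()` instead of `rsmp("bootstrap")`, although that has not been checked against the reporter's data.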