Closed flystar233 closed 2 months ago
Good question!
During the first iteration, the variables still have missing values and can't be used as covariates in ranger().
Let me check my understanding. If the original data is:

| Sepal.Length | Sepal.Width |
|---|---|
| 1 | NA |
| NA | 1 |
| 1 | 2 |
| NA | NA |
If Sepal.Length was not pre-filled, then in this iteration (j: 1 completed(impute_by): Sepal.Length to_impute: Sepal.Width):
y <- data[[v]][!v.na] = c(1,2)
x = data[!v.na, completed, drop = FALSE] = c(NA,1)
So x cannot be accepted by ranger. Is that right?
Yes, exactly. We even need to predict on the other rows (those with missing y in your example). There, we also need a complete x. So we need complete feature values for all rows, actually. Ideally, the next iterations would correct for a suboptimal first iteration.
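The toy example above can be checked in base R. This is only a minimal sketch of the situation discussed, not missRanger's actual code; the names `x_fit` and `x_pred` are illustrative, and the ranger call itself is left out so that the snippet runs without extra packages.

```r
# Toy data from the example above
data <- data.frame(
  Sepal.Length = c(1, NA, 1, NA),
  Sepal.Width  = c(NA, 1, 2, NA)
)

v         <- "Sepal.Width"    # variable to impute
completed <- "Sepal.Length"   # assumed NOT pre-filled, so it still has NAs

v.na <- is.na(data[[v]])

# Rows used for fitting: response observed
y     <- data[[v]][!v.na]
x_fit <- data[!v.na, completed, drop = FALSE]

# Rows used for prediction: response missing
x_pred <- data[v.na, completed, drop = FALSE]

print(y)
print(x_fit)
print(x_pred)

# Both feature sets still contain missing values, which is the problem
# described above for fitting and for predicting.
anyNA(x_fit)
anyNA(x_pred)
```

Both the fitting rows and the prediction rows need complete features, which is why a variable can only serve as a covariate once its own missing values have been filled.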
@mayer79
I found the following during the imputation process:
completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length to_impute: Sepal.Length
It is quite puzzling to me that the x parameter to ranger clearly includes the y parameter.
Shouldn't a normal fit be something like: Sepal.Length ~ Sepal.Width + Species + Petal.Width + Petal.Length?
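As a small illustration of the expected model, the formula can be built in base R so that the response never appears among the features. This is a sketch only; `all_vars` and `features` are illustrative names, and it says nothing about how missRanger constructs its call to ranger internally.

```r
# Column names of the built-in iris data, in the order used in this thread
all_vars <- c("Sepal.Length", "Sepal.Width", "Species", "Petal.Width", "Petal.Length")

v <- "Sepal.Length"                 # variable to impute (the response)
features <- setdiff(all_vars, v)    # every other variable, response excluded

# reformulate() from the stats package builds the formula from character vectors
fml <- reformulate(features, response = v)
print(fml)
```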
Good catch! This does not seem right indeed.
I have fixed this now in #78. It would be fantastic if you could install the GitHub version and check whether the logic makes more sense now: remotes::install_github("mayer79/missRanger").
Yes, the logic makes more sense now, and the intermediate result is correct:
j: 1 completed(impute_by): to_impute: Sepal.Length
j: 1 completed(impute_by): Sepal.Length to_impute: Sepal.Width
j: 1 completed(impute_by): Sepal.Length Sepal.Width to_impute: Species
j: 1 completed(impute_by): Sepal.Length Sepal.Width Species to_impute: Petal.Width
j: 1 completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width to_impute: Petal.Length
j: 2 completed(impute_by): Sepal.Width Species Petal.Width Petal.Length to_impute: Sepal.Length
j: 2 completed(impute_by): Sepal.Length Species Petal.Width Petal.Length to_impute: Sepal.Width
j: 2 completed(impute_by): Sepal.Length Sepal.Width Petal.Width Petal.Length to_impute: Species
j: 2 completed(impute_by): Sepal.Length Sepal.Width Species Petal.Length to_impute: Petal.Width
j: 2 completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width to_impute: Petal.Length
j: 3 completed(impute_by): Sepal.Width Species Petal.Width Petal.Length to_impute: Sepal.Length
j: 3 completed(impute_by): Sepal.Length Species Petal.Width Petal.Length to_impute: Sepal.Width
j: 3 completed(impute_by): Sepal.Length Sepal.Width Petal.Width Petal.Length to_impute: Species
j: 3 completed(impute_by): Sepal.Length Sepal.Width Species Petal.Length to_impute: Petal.Width
j: 3 completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width to_impute: Petal.Length
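The order shown in this log can be reproduced with a short base-R sketch. It assumes that every variable starts with missing values (as in the run above) and it leaves the model fit out entirely; `trace_chaining` is a hypothetical helper written for illustration, not a function from missRanger.

```r
# Hypothetical helper: records which variables would serve as features
# for each variable in each iteration. No model is fitted here.
trace_chaining <- function(vars, n_iter) {
  completed <- character(0)   # variables without missing values so far
  steps <- list()
  for (j in seq_len(n_iter)) {
    for (v in vars) {
      # Use only completed variables, and never the response itself
      impute_by <- setdiff(completed, v)
      steps[[length(steps) + 1]] <- list(j = j, to_impute = v, impute_by = impute_by)
      # (a real implementation would fit v on impute_by and fill in v here)
      completed <- union(completed, v)
    }
  }
  steps
}

vars <- c("Sepal.Length", "Sepal.Width", "Species", "Petal.Width", "Petal.Length")
steps <- trace_chaining(vars, n_iter = 3)

for (s in steps) {
  cat("j:", s$j, "completed(impute_by):", s$impute_by, "to_impute:", s$to_impute, "\n")
}
```

In the first iteration the feature set grows one variable at a time. From the second iteration onward every variable is imputed from all the others.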
Hi Michael, thanks for the great imputation package. I am learning the specific algorithm for imputing data using random forests.
First, I perform normal data imputation.
Then I added a line of debug code here:
The result is:
So I would like to know: why does this code keep building up the completed vector as impute_by during the first loop, instead of directly using all variables?