mayer79 / missRanger

Fast multivariate imputation by random forests.
https://mayer79.github.io/missRanger/
GNU General Public License v2.0

Question on missRanger code #77

Closed flystar233 closed 2 months ago

flystar233 commented 2 months ago

Hi Michael, thanks for the great imputation package. I am studying the specific algorithm it uses to impute data with random forests.

First, I perform normal data imputation.

    library(missRanger)

    iris2 <- generateNA(iris, p = c(0.1, 0.1, 0.8, 0.2, 0.1), seed = 2024)
    imp <- missRanger(iris2, verbose = 0, seed = 1L, num.trees = 20, returnOOB = TRUE, data_only = FALSE)

and I added a line of debug code here:

    for (v in to_impute) {
      v.na <- data_NA[, v]
      cat("j:",j,"\t","completed(impute_by):",completed,"\t","to_impute:",v,"\n")
      if (length(completed) == 0L) {
        data[[v]] <- imputeUnivariate(data[[v]])

The result is:

j: 1     completed(impute_by):           to_impute: Sepal.Length 
j: 1     completed(impute_by): Sepal.Length      to_impute: Sepal.Width 
j: 1     completed(impute_by): Sepal.Length Sepal.Width          to_impute: Species 
j: 1     completed(impute_by): Sepal.Length Sepal.Width Species          to_impute: Petal.Width 
j: 1     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width      to_impute: Petal.Length 
j: 2     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Sepal.Length 
j: 2     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Sepal.Width 
j: 2     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Species 
j: 2     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Petal.Width 
j: 2     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Petal.Length 
j: 3     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Sepal.Length 
j: 3     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Sepal.Width 
j: 3     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Species 
j: 3     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Petal.Width 
j: 3     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Petal.Length

So, I would like to know why this code keeps extending the completed vector (the variables used as impute_by) step by step during the first iteration, instead of using all variables right away.

      if (j == 1L && (v %in% impute_by)) {
        completed <- union(completed, v)
      }

mayer79 commented 2 months ago

Good question!

During the first iteration, variables that have not been imputed yet still contain missing values and therefore can't be used as covariates in ranger().

flystar233 commented 2 months ago

Let me check my understanding. Suppose the original data is:

    Sepal.Length  Sepal.Width
    1             NA
    NA            1
    1             2
    NA            NA

If Sepal.Length were not pre-filled, then in the step "j: 1 completed(impute_by): Sepal.Length to_impute: Sepal.Width" we would get:

    y <- data[[v]][!v.na]                        # c(1, 2)
    x <- data[!v.na, completed, drop = FALSE]    # c(NA, 1)

So x would not be accepted by ranger. Is that right?
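
The subsetting in this toy example can be checked in base R. The following is only a sketch of the situation described above: the data frame is the four-row example from this comment, and the assumption that Sepal.Length is used as a predictor without being pre-filled is hypothetical.

```r
# Toy data from the comment above: both columns contain missing values.
data <- data.frame(
  Sepal.Length = c(1, NA, 1, NA),
  Sepal.Width  = c(NA, 1, 2, NA)
)

v <- "Sepal.Width"           # variable to impute
completed <- "Sepal.Length"  # predictor, hypothetically NOT pre-filled
v.na <- is.na(data[[v]])     # rows where the response is missing

# Training part: rows where the response is observed
y <- data[[v]][!v.na]
x <- data[!v.na, completed, drop = FALSE]

# Prediction part: rows where the response is missing
x_new <- data[v.na, completed, drop = FALSE]

print(y)             # 1 2
print(x)             # Sepal.Length is NA, 1 (rows 2 and 3)
print(anyNA(x))      # TRUE
print(anyNA(x_new))  # TRUE
```

Both the training features and the features of the rows to be predicted contain NA here, which is why the predictor has to be filled before it is used.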

mayer79 commented 2 months ago

Yes, exactly. We even need to predict on the other rows (those with missing y in your example). There, we also need a complete x. So we need complete feature values for all rows, actually. Ideally, the next iterations would correct for a suboptimal first iteration.
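
To illustrate the idea, here is a minimal sketch of such a chained scheme in base R. It is not the missRanger implementation: the toy data, the helper fill_by_sampling(), and the use of lm() as a stand-in for ranger() are all assumptions made to keep the example free of dependencies.

```r
set.seed(1)

# Hypothetical toy data with missing values in both columns
n <- 50
a <- rnorm(n)
b <- 2 * a + rnorm(n, sd = 0.1)
data <- data.frame(a = a, b = b)
data$a[1:5] <- NA
data$b[4:10] <- NA

data_NA <- is.na(data)     # remember where the original gaps were
to_impute <- c("a", "b")
completed <- character(0)  # variables that are complete so far

# Hypothetical univariate fill: sample from the observed values
fill_by_sampling <- function(z) {
  miss <- is.na(z)
  z[miss] <- sample(z[!miss], sum(miss), replace = TRUE)
  z
}

for (j in 1:3) {
  for (v in to_impute) {
    v.na <- data_NA[, v]
    predictors <- setdiff(completed, v)  # never predict v from itself
    if (length(predictors) == 0L) {
      # No complete predictor available yet: fill univariately
      data[[v]] <- fill_by_sampling(data[[v]])
    } else {
      # lm() is only a stand-in for ranger() in this sketch
      train <- data[!v.na, c(v, predictors), drop = FALSE]
      fit <- lm(reformulate(predictors, response = v), data = train)
      newdata <- data[v.na, predictors, drop = FALSE]
      data[[v]][v.na] <- predict(fit, newdata = newdata)
    }
    # After the first pass, every variable is complete
    if (j == 1L) completed <- union(completed, v)
  }
}

print(anyNA(data))  # FALSE
```

In the first pass, a is filled by sampling and b is then predicted from the filled a. From the second pass on, a is re-imputed from b, which is how later iterations can improve on the crude first fill.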

flystar233 commented 2 months ago

Another question:

@mayer79 I noticed this line in the debug output during imputation:

j: 2     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width Petal.Length         to_impute: Sepal.Length

It puzzles me that the x argument passed to ranger includes the response y itself. Shouldn't a normal fit look something like Sepal.Length ~ Sepal.Width + Species + Petal.Width + Petal.Length?

mayer79 commented 2 months ago

Good catch! This does not seem right indeed.

mayer79 commented 2 months ago

I have fixed this now in #78. It would be fantastic if you could install the GitHub version and check whether the logic makes more sense now:

    remotes::install_github("mayer79/missRanger")

flystar233 commented 2 months ago

Yes, the logic makes more sense now, and the intermediate output is correct:

j: 1     completed(impute_by):           to_impute: Sepal.Length 
j: 1     completed(impute_by): Sepal.Length      to_impute: Sepal.Width
j: 1     completed(impute_by): Sepal.Length Sepal.Width          to_impute: Species
j: 1     completed(impute_by): Sepal.Length Sepal.Width Species          to_impute: Petal.Width
j: 1     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width      to_impute: Petal.Length 
j: 2     completed(impute_by): Sepal.Width Species Petal.Width Petal.Length      to_impute: Sepal.Length
j: 2     completed(impute_by): Sepal.Length Species Petal.Width Petal.Length     to_impute: Sepal.Width
j: 2     completed(impute_by): Sepal.Length Sepal.Width Petal.Width Petal.Length         to_impute: Species 
j: 2     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Length     to_impute: Petal.Width
j: 2     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width      to_impute: Petal.Length
j: 3     completed(impute_by): Sepal.Width Species Petal.Width Petal.Length      to_impute: Sepal.Length
j: 3     completed(impute_by): Sepal.Length Species Petal.Width Petal.Length     to_impute: Sepal.Width 
j: 3     completed(impute_by): Sepal.Length Sepal.Width Petal.Width Petal.Length         to_impute: Species
j: 3     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Length     to_impute: Petal.Width
j: 3     completed(impute_by): Sepal.Length Sepal.Width Species Petal.Width      to_impute: Petal.Length
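
The predictor sets in this log can be reproduced with a small base R sketch. It assumes that the response is simply dropped from the completed set via setdiff(). Whether this matches the actual change in #78 is an assumption, and the names predictors and trace_lines are made up for illustration.

```r
to_impute <- c("Sepal.Length", "Sepal.Width", "Species", "Petal.Width", "Petal.Length")
impute_by <- to_impute
completed <- character(0)
trace_lines <- character(0)  # hypothetical container for the log lines

for (j in 1:2) {
  for (v in to_impute) {
    # Assumed fix: never use the response itself as a predictor
    predictors <- setdiff(completed, v)
    trace_lines <- c(
      trace_lines,
      sprintf(
        "j: %d | predictors: %s | to_impute: %s",
        j, paste(predictors, collapse = " "), v
      )
    )
    # Same bookkeeping as in the snippet quoted earlier in this thread
    if (j == 1L && (v %in% impute_by)) {
      completed <- union(completed, v)
    }
  }
}

cat(trace_lines, sep = "\n")
```

From the second iteration on, each variable is predicted from the four other variables, in the same order as in the log above.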