tom-metherell / Mice.jl

a package for missing data handling via multiple imputation by chained equations in Julia. It is heavily based on the R package {mice} by Stef van Buuren, Karin Groothuis-Oudshoorn and collaborators.
https://tom-metherell.github.io/Mice.jl/
GNU General Public License v2.0

What is Mice doing when there is no missing data? #8

Closed nilshg closed 10 months ago

nilshg commented 11 months ago

I just tried to recreate the benchmark results and downloaded the cirrhosis data. I didn't clock that it was an R data set where missing data is coded as NA, so I ended up with a data set without any missing data. Still:

julia> @time mice(df)
 46.910824 seconds (11.11 k allocations: 2.727 MiB, 99.95% gc time)
Mids(418×20 DataFrame
 Row │ ID     N_Days  Status   Drug             Age    Sex      Ascites  Hepatomegaly  Spiders  Edema    Bilirubin  Cholesterol  Albumin  Copper   Alk_Phos  SGOT     Tryglicerides  Platelets  Prothrombin  Stage
     │ Int64  Int64   String3  String15

at the end of the output I see:

"Iteration 1, variable N_Days: imputation skipped - no missing data.", "Iteration 1, variable Status: imputation skipped - no missing data.", "Iteration 1, variable Drug: imputation skipped - no missing data.", "Iteration 1, variable Age: imputation skipped - no missing data.", "Iteration 1, variable Sex: imputation skipped - no missing data.", "Iteration 1, variable Ascites: imputation skipped - no missing data.", "Iteration 1, variable Hepatomegaly: imputation skipped - no missing data.", "Iteration 1, variable Spiders: imputation skipped - no missing data.", "Iteration 1, variable Edema: imputation skipped - no missing data."  …  "Iteration 10, variable Bilirubin: imputation skipped - no missing data.", "Iteration 10, variable Cholesterol: imputation skipped - no missing data.", "Iteration 10, variable Albumin: imputation skipped - no missing data.", "Iteration 10, variable Copper: imputation skipped - no missing data.", "Iteration 10, variable Alk_Phos: imputation skipped - no missing data.", "Iteration 10, variable SGOT: imputation skipped - no missing data.", "Iteration 10, variable Tryglicerides: imputation skipped - no missing data.", "Iteration 10, variable Platelets: imputation skipped - no missing data.", "Iteration 10, variable Prothrombin: imputation skipped - no missing data.", "Iteration 10, variable Stage: imputation skipped - no missing data."])

So it looks like it realises there isn't anything missing, but why does it run for 47 seconds and allocate 11k times?

nilshg commented 11 months ago

(As an aside, 99.95% gc time suggests that there is significant performance left on the table. Have you profiled the code to see where the time is being spent and what is allocating so much?)
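As a rough illustration of how those numbers can be collected programmatically, here is a minimal sketch using Base's `@timed` macro, which returns the elapsed time, bytes allocated and GC time as fields. `allocating_workload` is a hypothetical stand-in for a call such as `mice(df)` and is not part of Mice.jl; the warm-up call is there so that compilation is not included in the measurement.

```julia
# Hypothetical stand-in for an allocation-heavy call such as mice(df):
# it builds a fresh temporary matrix on every iteration.
function allocating_workload(n_iter::Int, n::Int)
    total = 0.0
    for _ in 1:n_iter
        tmp = rand(n, n)        # new heap allocation on each pass
        total += sum(tmp)
    end
    return total
end

allocating_workload(1, 10)      # warm-up call so compilation is not measured

stats = @timed allocating_workload(100, 200)
println("elapsed:   ", stats.time, " s")
println("allocated: ", stats.bytes, " bytes")
println("GC time:   ", stats.gctime, " s")
```

A high ratio of `stats.gctime` to `stats.time` points at allocation pressure; the `Profile` standard library can then help locate which lines are responsible.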

nilshg commented 11 months ago

(PPS: I now see that when Mice is actually imputing, it's much worse:

julia> @time mice(df)
 66.330727 seconds (23.23 M allocations: 2.851 GiB, 67.27% gc time, 31.15% compilation time: <1% of which was recompilation)

so profiling those allocations looks more important than I thought.)

tom-metherell commented 11 months ago

@nilshg re: what happens when there are no missing data -

Haven't run the profiles yet so that's just my hunch. Can look into it more another time :D

nilshg commented 11 months ago

It seems to me the first line of mice() should be something like

any(isa.(missing, eltype.(eachcol(df)))) || return df

Also are you saying you are pre-allocating the necessary containers for the imputations already? In that case it's even more unexpected that you are seeing millions of allocations.
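An early-exit guard of this kind can be written at the type level (does any column's element type permit `missing`?) or at the value level (does any column actually contain a `missing`?). The sketch below contrasts the two. It uses a NamedTuple of vectors as a hypothetical stand-in for a DataFrame, and the helper names `allows_missing` and `contains_missing` are made up for illustration; they are not part of Mice.jl.

```julia
# Columns as NamedTuples of vectors: a hypothetical stand-in for a DataFrame.
cols_complete = (age = [34, 51, 47], sex = ["F", "M", "F"])
cols_typed    = (age = Union{Missing, Int}[34, 51, 47], sex = ["F", "M", "F"])
cols_missing  = (age = [34, missing, 47], sex = ["F", "M", "F"])

# Type-level check: does any column's element type permit `missing`?
allows_missing(cols) = any(col -> Missing <: eltype(col), values(cols))

# Value-level check: does any column actually contain a `missing`?
contains_missing(cols) = any(col -> any(ismissing, col), values(cols))

for (name, cols) in [("complete", cols_complete),
                     ("typed", cols_typed),
                     ("missing", cols_missing)]
    println(name, ": allows = ", allows_missing(cols),
            ", contains = ", contains_missing(cols))
end
```

The type-level check is cheap but can report `true` for a column that happens to hold no missing values (the `cols_typed` case), so a guard that should skip all work when there is nothing to impute would need the value-level check.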

tom-metherell commented 11 months ago

I think the millions of allocations are not related to the imputation containers themselves but rather to the temporary matrices that are created on every iteration. The current procedure is:

This is mainly because any categorical variables present need to be converted to dummy variables before the linear algebra steps, which can't (easily) be done in place. In the special case where there are no categorical variables, far fewer allocations would be necessary.
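For illustration only, one possible way to cut down on such temporaries is to allocate the dummy-variable buffer once and refill it on each iteration. The sketch below assumes the levels of the categorical variable are known in advance and uses treatment coding with the first level as the reference. `fill_dummies!` is a hypothetical helper and does not reflect how Mice.jl currently builds its design matrices.

```julia
# Fill `out` (n × (k - 1)) with treatment-coded dummies for `x`, using
# `levels[1]` as the reference level. The buffer is overwritten in place,
# so no new matrix is allocated per call.
function fill_dummies!(out::AbstractMatrix{Float64}, x::AbstractVector,
                       levels::AbstractVector)
    size(out) == (length(x), length(levels) - 1) ||
        throw(DimensionMismatch("buffer has the wrong shape"))
    fill!(out, 0.0)
    for (i, xi) in enumerate(x)
        j = findfirst(==(xi), levels)
        j === nothing && throw(ArgumentError("unknown level: $xi"))
        j > 1 && (out[i, j - 1] = 1.0)   # the reference level stays all zeros
    end
    return out
end

edema  = ["N", "S", "Y", "N"]
levels = ["N", "S", "Y"]

# Allocate once, outside the iteration loop, then reuse.
buffer = Matrix{Float64}(undef, length(edema), length(levels) - 1)
fill_dummies!(buffer, edema, levels)
```

Whether this helps in practice depends on how the buffer is consumed downstream; if the linear algebra steps copy the matrix anyway, the saving would be smaller.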

tom-metherell commented 11 months ago

Regarding the allocations when there is nothing to impute, I have now reduced the amount of work that mice() does in that case (coming soon). As a result:

julia> @time mice(data)
  0.001539 seconds (538 allocations: 173.953 KiB)
Mids(418×20 DataFrame
 Row │ ID     N_Days  Status   Drug             Age    Sex      Ascites  Hepatomegaly  Spiders  Edema    Bilirubin  Cholesterol  Albumin  Copper   Alk_ ⋯
     │ Int64  Int64   String3  String15         Int64  String1  String3  String3       String3  String1  Float64    String7      Float64  String3  Stri ⋯

(mice() will now also throw an error when there is nothing to impute, but I disabled that for testing purposes)