(As an aside, 99.95% gc time suggests that there is significant performance left on the table; have you profiled the code to see where the time is being spent and what is allocating so much?)
(PPS: I now see that when Mice is actually imputing it's much worse:

```julia
julia> @time mice(df)
 66.330727 seconds (23.23 M allocations: 2.851 GiB, 67.27% gc time, 31.15% compilation time: <1% of which was recompilation)
```

so profiling those allocations looks more important than I thought.)
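For what it's worth, a minimal sketch of one way to find those allocation sites, assuming Julia 1.8+ (using the same `mice(df)` call as the benchmark above):

```julia
using Profile

# Sample allocations while mice runs; a sample_rate below 1 keeps the
# profiling overhead manageable on allocation-heavy code.
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate = 0.1 mice(df)

# Inspect the recorded allocation sites, e.g. with PProf.jl:
# using PProf
# PProf.Allocs.pprof(Profile.Allocs.fetch())
```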
@nilshg re: what happens when there are no missing data - the `mice()` function doesn't check whether it needs to do anything before initialising the matrices used to store imputations and their means/variances per imputation, so that's probably what the allocations are for. I could add a step where it checks for missing data before initialising these matrices, which would reduce the number of unnecessary allocations.

re: gc time - `mice()` calls `GC.gc()` after every single iteration, and there's an argument `gcSchedule` which allows this to be controlled. I found in testing that this increased performance for large jobs, but in cases like this it will of course make the performance much worse (50 `GC.gc()` calls for no work, which is presumably what results in the 99.95% gc time). I assume (or hope?) that if you were to run `@time mice(df, gcSchedule = 0.0)` the performance would be much better.

Haven't run the profiles yet, so that's just my hunch. Can look into it more another time :D
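For illustration, a rough sketch of the kind of free-memory guard a `gcSchedule`-style argument could drive (this is an assumption about the mechanism, not Mice.jl's actual implementation):

```julia
# Hypothetical helper: force a collection only when the fraction of free
# system memory drops below gcSchedule.
function maybe_gc(gcSchedule::Float64)
    if Sys.free_memory() / Sys.total_memory() < gcSchedule
        GC.gc()
    end
end

# With gcSchedule = 0.0 the threshold can never be crossed, so no manual
# collections are forced and the 50 no-op GC.gc() calls go away.
maybe_gc(0.0)
```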
It seems to me the first line of `mice()` should be something like

```julia
any(isa.(missing, eltype.(eachcol(df)))) || return df  # no column can hold missings, nothing to do
```
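Note that checking the columns' eltypes only tells you whether missings are *possible*; a variant that tests for actual missing values might look like this (a sketch; `has_missings` is a hypothetical helper, not part of either package):

```julia
using DataFrames

# True if any column contains at least one actual missing value.
has_missings(df::DataFrame) = any(col -> any(ismissing, col), eachcol(df))

# Early exit in the same spirit as the one-liner above:
# has_missings(df) || return df
```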
Also, are you saying you are already pre-allocating the necessary containers for the imputations? In that case it's even more unexpected that you are seeing millions of allocations.
I think the millions of allocations are not related to the imputation containers themselves, but rather to the temporary matrices that are created on every iteration. This is mainly because any categorical variables present need to be converted to dummy variables before the linear algebra steps, which can't (easily) be done in place (see the sketch below). In the special case where there are no categorical variables, far fewer allocations would be necessary.
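As a minimal illustration of why the dummy-variable step allocates (`dummy_encode` here is a hypothetical helper, not Mice.jl's internal code):

```julia
# Converting a categorical column into indicator columns changes both the
# shape and the eltype of the data, so a fresh matrix has to be allocated
# on every call; the original column's storage can't be reused.
function dummy_encode(x::AbstractVector)
    levs = unique(skipmissing(x))
    X = zeros(length(x), length(levs))  # new allocation on every iteration
    for (j, lev) in enumerate(levs), i in eachindex(x)
        X[i, j] = isequal(x[i], lev) ? 1.0 : 0.0
    end
    return X
end
```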
WRT the allocations when doing nothing, I have now reduced the amount of work that `mice()` does when there is nothing to impute (coming soon). As such:
```julia
julia> @time mice(data)
  0.001539 seconds (538 allocations: 173.953 KiB)
Mids(418×20 DataFrame
 Row │ ID     N_Days  Status   Drug      Age    Sex      Ascites  Hepatomegaly  Spiders  Edema    Bilirubin  Cholesterol  Albumin  Copper   Alk_ ⋯
     │ Int64  Int64   String3  String15  Int64  String1  String3  String3       String3  String1  Float64    String7      Float64  String3  Stri ⋯
```
(`mice()` will now also throw an error when there is nothing to impute, but I disabled that for testing purposes.)
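A sketch of what that guard might look like (illustrative only, not the actual Mice.jl source):

```julia
# Hypothetical early check at the top of mice(): bail out when no column
# contains a missing value.
if !any(col -> any(ismissing, col), eachcol(data))
    throw(ArgumentError("data contains no missing values; nothing to impute"))
end
```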
I just tried to recreate the benchmark results and downloaded the cirrhosis data. I didn't clock that it was an R data set where missing data is `NA`, so I ended up with a data set without any missing data. Still, at the end of the output it looks like it realises there isn't anything missing, so why does it run for 47 seconds and allocate 11k times?