iskandr / fancyimpute

Multivariate imputation and matrix completion algorithms implemented in Python
Apache License 2.0

Approximate mode for MICE is only slightly faster #4

Closed · iskandr closed this issue 8 years ago

iskandr commented 8 years ago

@sergeyf I tried running MICE on a random 1000x200 matrix with ~50% of entries missing, for 10 imputation rounds (and 10 burn-in rounds). In approximate mode it took about 33s on my MacBook, while full mode took ~50s. I suspect we should be able to make approximate mode a lot faster than that!
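
For reference, the setup was roughly the sketch below. This is not the exact benchmark script, and the MICE keyword arguments are assumptions inferred from the description above, not a guaranteed match for the current API:

    import time

    import numpy as np

    from fancyimpute import MICE

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 200)

    # Knock out ~50% of the entries; fancyimpute marks missing values with NaN.
    X_incomplete = X.copy()
    X_incomplete[rng.rand(*X.shape) < 0.5] = np.nan

    # The keyword names below are assumptions inferred from this thread.
    start = time.time()
    X_filled = MICE(n_imputations=10, n_burn_in=10).complete(X_incomplete)
    print("MICE took %.1fs" % (time.time() - start))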

Running again under a profiler, here are the top culprits:

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    44001   10.532    0.000   10.532    0.000  {built-in method dot}
       20    5.201    0.260   21.036    1.052  MICE.py:178(perform_imputation_round)
     4000    1.340    0.000    1.843    0.000  MICE.py:111(_sub_inverse_covariance)
     4000    0.980    0.000    5.503    0.001  bayesian_ridge_regression.py:72(predict_dist)
    16604    0.589    0.000    0.589    0.000  {method 'reduce' of 'numpy.ufunc' objects}
     4000    0.390    0.000    2.791    0.001  MICE.py:137(_update_inverse_covariance)
     4000    0.328    0.000    0.343    0.000  numeric.py:1003(outer)
     4000    0.232    0.000    4.790    0.001  bayesian_ridge_regression.py:23(fit)
    12000    0.201    0.000    0.537    0.000  index_tricks.py:28(ix_)
     4000    0.194    0.000    0.273    0.000  {method 'normal' of 'mtrand.RandomState' objects}
    48009    0.173    0.000    0.173    0.000  {built-in method array}
     8000    0.150    0.000    0.150    0.000  {method 'nonzero' of 'numpy.ndarray' objects}
     4001    0.118    0.000    0.258    0.000  linalg.py:458(inv)

(Unsurprisingly, we're spending most of our time doing matrix multiplications.)
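
A table like the one above comes straight out of the standard library's cProfile; a minimal sketch of how to produce one, where run_benchmark is a placeholder name for the imputation run:

    import cProfile
    import pstats

    # Profile an arbitrary callable and print the 15 most expensive
    # functions, sorted by total time spent in each function itself.
    cProfile.run("run_benchmark()", "mice.prof")
    pstats.Stats("mice.prof").sort_stats("tottime").print_stats(15)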

sergeyf commented 8 years ago

Yeah, I know =(

np.linalg.inv is really fast: a lot faster than SVD and the various other tricks I tried. So this approximate mode is bottlenecked by a crapload of matrix multiplies that, despite each being O(d^2), are in aggregate still not much faster than a single highly optimized O(d^3) inverse. Do you have any ideas? I'm pretty much out of steam.
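
A quick self-contained way to see this trade-off (not code from the repo): compare one full d x d inverse against d separate O(d^2) matrix-vector products.

    import timeit

    import numpy as np

    d = 200
    A = np.random.randn(d, d)
    v = np.random.randn(d)

    # One highly optimized O(d^3) inverse...
    t_inv = timeit.timeit(lambda: np.linalg.inv(A), number=100)

    # ...versus d separate O(d^2) matrix-vector products.
    t_dots = timeit.timeit(lambda: [A.dot(v) for _ in range(d)], number=100)

    print("inv: %.3fs   %d mat-vec products: %.3fs" % (t_inv, d, t_dots))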

iskandr commented 8 years ago

Can you make it even more "approximate" by performing blocks of multivariate regressions? Split the columns into random groups of size n_cols/10 and fill each group in simultaneously?
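
A minimal sketch of that grouping idea (the group count of 10 comes from the suggestion above; the helper name and everything else are made up):

    import numpy as np

    def random_column_blocks(n_cols, n_groups=10, seed=0):
        """Split column indices into n_groups random, roughly equal blocks."""
        shuffled = np.random.RandomState(seed).permutation(n_cols)
        return np.array_split(shuffled, n_groups)

    # One multivariate regression per block per round, instead of one
    # regression per column:
    for block in random_column_blocks(n_cols=200):
        pass  # impute every column in `block` simultaneously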

sergeyf commented 8 years ago

Yeah, I was thinking of something similar: fill in each column using only its nearest-neighbor columns. You'd only have to figure out the nearest-neighbor columns once, at the beginning, and then pull those in.

The question is how to compute the nearest neighbors. Just take the most correlated columns?

This would probably make a lot of sense for the Olivetti dataset, since uncorrelated columns (a pixel from the face and a pixel from the background) indeed have little mutual information.
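
Something like the sketch below, assuming plain Pearson correlation (the helper name is made up):

    import numpy as np

    def nearest_columns(X_filled, k=10):
        """Indices of the k most correlated other columns, per column."""
        # Absolute correlation here; a sign-sensitive variant would just
        # drop the np.abs.
        C = np.abs(np.corrcoef(X_filled, rowvar=False))
        np.fill_diagonal(C, -np.inf)  # a column is not its own neighbor
        return np.argsort(-C, axis=1)[:, :k]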

Do I understand you correctly that this idea is similar to your last comment? Also, what do you mean by "simultaneously"? In parallel on multiple cores?

sergeyf commented 8 years ago

Also, you refactored things so that S and S_inv are no longer attributes of the class, but the method _update_inverse_covariance has no return statement. Is that a bug?

iskandr commented 8 years ago

I was thinking of updating random groups of columns (without trying to find correlated blocks).

Also, doesn't _update_inverse_covariance perform the updates in-place on the elements of S and S_inv? If those changes aren't propagated then it's a bug.

sergeyf commented 8 years ago

Maybe my understanding of scope isn't so good. If I pass S and S_inv to a function that updates them but doesn't return them, are they updated outside the function as well? That's why I had them as self.S and self.S_inv before: so they were updated globally for the entire object.
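
For reference, the answer hinges on in-place mutation versus rebinding; a minimal standalone sketch (toy names):

    import numpy as np

    def update_in_place(S):
        # Mutates the caller's array: element assignment writes into the
        # same underlying buffer, so no return is needed.
        S[0, 0] += 1.0

    def rebind_locally(S):
        # Rebinds the *local* name S to a new array; the caller's array
        # is untouched, so this change is lost without a return.
        S = S + 1.0

    S = np.zeros((2, 2))
    update_in_place(S)   # S[0, 0] is now 1.0
    rebind_locally(S)    # S is unchanged
    print(S[0, 0])       # prints 1.0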

I can try both randomly chosen columns and nearest (most correlated) columns next week!

sergeyf commented 8 years ago

OK, I removed approximate mode and added nearest-column mode, so this issue no longer makes sense =) Stay tuned for the push.

iskandr commented 8 years ago

For "nearest columns" are you using the magnitude of the correlation between columns or does it have to be positive?
