Open okrebs opened 5 years ago
Hi,
For large arrays, did you try ObtainModelEstimates in the package? It relies on RSolnp and might be faster.
I do agree that the core of the algorithm can be parallelized. Let me know how you progress on that.
I am actually looking at doing that using GPUs.
Cheers!
Hi,
thanks for your answer, and sorry for the late reply; I had a lot to work on.
Regarding ObtainModelEstimate:
Regarding parallel/Rcpp implementation: I have replicated the Ipfp algorithm using Rcpp. It runs about 4 to 10 times faster even without parallelization. The code can be found in my branch Ipfp-parallel. It passes all the checks in the attached R file (where I built the branched version, setting the package name to mipfpC in DESCRIPTION and adapting the dynlib call in NAMESPACE accordingly), i.e. the results satisfy all.equal() compared to the original Ipfp function. I still want to run a few more tests, though.
This code should be relatively easy to parallelize now, as it is just a giant for-loop through the whole array. Unfortunately I know nothing so far about GPU programming, so I was intending to just use OpenMP to run through the loop in parallel.
Regards!
Attachment: test_and_benchmark.txt (benchmarks: bm_large, bm_hdim, bm)
Please ignore the above test file, which was faulty (even though the actual algorithm did pass the tests). I have now created a pull request that fixes this issue altogether.
I benchmarked this package to be faster than my 2D Rcpp IPF implementation for 10x10 to 100x100 matrices. However, when I added mipfp to a project it ran much slower for some reason, so here is my 2D Rcpp version in case it is faster for some users.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix ipf_cpp(NumericMatrix M, NumericVector m1, NumericVector m2, double tol = 1e-16){
  NumericVector r1, r2;
  int i, k;
  for (k = 0; k < 1000; k++){  // cap at 1000 sweeps
    // scale rows to match the target row sums m1
    r1 = m1 / rowSums(M);
    for (i = 0; i < M.nrow(); i++){ M(i,_) = M(i,_) * r1(i); }
    // scale columns to match the target column sums m2
    r2 = m2 / colSums(M);
    for (i = 0; i < M.ncol(); i++){ M(_,i) = M(_,i) * r2(i); }
    // stop once both adjustment factors are within tol of 1
    if (is_true(all( abs(r1-1) < tol & abs(r2-1) < tol))){ break; }
  }
  return M;
}
Hi,
I am regularly applying Ipfp to relatively large arrays (4 dimensions totalling tens of millions of values) with several constraints. Currently the Ipfp function is very slow on such large problems (solved overnight on a relatively recent i7).
A large speed gain could come from parallelizing the central algorithm (in each iteration, all values of a single margin can be computed in parallel). Currently, however, by far the largest time cost is the "aperm" call made on the large array by sweep when applying the scaling factor for a specific target margin. After aperm the scaling factor is applied in a vectorized manner and is therefore very fast; little additional gain comes from parallelizing that step, and it seems unreasonable to attempt to parallelize the permutation step itself.
I therefore aim to rewrite part of the core algorithm of the Ipfp function in Rcpp, using for loops that iterate over the appropriate indices for each margin. This avoids the need for sweep and its call to aperm, and makes it easy to parallelize the algorithm for larger problems (passing something like parallel = TRUE or n.cores = X as a parameter).
Any thoughts on this or other potential speed-ups of the Ipfp function are very welcome.