jojo- / mipfp

Multidimensional Iterative Proportional Fitting and Alternative Models
GNU General Public License v2.0
21 stars 8 forks source link

Fast ipfp #6

Open okrebs opened 4 years ago

okrebs commented 4 years ago

Rcpp implementation including parallelization of the Ipfp algorithm. This greatly increases speed for large problems and fixes #4 and #5.

Apart from this the obtained (multidimensional) matrices are equal to the previous implementation up to machine (double) precision.

ellisp commented 3 years ago

Is this pull request under active consideration? Because it looks like it would be extremely useful.

ellisp commented 3 years ago

I am trialling this branch now. It is indeed much faster. I noticed that code that worked under the current CRAN version breaks under this version with a couple of small things:

These aren't big deals but do count as breaking changes that would need to be managed if this is merged in.

okrebs commented 3 years ago

Regarding your first point what is actually expected is an array (as 'seed' can be multidimensional, matrix should work as it has subclass 'array'). There probably should be a warning if a "data.frame" or similar is passed. However, I can not replicate your problem with my branch as passing a data.frame works for me:

library(mipfp)
Lade nötiges Paket: cmm
Lade nötiges Paket: Rsolnp
Lade nötiges Paket: numDeriv
mymat <- matrix(1, nrow = 2, ncol = 2)
mydat <- data.frame(a = c(1,1), b = c(1,1))
Ipfp(seed = mymat, target.list = list(1), target.data = list(c(3,3)))

Call:
Ipfp(seed = mymat, target.list = list(1), target.data = list(c(3, 
    3)))

Method:  ipfp - convergence:  TRUE 

Estimates:
     [,1] [,2]
[1,]  1.5  1.5
[2,]  1.5  1.5

Ipfp(seed = mydat, target.list = list(1), target.data = list(c(3,3)))

Call:
Ipfp(seed = mydat, target.list = list(1), target.data = list(c(3, 
    3)))

Method:  ipfp - convergence:  TRUE 

Estimates:
    a   b
1 1.5 1.5
2 1.5 1.5

Can you provide a MWE?

The same for your second point.

jojo- commented 3 years ago

Hi Okrebs,

Happy to merge your work into this repo. Thanks for your good work! And sorry for the extremely long delay...

Before doing the merge,

Cheers!

Johan

ellisp commented 2 years ago

Sorry for the long delay in responding, and not including a reprex in my first comments.

As it happens @okrebs your own example produces the error I was finding:

> mymat <- matrix(1, nrow = 2, ncol = 2)
> mydat <- data.frame(a = c(1,1), b = c(1,1))
> Ipfp(seed = mymat, target.list = list(1), target.data = list(c(3,3)))

Call:
Ipfp(seed = mymat, target.list = list(1), target.data = list(c(3, 
    3)))

Method:  ipfp - convergence:  TRUE 

Estimates:
     [,1] [,2]
[1,]  1.5  1.5
[2,]  1.5  1.5
> Ipfp(seed = mydat, target.list = list(1), target.data = list(c(3,3)))
 Error in .IpfpCoreC(seed, target.list, target.data, print, iter, tol, : 
Not compatible with requested type: [type=list; target=double]. 
> Ipfp(seed = as.matrix(mydat), target.list = list(1), target.data = list(c(3,3)))

Call:
Ipfp(seed = as.matrix(mydat), target.list = list(1), target.data = list(c(3, 
    3)))

Method:  ipfp - convergence:  TRUE 

Estimates:
     [,1] [,2]
[1,]  1.5  1.5
[2,]  1.5  1.5

There must be something else differing with our systems. The example above works fine with the CRAN version, but when I build from the fast-ipfp branch I get the above result - fails with a data.frame, but is ok with as.matrix() around the data.frame.

Here is my session info:

> devtools::session_info()
- Session info ------------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.1.1 (2021-08-10)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       Australia/Sydney            
 date     2021-10-12                  

- Packages ----------------------------------------------------------------------------------------
 package     * version    date       lib source        
 cachem        1.0.6      2021-08-19 [1] CRAN (R 4.1.1)
 callr         3.7.0      2021-04-20 [1] CRAN (R 4.1.1)
 cli           3.0.1      2021-07-17 [1] CRAN (R 4.1.1)
 cmm         * 0.12       2018-03-11 [1] CRAN (R 4.1.1)
 crayon        1.4.1      2021-02-08 [1] CRAN (R 4.1.1)
 desc          1.4.0      2021-09-28 [1] CRAN (R 4.1.1)
 devtools      2.4.2      2021-06-07 [1] CRAN (R 4.1.1)
 ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.1)
 fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.1.1)
 fs            1.5.0      2020-07-31 [1] CRAN (R 4.1.1)
 glue          1.4.2      2020-08-27 [1] CRAN (R 4.1.1)
 lifecycle     1.0.1      2021-09-24 [1] CRAN (R 4.1.1)
 magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.1.1)
 memoise       2.0.0      2021-01-26 [1] CRAN (R 4.1.1)
 mipfp       * 3.1        2021-10-11 [1] local         
 numDeriv    * 2016.8-1.1 2019-06-06 [1] CRAN (R 4.1.1)
 pkgbuild      1.2.0      2020-12-15 [1] CRAN (R 4.1.1)
 pkgload       1.2.2      2021-09-11 [1] CRAN (R 4.1.1)
 prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.1.1)
 processx      3.5.2      2021-04-30 [1] CRAN (R 4.1.1)
 ps            1.6.0      2021-02-28 [1] CRAN (R 4.1.1)
 purrr         0.3.4      2020-04-17 [1] CRAN (R 4.1.1)
 R6            2.5.1      2021-08-19 [1] CRAN (R 4.1.1)
 Rcpp          1.0.7      2021-07-07 [1] CRAN (R 4.1.1)
 remotes       2.4.1      2021-09-29 [1] CRAN (R 4.1.1)
 rlang         0.4.11     2021-04-30 [1] CRAN (R 4.1.1)
 rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.1.1)
 Rsolnp      * 1.16       2015-12-28 [1] CRAN (R 4.1.1)
 sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.1.1)
 testthat      3.1.0      2021-10-04 [1] CRAN (R 4.1.1)
 truncnorm     1.0-8      2018-02-27 [1] CRAN (R 4.1.1)
 usethis       2.0.1      2021-02-10 [1] CRAN (R 4.1.1)
 withr         2.4.2      2021-04-18 [1] CRAN (R 4.1.1)

[1] D:/Peter/Documents/R/win-library/4.1
[2] C:/Program Files/R/R-4.1.1/library
ellisp commented 2 years ago

I should add that the second issue I noted, where in the CRAN version I had to 'melt' the results to get the desired output but with your fast-ipfp version I didn't have to, I can not reproduce, and I think the problem was probably entirely at my end. For the record, the following is my test case and both the CRAN version and the fast-ipfp version work ok with it:


seed_df <- data.frame(
  lions = 1:2,
  tigers = 4:5,
  bears = 7:8
)

row.names(seed_df) <- c("Africa", "India")

# marginal populations for Africa, India; lions, tigers, bears:
population_df <- list(c(100, 100), c(60, 110, 30))

fit <- Ipfp(
  seed = as.matrix(seed_df),
  target.list = list(1,2),
  target.data = population_df
)

raked <- fit[[1]]

dim(raked) # should be 2, 3
ellisp commented 2 years ago

Finally, here is another problem. When building the package, I see this note:

Note: break used in wrong context: no loop is visible at ipfp_multi_dim.R:107 

I can then force an error when I have multi dimensional fitting with inconsistent marginal totals:

library(mipfp)

seed_df <- data.frame(
  lions = 1:2,
  tigers = 4:5,
  bears = 7:8
)

row.names(seed_df) <- c("Africa", "India")

# marginal populations for Africa, India; lions, tigers, bears:
population_df <- list(c(100, 100), c(60, 120, 30)) # note inconsistent totals

fit <- Ipfp(
  seed = as.matrix(seed_df),
  target.list = list(1,2),
  target.data = population_df
)

produces this error in the fast-ipfp branch:

Error in Ipfp(seed = as.matrix(seed_df), target.list = list(1, 2), target.data = population_df) : 
  no loop for break/next, jumping to top level
In addition: Warning message:
In Ipfp(seed = as.matrix(seed_df), target.list = list(1, 2), target.data = population_df) :
  Target not consistent - shifting to probabilities!
                Check input data!

But if I reload and use the CRAN version it works as expected (noting that it returns probabilities, not marginal totals, because they were inconsistent - but at least it isn't an error)

> library(mipfp)
Loading required package: cmm
Loading required package: Rsolnp
Loading required package: numDeriv
> 
> seed_df <- data.frame(
+   lions = 1:2,
+   tigers = 4:5,
+   bears = 7:8
+ )
> 
> row.names(seed_df) <- c("Africa", "India")
> 
> # marginal populations for Africa, India; lions, tigers, bears:
> population_df <- list(c(100, 100), c(60, 120, 30)) # note inconsistent totals
> 
> 
> 
> fit <- Ipfp(
+   seed = as.matrix(seed_df),
+   target.list = list(1,2),
+   target.data = population_df
+ )
Warning message:
In Ipfp(seed = as.matrix(seed_df), target.list = list(1, 2), target.data = population_df) :
  Target not consistents - shifting to probabilities!
                  Check input data!

> 
> fit[[1]]
           lions    tigers      bears
Africa 0.1181559 0.3029328 0.07891131
India  0.1675584 0.2684957 0.06394583