tylermorganwall / skpr

Generates and evaluates D, I, A, Alias, E, T, G, and custom optimal designs. Supports generation and evaluation of mixture and split/split-split/N-split plot designs. Includes parametric and Monte Carlo power evaluation functions. Provides a framework to evaluate power using functions provided in other packages or written by the user.
https://tylermorganwall.github.io/skpr/
GNU General Public License v3.0
115 stars 15 forks source link

Issue with combination of Mixutre Components and categoric Process Factors #77

Closed neuhier closed 9 months ago

neuhier commented 1 year ago

I would like to set up a combined DOE using 3 mixture components and 1 categorical process factor (a simple version of this):

library(dplyr)
library(skpr)

candidate_set <- expand.grid(
  "M1" = seq(0, 100, by=10),
  "M2" = seq(0, 100, by=10),
  "M3" = seq(0, 100, by=10),
  "P1" = c("A","B")
) |> dplyr::filter(M1+M2+M3 == 100)

gen_design(
  candidate_set, 
  model = 
    ~ + -1 + 
    (M1 + M2 + M3)^3 + # Mixture Factors
    P1:M1 + P1:M2, # 2FI of mixture and process
  trials = 20, 
  repeats = 20
  )

Sadly this fails, with the following error message:

Error in gen_design(candidate_set, model = ~+-1 + (M1 + M2 + M3)^3 + P1:M1 +  : 
  skpr: The candidate set does not support the specified model - its rank is too low. This usually happens if disallowed combinations have introduced a perfect correlation between some variables in the candidate set. It can also happen if you have specified a quadratic model term but have only two levels of that factor in the candidate set, or (more generally) if two of your terms can be  expressed as a linear combination of another term. To solve this, you either need to choose a different model or add more factor combinations to your candidate set.

If I use effect coding for the process factor everything works just fine:

candidate_set <- expand.grid(
  "M1" = seq(0, 100, by=10),
  "M2" = seq(0, 100, by=10),
  "M3" = seq(0, 100, by=10),
  "P1" = c(-1,1)
) |> dplyr::filter(M1+M2+M3 == 100)

It would be really great if gen_design could handle categorical factors for this usecase, especially when these factors use more than 2 levels, e.g.:

candidate_set <- expand.grid(
  "M1" = seq(0, 100, by=10),
  "M2" = seq(0, 100, by=10),
  "M3" = seq(0, 100, by=10),
  "P1" = c("A","B", "C")
) |> dplyr::filter(M1+M2+M3 == 100)
tylermorganwall commented 1 year ago

How do you plan on analyzing your data? The issue is that this model has singularities when parsed with the candidate set in R. This is due to the way R's model.matrix() function expands categorical factors when the intercept term isn't included in the model: when you include the categorical factor and exclude the intercept term, R switches to dummy encoding for the first categorical factor. The resulting candidate set's information matrix has negative eigenvalues due to the correlation introduced by this default change to the model matrix, so mathematically we cannot find an "optimal" design.

The workaround would be doing the contrast encoding yourself as numeric variables.

You can see the singularities being reported below.

library(dplyr)
library(skpr)

candidate_set <- expand.grid(
  "M1" = seq(0, 100, by=10),
  "M2" = seq(0, 100, by=10),
  "M3" = seq(0, 100, by=10),
  "P1" = c("A","B")
) |> dplyr::filter(M1+M2+M3 == 100)

gen_design(
  candidate_set, 
  model = 
    ~ + -1 + 
    (M1 + M2 + M3)^3 + # Mixture Factors
    P1:M1 + P1:M2, # 2FI of mixture and process
  trials = 20, 
  repeats = 20
)
#> Warning in reduceRunMatrix(candidateset, model): Interaction-only term in
#> formula detected: it is rare that an interaction term should be present in a
#> model without the corresponding main effect term(s).
#> Error in gen_design(candidate_set, model = ~+-1 + (M1 + M2 + M3)^3 + P1:M1 + : skpr: The candidate set does not support the specified model - its rank is too low. This usually happens if disallowed combinations have introduced a perfect correlation between some variables in the candidate set. It can also happen if you have specified a quadratic model term but have only two levels of that factor in the candidate set, or (more generally) if two of your terms can be  expressed as a linear combination of another term. To solve this, you either need to choose a different model or add more factor combinations to your candidate set.

candidate_set$Y = rnorm(nrow(candidate_set))

lm(formula = Y ~ + -1 + (M1 + M2 + M3)^3 + P1:M1 + P1:M2,
   data=candidate_set) |> 
  summary()
#> 
#> Call:
#> lm(formula = Y ~ +-1 + (M1 + M2 + M3)^3 + P1:M1 + P1:M2, data = candidate_set)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.27281 -0.51750 -0.02724  0.65722  2.80957 
#> 
#> Coefficients: (1 not defined because of singularities)
#>            Estimate Std. Error t value Pr(>|t|)  
#> M1       -6.148e-03  4.511e-03  -1.363    0.175  
#> M2       -5.281e-03  4.511e-03  -1.171    0.244  
#> M3       -2.844e-03  3.986e-03  -0.714    0.477  
#> M1:M2     2.850e-04  1.891e-04   1.507    0.134  
#> M1:M3     2.941e-04  1.891e-04   1.555    0.122  
#> M2:M3     3.030e-04  1.891e-04   1.602    0.112  
#> M1:P1A   -3.510e-03  4.225e-03  -0.831    0.408  
#> M1:P1B           NA         NA      NA       NA  
#> M2:P1B   -3.146e-03  4.225e-03  -0.745    0.458  
#> M1:M2:M3 -1.683e-05  1.009e-05  -1.667    0.098 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9483 on 123 degrees of freedom
#> Multiple R-squared:  0.07251,    Adjusted R-squared:  0.004647 
#> F-statistic: 1.068 on 9 and 123 DF,  p-value: 0.3908

Created on 2023-08-10 with reprex v2.0.2