holub008 / xrf

eXtreme RuleFit (sparse linear models on XGBoost ensembles)

Slow rule evaluation #5

Closed · holub008 closed this issue 5 years ago

holub008 commented 5 years ago

At large data sizes, rules are evaluated very slowly. This is due to clumsy, R-level handling of rule grouping and conjunction (here). It could either be done very quickly in C, or potentially with a data frame full join.
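For context, the rules table (as consumed by the code in the next comment) looks roughly like this; the column names come from that code, and the values are illustrative. There is one row per split, and a rule is the conjunction of all splits sharing a rule_id:

# illustrative rules table; a rule is the conjunction of its splits
rules <- data.frame(
  rule_id   = c('r1', 'r1', 'r2'),
  feature   = c('age', 'hours_per_week', 'age'),
  split     = c(35, 40, 50),
  less_than = c(TRUE, FALSE, FALSE)
)
# r1 reads: age < 35 & hours_per_week >= 40
# r2 reads: age >= 50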

holub008 commented 5 years ago

I hacked together:

# requires dplyr (group_by, do, bind_cols)
evaluate_rules2 <- function(rules, data) {
  data_df <- as.data.frame(data)
  per_rule_evaluation <- rules %>%
    group_by(rule_id) %>%
    do(
      # yes, this is gross
      # yes, this is fast
      # per rule_id, builds and evaluates an expression like
      #   `age` < 35 & `hours_per_week` >= 40
      # against data_df (feature names illustrative)
      rule_evaluation = eval(parse(text = paste0(
          paste0('`', .$feature, '`'),
          ifelse(.$less_than, '<', '>='),
          .$split,
          collapse = ' & '
        )),
        data_df) %>%
        as.integer() %>%
        data.frame()
    )
  # one 0/1 column per rule, named by rule_id
  rule_features <- bind_cols(per_rule_evaluation$rule_evaluation)
  colnames(rule_features) <- per_rule_evaluation$rule_id

  rule_features
}

without realizing that we need to support sparse matrices: the hack above coerces the input to a data frame, which at best densifies sparse data. Note the old approach accesses the data columns using standard R indexing, which both dense and sparse matrix implementations support.
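For reference, a minimal sketch of indexing-based evaluation (the idea behind the existing approach, not the exact xrf code); data[, feature] dispatches correctly for base matrices and Matrix sparse matrices alike:

# evaluate one rule (all splits sharing a rule_id) by column indexing;
# works unchanged for dense matrices and Matrix::sparseMatrix inputs
evaluate_rule_by_indexing <- function(rule, data) {
  satisfied <- rep(TRUE, nrow(data))
  for (split_ix in seq_len(nrow(rule))) {
    column <- data[, rule$feature[split_ix]]
    satisfied <- satisfied & if (rule$less_than[split_ix]) {
      column < rule$split[split_ix]
    } else {
      column >= rule$split[split_ix]
    }
  }
  as.integer(satisfied)
}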

This is a real bummer, because the eval()-based solution is an order of magnitude faster than the existing one:

# following from the census data in the README, using evaluate_rules2 as above
> data <- model.matrix(above_50k ~ ., census_income)
> system.time(evaluate_rules(m_xrf$rules, data))
   user  system elapsed 
  8.773   0.288   9.089 
> system.time(evaluate_rules2(m_xrf$rules, data))
   user  system elapsed 
  0.720   0.094   0.816 

I think there are a couple approaches: dispatch on the input type (use the fast eval() path for dense data, keep the existing indexing path for sparse), or drop into C for the sparse case.
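A sketch of the dispatch option (the wrapper name is hypothetical; evaluate_rules is the existing implementation, evaluate_rules2 the eval()-based one above):

# hypothetical dispatch wrapper: fast eval() path for dense input,
# fall back to the existing sparse-safe path otherwise
evaluate_rules_dispatched <- function(rules, data) {
  if (methods::is(data, 'sparseMatrix')) {
    evaluate_rules(rules, data)   # existing, indexing-based, sparse-safe
  } else {
    evaluate_rules2(rules, data)  # fast dense path from above
  }
}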

holub008 commented 5 years ago

I also noticed that dedupe_train_rules is horribly slow; it could be made faster by building a correlation matrix with cor. The algorithm is sketchy in general, in that the choice of which rule to remove is arbitrary; furthermore, the LASSO should regularize away duplicates on its own.
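As a rough sketch of the cor idea (not xrf's actual dedupe_train_rules), a single vectorized cor() call over the evaluated rule columns replaces pairwise R-level comparisons:

# drop every rule column perfectly correlated with an earlier one;
# cor() yields NA (with a warning) for zero-variance columns, so treat NA as 0
drop_correlated_rules <- function(rule_features) {
  correlations <- cor(as.matrix(rule_features))
  correlations[is.na(correlations)] <- 0
  # column j is redundant if some earlier column i has |cor(i, j)| ~= 1
  perfect <- upper.tri(correlations) & abs(correlations) > 1 - 1e-8
  rule_features[, !apply(perfect, 2, any), drop = FALSE]
}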

That feature was added to reduce the computational load on glmnet, but I'm not sure it's justified. I might simply remove it in favor of dropping any exactly duplicated evaluated rules (which isn't in any way contentious and therefore needs no user lever).
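That removal can be a one-liner over the evaluated rule matrix (a sketch, assuming rule_features is the data frame produced by rule evaluation):

# keep only the first occurrence of each distinct evaluated-rule column;
# duplicated() compares whole columns when given the list of columns
drop_duplicate_rules <- function(rule_features) {
  rule_features[!duplicated(as.list(rule_features))]
}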

holub008 commented 5 years ago

The correlation feature was removed here: https://github.com/holub008/xrf/commit/460727baf42100c7ac4bfe40ee3c9328adaf5bcc

The faster rule evaluation for dense data was added here: https://github.com/holub008/xrf/commit/10960de447d4e0c15aacde8c47f9be547fcd0a6d

It remains unsolved how to make rule evaluation faster for sparse data; I think that can only come from delving into C, which I'm not interested in doing for now.