I hacked together:
```r
library(dplyr)

evaluate_rules2 <- function(rules, data) {
  data_df <- as.data.frame(data)
  per_rule_evaluation <- rules %>%
    group_by(rule_id) %>%
    do(
      # yes, this is gross
      # yes, this is fast
      # builds each rule as a single conjunction expression (conjuncts joined
      # with " & ") and evaluates it against data_df
      rule_evaluation = eval(parse(text = paste0(
        paste0('`', .$feature, '`'),
        ifelse(.$less_than, '<', '>='),
        .$split,
        collapse = ' & '
      )),
      data_df) %>%
        as.integer() %>%
        data.frame()
    )

  rule_features <- bind_cols(per_rule_evaluation$rule_evaluation)
  colnames(rule_features) <- per_rule_evaluation$rule_id
  rule_features
}
```
without realizing that we need to support sparse matrices; the hack relies on `as.data.frame(data)`. Note that the old approach accesses data columns using standard R indexing, which both dense and sparse matrix implementations support.
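For reference, index-based evaluation in that style might look something like the sketch below. This is not the package's implementation and the function name is mine; it just illustrates that plain `data[, feature]` indexing keeps both dense matrices and `Matrix` sparse matrices working:

```r
library(Matrix)  # only needed when `data` is a sparse matrix

# Illustrative sketch, not the xrf implementation: evaluate each rule by ANDing
# its conjuncts, relying only on `data[, feature]` column indexing.
# Returns a 0/1 matrix with one column per rule_id.
evaluate_rules_indexed <- function(rules, data) {
  sapply(split(rules, rules$rule_id), function(rule) {
    satisfied <- rep(TRUE, nrow(data))
    for (ix in seq_len(nrow(rule))) {
      # the same indexing call works for a base matrix and a dgCMatrix
      column <- data[, as.character(rule$feature[ix])]
      satisfied <- satisfied &
        if (rule$less_than[ix]) column < rule$split[ix] else column >= rule$split[ix]
    }
    as.integer(satisfied)
  })
}
```

Indexing like this keeps sparse support, but the rule-by-rule, conjunct-by-conjunct looping is exactly the kind of R-level overhead discussed below.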
This is a real bummer, because `evaluate_rules2` above is an order of magnitude faster than the existing implementation:
```r
# following from the census data in README, using evaluate_rules2 as above
data <- model.matrix(above_50k ~ ., census_income)

> system.time(evaluate_rules(m_xrf$rules, data))
   user  system elapsed
  8.773   0.288   9.089
> system.time(evaluate_rules2(m_xrf$rules, data))
   user  system elapsed
  0.720   0.094   0.816
```
I think there are a couple approaches:
I also noticed that `dedupe_train_rules` is horribly slow and could be made faster by building a correlation matrix with `cor`. The algorithm is sketchy in general, in that the choice of which rule to remove is arbitrary; furthermore, the LASSO should automatically regularize away duplicates. That feature was added to reduce the computational load on glmnet, but I'm not sure it's justified. I might simply remove it in favor of dropping any exactly duplicated evaluated rules (which wouldn't be in any way contentious and therefore needs no user-facing option).
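For what it's worth, exact-duplicate removal on the evaluated rule matrix is only a few lines. A sketch, assuming `rule_features` is the 0/1 matrix of evaluated rules with one column per rule (the helper name is hypothetical, not part of xrf):

```r
# Hypothetical helper: drop any rule whose evaluated 0/1 column exactly
# matches an earlier rule's column.
drop_duplicate_rules <- function(rule_features) {
  # duplicated() compares rows, so transpose to compare rule columns
  keep <- !duplicated(t(as.matrix(rule_features)))
  rule_features[, keep, drop = FALSE]
}
```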
The correlation feature was removed here: https://github.com/holub008/xrf/commit/460727baf42100c7ac4bfe40ee3c9328adaf5bcc

The faster rule evaluation for dense data was added here: https://github.com/holub008/xrf/commit/10960de447d4e0c15aacde8c47f9be547fcd0a6d
It remains unsolved how to make rule evaluation faster for sparse data; I think that can only come from dropping down into C, which I'm not interested in doing for now.
At large data sizes, rules are evaluated very slowly. This is due to clumsy, R-level handling of rule grouping + conjunction (here). This could either be done very quickly in C or, potentially, with a data frame full join.
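To make the join option concrete, here is a rough sketch under the same column-name assumptions as the rules data frame above (rule_id, feature, less_than, split); it is an illustration, not the eventual implementation:

```r
library(dplyr)
library(tidyr)

# Rough sketch of join-based rule evaluation: melt the data to long form, join
# each (row, feature) value onto the rule conjuncts for that feature, then
# require every conjunct of a rule to hold for the row.
evaluate_rules_join <- function(rules, data_df) {
  data_df %>%
    mutate(row_id = row_number()) %>%
    pivot_longer(-row_id, names_to = "feature", values_to = "value") %>%
    inner_join(rules, by = "feature", relationship = "many-to-many") %>%
    mutate(satisfied = ifelse(less_than, value < split, value >= split)) %>%
    group_by(row_id, rule_id) %>%
    summarise(rule_met = as.integer(all(satisfied)), .groups = "drop") %>%
    pivot_wider(names_from = rule_id, values_from = rule_met, values_fill = 0L)
}
```

Whether the join actually beats the grouped eval trick (or a C implementation) would need benchmarking; the long intermediate table has roughly `nrow(data)` times the total number of conjuncts rows, which may itself be a problem at large data sizes.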