component: Problems.DataAnalysis.Symbolic | priority: medium
2020-02-18 13:03:50: @foolnotion created the issue
There is a big difference between the speed of the tree interpreter and the speed of the Pearson's R² evaluator:
For an arithmetic grammar the new native interpreter achieves a speed of approx. 3 billion nodes/second.
By contrast, the Pearson's R² evaluator is more than 20 times slower: using the values returned by the interpreter, it manages only 0.14 billion nodes/second when computing the R² value. This is an obvious bottleneck.
Of course, the evaluator performs some extra work:
Generating random subsets of rows from the training partition
Scaling the values and bounding them to the estimation limits
Checking for error conditions in the calculators (linear scaling calculator, Pearson's R correlation calculator)
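The extra steps listed above can be sketched roughly as follows. This is a minimal Python illustration of the pipeline, not HeuristicLab's C# code; all function names (`generate_rows`, `linear_scale`, `clamp`, `r_squared`, `evaluate`) and parameters are hypothetical and chosen only for this example.

```python
import math
import random

def generate_rows(start, end, fraction=1.0, rng=None):
    """Return row indices from [start, end); a random subset if fraction < 1."""
    rows = range(start, end)
    if fraction >= 1.0:
        return list(rows)
    k = int(len(rows) * fraction)
    rng = rng or random.Random()
    return sorted(rng.sample(rows, k))

def linear_scale(estimated, target):
    """Return (alpha, beta) minimizing the squared error of alpha + beta * e."""
    n = len(estimated)
    if n == 0:
        return 0.0, 1.0
    mean_e = sum(estimated) / n
    mean_t = sum(target) / n
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimated, target))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    beta = cov / var_e if var_e > 0 else 1.0
    return mean_t - beta * mean_e, beta

def clamp(values, lower, upper):
    """Bound every value to the estimation limits [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if it is undefined."""
    n = len(x)
    if n < 2:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)

def evaluate(estimated, target, rows, lower, upper):
    """Select rows, scale, bound to the limits, then compute R²."""
    e = [estimated[i] for i in rows]
    t = [target[i] for i in rows]
    alpha, beta = linear_scale(e, t)
    scaled = clamp([alpha + beta * v for v in e], lower, upper)
    return r_squared(scaled, t)
```

Each of these steps makes an additional pass over the estimated values, which gives an idea of where the evaluator's time goes beyond the interpreter itself.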
I think there are some inconsistencies and performance bottlenecks in the SymbolicDataAnalysisEvaluator design:
Unnecessary use of LINQ and enumerators
Thread-static cache for estimated values (it is not clear why this is needed, since a tree should be evaluated only once)
In 99% of use cases, the GenerateRows method just returns an IEnumerable<int> covering TrainingPartition.Start to TrainingPartition.End, so it only introduces overhead
The underlying statistical calculators (mean-variance calculator, covariance, correlation, linear scaling) perform the same checks on the added values (IsNaN, IsInfinity). This adds up to a noticeable amount of overhead since, e.g., the correlation calculator calls three other calculators in its Add method, each performing the same checks
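To make the duplicated checks in the last point concrete, here is a small Python model of a correlation calculator that delegates to three sub-calculators. It is only a sketch under assumptions: the class names and the `CheckCounter` helper are hypothetical, and the real calculators may be structured differently.

```python
import math

class CheckCounter:
    """Counts validity checks so the redundancy becomes visible."""
    def __init__(self):
        self.count = 0

    def is_valid(self, x):
        self.count += 1
        return not (math.isnan(x) or math.isinf(x))

class MeanVarianceCalculator:
    def __init__(self, counter):
        self.counter = counter
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations
        self.error = False

    def add(self, x):
        if not self.counter.is_valid(x):  # checks x
            self.error = True
            return
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

class CovarianceCalculator:
    def __init__(self, counter):
        self.counter = counter
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.c = 0.0  # running sum of co-deviations
        self.error = False

    def add(self, x, y):
        x_ok = self.counter.is_valid(x)  # checks x again
        y_ok = self.counter.is_valid(y)  # checks y again
        if not (x_ok and y_ok):
            self.error = True
            return
        self.n += 1
        dx = x - self.mean_x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.c += dx * (y - self.mean_y)

class CorrelationCalculator:
    """Delegates to three calculators, each validating the same inputs."""
    def __init__(self, counter):
        self.cov = CovarianceCalculator(counter)
        self.var_x = MeanVarianceCalculator(counter)
        self.var_y = MeanVarianceCalculator(counter)

    def add(self, x, y):
        self.cov.add(x, y)
        self.var_x.add(x)
        self.var_y.add(y)

    def r(self):
        if self.cov.error or self.var_x.error or self.var_y.error:
            return float("nan")
        denom = math.sqrt(self.var_x.m2 * self.var_y.m2)
        return self.cov.c / denom if denom > 0 else float("nan")
```

In this model every `add(x, y)` triggers four validity checks where two would suffice, and stacking a linear scaling calculator on top would add more.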
Suggested workarounds:
Use arrays: this will slightly increase memory pressure, but we are generating those values anyway (generate rows, get tree values), so it would speed things up overall
As an example, adding a GetSymbolicExpressionTreeValues(ISymbolicExpressionTree, IDataset, int[]) method to the ISymbolicDataAnalysisExpressionTreeInterpreter interface would simplify usage and not break existing contracts
Check values added to the calculators only once: I would provide an AddUnchecked method to the online calculators for cases when the caller knows what they are doing
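The two suggestions could be combined along these lines. This is a hedged Python sketch rather than the proposed C# change: `PearsonR2Calculator`, `add_unchecked` and `r2_from_arrays` are hypothetical names, with `add_unchecked` standing in for the suggested AddUnchecked method.

```python
import math

class PearsonR2Calculator:
    """Single-pass R² calculator with a checked and an unchecked add."""
    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0
        self.m2_y = 0.0
        self.c_xy = 0.0
        self.error = False

    def add(self, x, y):
        # Checked path: validate, then delegate to the unchecked update.
        if not (math.isfinite(x) and math.isfinite(y)):
            self.error = True
            return
        self.add_unchecked(x, y)

    def add_unchecked(self, x, y):
        # Unchecked path: the caller guarantees that x and y are finite.
        self.n += 1
        dx = x - self.mean_x
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def r2(self):
        if self.error or self.n < 2 or self.m2_x == 0 or self.m2_y == 0:
            return float("nan")
        return (self.c_xy * self.c_xy) / (self.m2_x * self.m2_y)

def r2_from_arrays(estimated, target):
    """Validate both arrays once, then use the unchecked update in the loop."""
    if not all(math.isfinite(v) for v in estimated):
        return float("nan")
    if not all(math.isfinite(v) for v in target):
        return float("nan")
    calc = PearsonR2Calculator()
    for x, y in zip(estimated, target):
        calc.add_unchecked(x, y)
    return calc.r2()
```

Working on arrays lets the validation happen in one up-front pass, so the inner accumulation loop contains only arithmetic.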
A naive implementation of these suggestions yields a 2-4x improvement in evaluation speed.
Issue migrated from trac ticket # 3058