timmens opened 4 years ago
What has been done:
The commits (aca31b1ea3f76964) and (a2f504f9ccfbab8cd9) improve the speed of the inner loop (over observations) by a large margin.
In the first commit I replaced most np.sum() and np.mean() calls with a dynamic sum extension.
In the second commit I swapped pd.DataFrame data storage for fast np.array storage and now simply convert the end result to a pd.DataFrame.
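A minimal sketch of the storage change described above (the names and shapes here are hypothetical, not taken from the actual code): intermediate results are accumulated in a pre-allocated np.array inside the hot loop, and the conversion to a pd.DataFrame happens exactly once at the end.

```python
import numpy as np
import pandas as pd

n_obs, n_features = 100, 3
rng = np.random.default_rng(0)
data = rng.normal(size=(n_obs, n_features))

# fast ndarray storage inside the loop instead of growing a DataFrame
result = np.empty((n_obs, 2))
for i in range(n_obs):
    row = data[i]
    result[i, 0] = row.sum()
    result[i, 1] = row.mean()

# single conversion to a DataFrame at the very end
df = pd.DataFrame(result, columns=["sum", "mean"])
```

Appending to a pd.DataFrame inside a loop triggers per-iteration overhead (index alignment, possible reallocation), which the one-shot conversion avoids.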
What still needs to be done:
numba is disabled, since this allows checking which function calls make _find_optimal_split slow. Current profiling has shown that _find_optimal_split is still the only major concern.
Problem: Right now the function _find_optimal_split is very inefficient. In the inner loop over splitting_points I compute means and sums in every iteration, even though I could update an initial value.
Solution: Implement a dynamic updating algorithm that finds the best splitting point for a given feature index.
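The proposed dynamic updating could look roughly like the sketch below (function name and loss decomposition are my own, not taken from the repository): instead of recomputing left/right sums for every candidate split, a running left sum is updated in O(1) per step, so a full pass over all splitting points costs O(n) instead of O(n²).

```python
import numpy as np

def find_best_split(sorted_outcomes):
    """Hypothetical sketch of the dynamic updating scheme.

    For outcomes sorted by the feature, the SSE of a split decomposes as
    sum(y^2) - left_sum^2 / n_left - right_sum^2 / n_right; the constant
    sum(y^2) is dropped since it does not affect the argmin.
    Returns the index i of the best split (left = [:i], right = [i:]).
    """
    n = len(sorted_outcomes)
    total = sorted_outcomes.sum()
    left_sum = 0.0
    best_index, best_loss = None, np.inf
    for i in range(1, n):
        left_sum += sorted_outcomes[i - 1]  # O(1) running-sum update
        right_sum = total - left_sum
        loss = -(left_sum**2) / i - (right_sum**2) / (n - i)
        if loss < best_loss:
            best_loss, best_index = loss, i
    return best_index
```

Because the loop body uses only scalar arithmetic on ndarray elements, a function of this shape should also be straightforward to re-enable under numba once profiling is done.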