tdhopper opened 4 years ago
v0.7.0 changed the default method for computing the bandwidth for density estimation. This was in response to #317.
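For context, the two rules of thumb differ mainly in their leading constant: `nrd0` (modelled on R's `bw.nrd0`, i.e. Silverman's rule of thumb) uses 0.9, while the Gaussian normal reference rule uses (4/3)^(1/5) ≈ 1.059, so `nrd0` bandwidths come out roughly 15% smaller. Here is a minimal sketch of both rules, assuming `nrd0` matches R's `bw.nrd0`; the exact implementations live in plotnine and statsmodels:

```python
import numpy as np

def nrd0_sketch(x):
    # Silverman's rule of thumb (R's bw.nrd0, minus its edge-case fallbacks):
    # 0.9 * min(sd, IQR/1.34) * n^(-1/5)
    x = np.asarray(x, dtype=float)
    sd = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(sd, iqr / 1.34) * len(x) ** -0.2

def normal_reference_sketch(x):
    # Normal reference rule for a Gaussian kernel:
    # (4/3)^(1/5) * min(sd, IQR/1.349) * n^(-1/5)
    x = np.asarray(x, dtype=float)
    sd = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return (4 / 3) ** 0.2 * min(sd, iqr / 1.349) * len(x) ** -0.2
```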
For the example at pythonplot.com, the difference is captured by:
```python
from plotnine.data import diamonds
from statsmodels.nonparametric.bandwidths import bw_normal_reference as nr
from statsmodels.sandbox.nonparametric import kernels
from plotnine.stats.stat_density import nrd0

k = kernels.Gaussian()
for _, gdf in diamonds.groupby('cut'):
    x = gdf.depth
    print(
        f'normal_reference: {nr(x, k)}\n'
        f'nrd0: {nrd0(x)}\n'
    )
```
```
Fair
normal_reference: 0.2689687852105908
nrd0: 0.2285370639409318
Good
normal_reference: 0.35873214763916517
nrd0: 0.30480708643752136
Very Good
normal_reference: 0.22284759734989557
nrd0: 0.18934887022210023
Premium
normal_reference: 0.18242428971018324
nrd0: 0.15500204430500442
Ideal
normal_reference: 0.09605607167961927
nrd0: 0.08161680389109885
```
where the `normal_reference` values are pre-v0.7.0 and the `nrd0` values are v0.7.0. So far it is not clear to me why that change would result in significantly more memory being used. The `nrd0` bandwidth is slightly smaller for each group; that means that when computing the density there are fewer points under each kernel function and therefore more kernel functions to evaluate. I do not know how much memory Travis allocates; maybe that difference is enough to tip it over in this case.
TODO NEXT: Find out the differences in memory used for these density computations.
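As a starting point, here is a sketch of that measurement using `tracemalloc` and statsmodels' `KDEUnivariate`. It assumes that plotnine's density stat ultimately calls into statsmodels, and that the non-FFT pairwise evaluation path (`fft=False`) is where the memory blowup would happen; both are assumptions, not confirmed from the source:

```python
import tracemalloc
import statsmodels.api as sm
from plotnine.data import diamonds
from plotnine.stats.stat_density import nrd0
from statsmodels.nonparametric.bandwidths import bw_normal_reference as nr
from statsmodels.sandbox.nonparametric import kernels

k = kernels.Gaussian()

def peak_kde_memory(x, bw):
    """Peak traced memory (bytes) while fitting a Gaussian KDE at bandwidth bw."""
    tracemalloc.start()
    kde = sm.nonparametric.KDEUnivariate(x)
    kde.fit(kernel="gau", bw=bw, fft=False)  # fft=False: pairwise path (assumed culprit)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

for cut, gdf in diamonds.groupby("cut"):
    x = gdf.depth.to_numpy(dtype=float)
    print(
        f"{cut}: "
        f"normal_reference={peak_kde_memory(x, nr(x, k)):,} bytes, "
        f"nrd0={peak_kde_memory(x, nrd0(x)):,} bytes"
    )
```

If the two peaks come out essentially equal, that would rule out the bandwidth itself and point toward something else in the v0.7.0 density path.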
The build for Pythonplot.com started failing with 0.7.0; trying to run it, I get a memory error thrown by statsmodels, which you can see in the build log.