has2k1 / plotnine

A Grammar of Graphics for Python
https://plotnine.org
MIT License

MemoryError in 0.7.0 #407

Open tdhopper opened 4 years ago

tdhopper commented 4 years ago

The build for Pythonplot.com started failing with 0.7.0 when trying to run:

from plotnine import ggplot, aes, geom_density
from plotnine.data import diamonds

(ggplot(diamonds) +
  aes('depth', fill='cut', color='cut') +
  geom_density(alpha=0.1))

I get a memory error thrown by statsmodels which you can see in the build log.

has2k1 commented 4 years ago

v0.7.0 changed the default method for computing the bandwidth for density estimation. This was in response to #317.

For the example at pythonplot.com, the difference is captured by:

from plotnine.data import diamonds
from statsmodels.nonparametric.bandwidths import bw_normal_reference as nr
from statsmodels.sandbox.nonparametric import kernels
from plotnine.stats.stat_density import nrd0

k = kernels.Gaussian()
# Compare the two bandwidth rules for each cut group
for name, gdf in diamonds.groupby('cut'):
    x = gdf.depth
    print(
        f'{name}\n'
        f'normal_reference: {nr(x, k)}\n'
        f'nrd0:             {nrd0(x)}\n'
    )

Fair
normal_reference: 0.2689687852105908
nrd0:             0.2285370639409318

Good
normal_reference: 0.35873214763916517
nrd0:             0.30480708643752136

Very Good
normal_reference: 0.22284759734989557
nrd0:             0.18934887022210023

Premium
normal_reference: 0.18242428971018324
nrd0:             0.15500204430500442

Ideal
normal_reference: 0.09605607167961927
nrd0:             0.08161680389109885

where the normal_reference values are from before v0.7.0 and the nrd0 values are from v0.7.0. So far, it is not clear to me why that would result in significantly more memory being used. The nrd0 bandwidth is slightly smaller for each group (roughly 15% smaller in the numbers above); that means that when computing the density there are fewer points under each kernel function, and therefore more kernel functions. I do not know how much memory Travis allocates; maybe that difference is enough to tip it over in this case.
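To make the size of the bandwidth gap concrete, here is a minimal standard-library sketch of the two rules. It assumes that nrd0 follows R's bw.nrd0 (0.9 * min(sd, IQR / 1.34) * n^(-1/5)) and that the normal reference rule is C * min(sd, IQR / 1.349) * n^(-1/5) with C = (4/3)^(1/5), about 1.0592, for a Gaussian kernel. The function names and the synthetic sample are illustrative only; this is not the plotnine or statsmodels code.

```python
import random
import statistics


def _iqr(x):
    # Interquartile range using linear interpolation between order statistics.
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return q3 - q1


def nrd0_sketch(x):
    """Rule of thumb in the style of R's bw.nrd0 (degenerate cases omitted)."""
    spread = min(statistics.stdev(x), _iqr(x) / 1.34)
    return 0.9 * spread * len(x) ** -0.2


def normal_reference_sketch(x):
    """Normal reference rule, assuming the Gaussian constant (4/3)**(1/5)."""
    spread = min(statistics.stdev(x), _iqr(x) / 1.349)
    return (4 / 3) ** 0.2 * spread * len(x) ** -0.2


random.seed(0)
# Synthetic stand-in for one group's `depth` column (not the diamonds data).
sample = [random.gauss(61.7, 1.4) for _ in range(5000)]

bw_nrd0 = nrd0_sketch(sample)
bw_nr = normal_reference_sketch(sample)
print(f"nrd0:             {bw_nrd0}")
print(f"normal_reference: {bw_nr}")
print(f"ratio:            {bw_nrd0 / bw_nr:.4f}")
```

Under these assumptions the ratio is close to 0.9 / 1.0592, or about 0.85, regardless of the data. That matches the ratio of the nrd0 and normal_reference values printed above for every group, so the two rules differ by a near-constant factor rather than by anything specific to a group.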

TODO NEXT: Find out the differences in memory used for these density computations.
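
One possible way to approach that comparison is to wrap each density computation in a peak-memory measurement. The sketch below uses the standard library's tracemalloc; `peak_memory`, `dense_kde` and `streaming_kde` are hypothetical helpers written for illustration, and the two toy Gaussian KDEs are stand-ins, not the statsmodels implementation.

```python
import math
import tracemalloc


def peak_memory(func, *args):
    """Run func(*args) and return (result, peak traced bytes while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def dense_kde(data, grid, bw):
    """Naive Gaussian KDE that stores the full len(grid) x len(data) kernel table."""
    norm = 1.0 / (len(data) * bw * math.sqrt(2 * math.pi))
    table = [[math.exp(-0.5 * ((g - d) / bw) ** 2) for d in data] for g in grid]
    return [norm * sum(row) for row in table]


def streaming_kde(data, grid, bw):
    """Same estimate, summing each grid point's kernels without storing them."""
    norm = 1.0 / (len(data) * bw * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - d) / bw) ** 2) for d in data)
        for g in grid
    ]


# Small synthetic inputs so the comparison runs quickly.
data = [i / 10 for i in range(300)]
grid = [i / 10 for i in range(300)]

dense, dense_peak = peak_memory(dense_kde, data, grid, 0.25)
stream, stream_peak = peak_memory(streaming_kde, data, grid, 0.25)
print(f"dense peak:     {dense_peak} bytes")
print(f"streaming peak: {stream_peak} bytes")
```

In this toy version the peak depends on len(data) and len(grid), not on bw. Whether the same holds for the code path that plotnine calls in statsmodels is what the measurement would need to establish; the same `peak_memory` wrapper could be pointed at the real density call for each group and each bandwidth rule.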