has2k1 / scikit-misc

Miscellaneous tools for data analysis and scientific computing
https://has2k1.github.io/scikit-misc/stable
BSD 3-Clause "New" or "Revised" License

"segmentation fault" with huge loess stat_smooth (from plotnine) #3

Open — saladpanda opened this issue 6 years ago

saladpanda commented 6 years ago

I'm quite sure this is an issue with scikit-misc, so I'm filing it here. I ran into the following while making plots with https://github.com/has2k1/plotnine.

#!/usr/bin/env python3

import numpy as np
import pandas as pd
from plotnine import *

time_int   = np.array(range(30000))
time_float = np.linspace(0, 500, 30000)
values = np.random.randint(1, 1000, 30000)

df = pd.DataFrame({'time_int': time_int, 'time_float': time_float, 'values': values})
df.info()

plot1 = ggplot(df, aes(x='time_int', y='values')) \
        + stat_smooth(method='loess')

plot2 = ggplot(df, aes(x='time_float', y='values')) \
        + stat_smooth(method='loess')

# print(plot1) # gives 'out of memory'
print(plot2) # crashes with segfault

With print(plot1) uncommented, this prints:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 3 columns):
time_float    30000 non-null float64
time_int      30000 non-null int64
values        30000 non-null int64
dtypes: float64(1), int64(2)
memory usage: 703.2 KB
[skmisc/loess/src/misc.c:34] Out of memory (7200000000 bytes)

With print(plot2), this prints:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 3 columns):
time_float    30000 non-null float64
time_int      30000 non-null int64
values        30000 non-null int64
dtypes: float64(1), int64(2)
memory usage: 703.2 KB
zsh: segmentation fault (core dumped)  ./test.py
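
If it helps narrow things down, I believe the crash can be reproduced without plotnine by calling scikit-misc's loess directly on data of the same size (a minimal sketch; I'm assuming the basic skmisc.loess interface here and haven't isolated it myself):

#!/usr/bin/env python3

import numpy as np
from skmisc.loess import loess

# same size as the plotnine example above: 30000 points
x = np.linspace(0, 500, 30000)
y = np.random.randint(1, 1000, 30000).astype(float)

# fitting allocates the quadratic working memory inside the C code,
# so this should hit the same out-of-memory / segfault path
lo = loess(x, y)
lo.fit()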
has2k1 commented 6 years ago

I understand and expect the "Out of memory" error given the size of the data; the loess algorithm is O(n^2) in memory. I do not expect a segfault; I think it is related to the low-memory situation (probably an unchecked malloc).
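
For what it's worth, the logged figure is consistent with a single n-by-n array of 8-byte doubles (just back-of-the-envelope arithmetic, not a statement about the exact allocation in the C sources):

n = 30000
print(n * n * 8)  # 7200000000 bytes (~7.2 GB), the amount reported by misc.c above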

Both tests run without crashing on my system, but with 40000 rows I get segfaults for both plots.

saladpanda commented 6 years ago

The "Out of memory" is absolutely expected. The bug I wanted to report is the segfault.

Testing the above code again, I now get segfaults for both plots and can't find a size at which I just get "out of memory".

I noticed this while using plotnine in a Jupyter notebook: I had method set to loess and then increased the size of the dataframe, and suddenly the IPython kernel kept crashing when generating the plot. I think scikit-misc (or plotnine?) should catch this instead of crashing.
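
Until it is handled inside the library, a caller-side guard is one possible workaround. A rough sketch, assuming the dominant cost is the n-by-n double working array estimated above (the helper name and the 2 GB budget are made up for illustration):

#!/usr/bin/env python3

import numpy as np
import pandas as pd
from plotnine import ggplot, aes, stat_smooth

df = pd.DataFrame({
    'time_float': np.linspace(0, 500, 30000),
    'values': np.random.randint(1, 1000, 30000),
})

# hypothetical guard: loess working memory grows roughly like n^2 doubles,
# so subsample when that alone would exceed a rough budget
def loess_is_feasible(n, budget_bytes=2 * 1024**3):
    return n * n * 8 <= budget_bytes

df_plot = df if loess_is_feasible(len(df)) else df.sample(5000, random_state=0)

plot = ggplot(df_plot, aes(x='time_float', y='values')) + stat_smooth(method='loess')
print(plot)

Alternatively, if I remember correctly, stat_smooth(method='lowess') goes through statsmodels' lowess and does not need the quadratic working memory, so it may be another way around the crash for large data.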

has2k1 commented 6 years ago

Yes, the segfaults cause the Jupyter kernel to crash.

antschum commented 2 years ago

Any update on this?