holoviz / holoviews

With Holoviews, your data visualizes itself.
https://holoviews.org

Memory leak / increasing usage in Jupyter for repeated cell execution #1821

Closed by jordansamuels 6 years ago

jordansamuels commented 7 years ago

I believe that with HoloViews 1.8.3 on Jupyter there is a non-trivial memory leak when repeatedly executing a cell. This creates an inconvenience when working/refining iteratively with large data sets.

I'm reporting this issue based on a real-world, although admittedly somewhat complex, use case, and I'll admit that I'm not sure I'm using HoloViews correctly. I'm seeing that as I repeatedly execute cells in a Jupyter notebook, the memory usage of the kernel grows without bound. The issue exists whether or not I'm using datashading and large datasets, but since the memory increase is proportional to the data size, it's much more noticeable/problematic when there is a lot of data, so I'll focus on that case.

I'm combining several techniques in order to create a rich, user-friendly interface for reviewing my data. (Kudos to HoloViews for being able to do this at all!) As the code below shows, the techniques are:

- an hv.HoloMap keyed on a scale parameter
- datashade and dynspread from holoviews.operation.datashader
- custom axis-range padding via redim.range
- an hv.Layout to arrange the shaded scatters

I've supplied some code below that is a non-proprietary repro of my use case. It definitely shows the same pattern of increased kernel memory for each cell invocation. Again, I'll say that I wrote it through trial and error, and I am by no means sure that I'm not abusing something and/or that there is a better way to accomplish the same things with HoloViews.

Initialization cell

import pandas as pd
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import aggregate, shade, datashade, dynspread
import sys

hv.extension('bokeh')

n,k = 1_000_000,4
scales=np.linspace(1,10,k)

df = pd.concat([s * pd.DataFrame({
    'x1' : np.random.randn(n),
    'x2' : np.abs(np.random.randn(n)),
    'x3' : np.random.chisquare(1, n),
    'x4' : np.random.uniform(0,s,n),
    'y' : np.random.randn(n),
    's' : np.full(n, 1) 
}) for s in scales])

def extend_range(p, frac):
    a, b = np.min(p), np.max(p)
    m, l = (a + b) / 2, (b - a) / 2
    rv = (m - frac * l, m + frac * l)
    return rv

def pad_scatter(s: hv.Scatter, frac=1.05):
    df = s.dframe()
    r = {d.name: extend_range(df[d.name], frac) for d in (s.kdims + s.vdims)[0:2]}
    return s.redim.range(**r)

print(f'df is around {sys.getsizeof(df) // 1024_000} MB')

Running this cell, I get

df is around 218 MB

and my Jupyter kernel is around 1831M.

Evaluation cell

%%opts RGB {+framewise}
hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df[df.s == s], kdims=[x,'y']))) for s in scales])))
           for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)

This gives me a very beautiful scaled layout of shaded scatters.

[screenshot: 2x2 layout of datashaded scatters]

However, the memory usage as I repeatedly evaluate that cell in the notebook is: 2717M, 3455M, 4441M, 5307M etc.

In reality I'm working with much more data (dataframes of around 10-30GB), and even though I'm on a pretty beefy machine, it starts to become a fairly big problem as I poke around and do trial-and-error exploration. In reality I find myself having to restart the kernel pretty often.

I'm not using dask - maybe I should be - but I'm not sure that would fix the issue.

This issue does not appear to be related to datashader or the large size of the data. If I run something similar with much smaller n and using only a HoloMap instead of datashading, I see a similar increase in memory - just obviously a much smaller slope because n is smaller.

jlstevens commented 7 years ago

Thanks for reporting this and thanks for the example - this is important and we need to get to the bottom of the problem!

ceball commented 7 years ago

After you've moved the slider around for a bit and the memory usage has increased, what happens to memory usage if you then run import gc; gc.collect() - does it change?

jordansamuels commented 7 years ago

@ceball The answer to your question is: it doesn't help. To be clear, the issue I'm reporting is about increase on re-evaluating the notebook cell, not on moving the slider. That said, your comment prompted me to test the slider - that also leaks memory. I can slide 1,2,3,4,3,2,1,2,3,4 etc. and the memory keeps growing.

In either case, the slider or the cell, calling gc.collect() seems to have no effect on the memory.

Also, to be totally clear, I'm looking at VIRT and RES in htop for my measure of memory usage. I'm on Fedora release 20 (Heisenbug).
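(A hedged way to cross-check htop from inside the kernel, using only the standard library; note this reports the peak resident set size, and the units differ by platform:)

import resource
# ru_maxrss is peak RSS: kilobytes on Linux, bytes on macOS
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f'peak RSS ~ {peak_kb // 1024} MB (assuming Linux units)')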

ceball commented 7 years ago

the issue I'm reporting is about increase on re-evaluating the notebook cell, not on moving the slider

Ah, sorry, I got distracted by the image loop :)

I then wondered if this could be related to the notebook storing all outputs in Out. Although presumably storing Layouts shouldn't increase memory usage much, a never-ending increase during multiple executions of the cell and during slider dragging does sound wrong!

calling gc.collect() seems to have no effect on the memory.

After I suggested that, I then began to remember that python doesn't necessarily return freed memory back to the operating system anyway. I'm out of touch with what current python versions on various operating systems do under different circumstances, though.

ceball commented 7 years ago

I should have added that I think someone might have to profile memory usage inside as well as outside python to find out what's going on; speculation like mine probably doesn't help at all! Although speculation from a holoviews developer might be more useful ;) My experience has been that e.g. pandas copies data when you might not expect it, and you don't know when python will free that memory, and then you don't know if that freed memory will ever make it back to the operating system (it does on some platforms, depending on version of python/c runtime, I think...).

(Although you mentioned holoviews 1.8.3 specifically - does that mean it's something that happens for you now, but didn't happen with previous versions?)

ceball commented 7 years ago

I said speculation isn't a good idea, but I can't help myself...

What happens if you replace hv.Scatter(df[df.s == s], ...) with just hv.Scatter(df, ...)? Does memory usage still keep going up?

ea42gh commented 7 years ago

In the notebook, all inputs and outputs are saved. The first time the first cell is executed:

In[1]: a=3; a
Out[1]: 3

In the same cell or in other cells, both In[1] and Out[1] will be available. The integer index n in In[n] and Out[n] increases on each execution of the cell. So, how big is Out[n] in your case?
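(A hedged way to check that from inside the notebook; Out is the dict IPython maintains, and sys.getsizeof only measures the top-level object, so treat the numbers as a rough lower bound:)

import sys
for n, obj in sorted(Out.items()):
    print(n, type(obj).__name__, sys.getsizeof(obj), 'bytes (shallow)')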

jlstevens commented 7 years ago

Here is an example:

[notebook screenshot]

Here the second cell was re-executed a bunch of times, but the notebook holds onto every intermediate output, even if the user doesn't define their own handle on it!

This isn't new notebook behavior and has nothing to do with HoloViews. That said, this keeps surprising me as the policy of holding every output ever shown in a session just doesn't seem sensible to me: it might have made sense back in the days when everything was simply a string repr, but it doesn't make sense in a world where people are visualizing large datasets.

A sensible policy would be to hold onto the last n outputs, up to some memory limit instead of expanding endlessly - maybe there is some such setting somewhere, I haven't looked into it yet.
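(For reference, IPython does expose a setting roughly like this; a hedged sketch, assuming a standard profile layout:)

# in ipython_kernel_config.py (or ipython_config.py):
# limit how many Out[n] entries are kept; 0 disables output caching entirely
c.InteractiveShell.cache_size = 3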

philippjfr commented 7 years ago

One way to check whether it's the Jupyter caching that's causing this is to reset the out variable using %reset out, which should delete all references to the output.
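(A hedged usage sketch for a scratch cell; -f just skips the confirmation prompt:)

%reset -f out   # flush the notebook's Out cache
import gc
gc.collect()    # then force a collection to see whether memory is actually released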

ceball commented 7 years ago

Yes, I agree, demonstrating whether or not it's actually the output caching seems like a good idea. However, even if that is the cause, you may not see memory returned to your operating system after clearing Out anyway (does anyone here know the status of that for cpython on linux these days?). I'm just saying that it might need to be checked carefully (e.g. within python as well as without).

The memory usage going up even just dragging the slider indicates it might not be output caching, maybe? (Or does dragging the slider cause a cell execution that jupyter could cache?)

In any case, output caching ought not to be an issue: holoviews is not creating new data, right? Each execution should not add more than a small amount to the memory being used (compared to the original data size). That's why I'm wondering if e.g. a pandas copy is happening.

jlstevens commented 7 years ago

I think this is using a DynamicMap, since dynspread and datashade are being used. DynamicMaps have a cache that will fill as you move the slider around, but you can easily control it with the cache_size parameter:

    cache_size = param.Integer(default=500, doc="""
       The number of entries to cache for fast access. This is an LRU
       cache where the least recently used item is overwritten once
       the cache is full.""")

As you can see, the default is 500 - try setting it to 1 to see if that helps...
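(A hedged sketch of what that looks like, assuming hmap is one of the HoloMaps from the example above:)

dmap = dynspread(datashade(hmap))   # the datashader operations return a DynamicMap
dmap.cache_size = 1                 # keep only the most recently computed frame in the LRU cache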

ceball commented 7 years ago

Showing my ignorance of holoviews...but what is it caching? Or to put it another way, when I use holoviews, when would I need to worry that holoviews will generate and/or store large amounts of data?

jlstevens commented 7 years ago

when would I need to worry that holoviews will generate and/or store large amounts of data?

In short, when using a HoloMap everything is static and can be exported to a standalone, offline HTML file which means everything is in memory.

For a DynamicMap, the output is assumed to be a pure function of its arguments (a fixed output for a given set of key values), which means that HoloViews can look the item up in the cache if you revisit a particular slider position (instead of recomputing the same thing again). This cache holds up to cache_size entries and you can set it to a low value if necessary (e.g. 1); naturally, if an item isn't available in the cache, the value shown has to be recomputed.

Hope that makes some sense!
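(A minimal sketch of the distinction, using a toy make_scatter function rather than the data from the issue:)

import numpy as np
import holoviews as hv

def make_scatter(s):
    return hv.Scatter(np.random.randn(100, 2) * s)

# HoloMap: every frame is built up front and held in memory (exportable to static HTML)
hmap = hv.HoloMap({s: make_scatter(s) for s in range(1, 5)}, kdims='s')

# DynamicMap: frames are computed on demand and kept in an LRU cache of up to cache_size entries
dmap = hv.DynamicMap(make_scatter, kdims=[hv.Dimension('s', range=(1, 4))])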

ceball commented 7 years ago

Thanks for answering that. Sorry to be slightly hijacking this issue.

In short, when using a HoloMap everything is static and can be exported to a standalone, offline HTML file which means everything is in memory.

Say I have a huge dataset in "a dataframe" (which might be distributed via dask). If I'm using datashader with holoviews so that any individual plot is a reasonable size, then would I ever really need to worry about memory usage by holoviews?

jlstevens commented 7 years ago

You shouldn't need to worry, as HoloViews' datashader support uses DynamicMap, and datashader returns RGB images that shouldn't be too big (i.e. sized sensibly for your screen resolution). I suppose it is true that a cache of 500 such RGB images could get quite memory intensive...

@philippjfr Maybe we should reduce the cache_size for the DynamicMaps returned by the datashader operations? Do you think the datashader output RGBs might be taking up lots of space in the cache?

philippjfr commented 7 years ago

Say I have a huge dataset in "a dataframe" (which might be distributed via dask). If I'm using datashader with holoviews so that any individual plot is a reasonable size, then would I ever really need to worry about memory usage by holoviews?

In theory that's accurate, although for interactive use an out-of-core dataframe can be a bit slow. I'm not sure caching the RGB outputs of the datashade operation is a real issue, since it only caches on key dimensions and datashading works via streams, so you should generally only ever have one value in the cache.

In general working with an in-memory dask dataframe is quite efficient and I'd recommend working with them whenever you think making in memory copies of something might be a concern. Even a groupby on a large dataset shouldn't cause issues because something like nyc_taxi_dataset.groupby('dropoff_hour', dynamic=True/False) would create a HoloMap/DynamicMap of Elements, where the data are simply dask graphs selecting the subset of data for a particular hour, which means the subsets are only in memory while the data is actually accessed for datashading or plotting. Doing the same using a large pandas dataframe would be quite wasteful though since it will make actual copies of each subset/group and insert them into the DynamicMap cache.

So my recommendation for large datasets is to use dask even if they still fit in memory. I'd also be open to decreasing the cache size since in practice I never use the cache in any meaningful way.
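(A hedged sketch of that recommendation, reusing df from the original example; the number of partitions is arbitrary:)

import dask.dataframe as dd
import holoviews as hv

ddf = dd.from_pandas(df, npartitions=8)   # wrap the in-memory frame in dask
ds = hv.Dataset(ddf)
dmap = ds.groupby('s', dynamic=True)      # DynamicMap of lazy subsets (dask graphs), materialized only when plotted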

ceball commented 7 years ago

@jlstevens, ok, that's great - that's what I was expecting to hear :)

Since the original issue above says that as the data size is increased, the problem becomes worse, that to me implies it should not be related to hv caching of datashaded images (those won't change in size based on the size of the data, right?), and should also not be related to notebook caching of outputs (for the same reason: no large data being created/stored by hv).

philippjfr commented 7 years ago

that to me implies it should not be related to hv caching of datashaded images (those won't change in size based on the size of the data, right?), and should also not be related to notebook caching of outputs (for the same reason: no large data being created/stored by hv).

Not sure that's accurate, both the input and the output of the datashade operation are cached so whenever you drag the slider it will cache both the raw dataframe and the aggregated image in memory. My bet is that it is indeed the combination of the DynamicMap cache and Jupyter output caching that's causing this.

ceball commented 7 years ago

Cell execution: In Jean-Luc's example, the input data is being created every time, so I understand that the notebook cache would grow every time running the hv.Image cell. However, in the original example, the input data is created before hv is involved: hv will just point to it, not copy it...right?! (My suggestion was that maybe the pandas indexing operation was causing a copy, or something like that. Yes, the images will be cached in memory, but they aren't large compared to the data, right?)

Dragging slider: isn't that the same in the original example? The input data is already created, so hv would just point to it, not copy it? (Again, except if the data's being copied for some reason.)

I'm definitely not going to bet against you. I should have taken my own advice and stopped speculating ;)
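(A quick, hedged way to check the copy question independently of holoviews, on a toy frame rather than the one from the issue:)

import numpy as np
import pandas as pd

toy = pd.DataFrame({'s': np.repeat([1, 2], 5), 'y': np.random.randn(10)})
sub = toy[toy.s == 1]   # boolean-mask selection, as in the original example
print(np.shares_memory(sub['y'].values, toy['y'].values))   # False: the subset owns a fresh copy of the data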

philippjfr commented 7 years ago

I should have read his example more closely; I thought the dataframe was being created inside a function, but as you point out it isn't. The other thing is that the example actually uses a HoloMap, so dragging the slider should make no real difference, since everything should be pre-allocated. I do wonder if df[df.s == s] is making copies for each subset though, in which case using a dask dataframe along with a dynamic groupby would be more efficient. Either way though, this:

hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df[df.s == s], kdims=[x,'y']))) for s in scales])))
          for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)

is better expressed as:

ds = hv.Dataset(df)
hv.Layout([dynspread(datashade(ds.to(hv.Scatter, x, 'y', 's').map(pad_scatter, hv.Scatter)))
          for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
jlstevens commented 7 years ago

... datashading works via streams, so you should generally only ever have one value in the cache.

Good point! I forgot about that important detail - the DynamicMap cache shouldn't be the issue here.

jordansamuels commented 7 years ago

Update: it appears that this is largely a system install issue! After some more testing, it appears that the only box this occurs on is a Linux box we use at work that has a more ad hoc installation of holoviews, etc. When I run on OSX, or on a Linux setup created from a self-contained environment.yml with a fresh conda install, I don't get the same issues. I'll work with our sysadmins to dig in.

ceball commented 7 years ago

I wanted to back up my suggestion that there's a copy happening...

Outside of the notebook, if I run the OP's "Evaluation cell" in a loop like this:

print("a",time.time()-t0)

for i in range(repeats):
    if copy:
        hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df[df.s == s], kdims=[x,'y']))) for s in scales]))) for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
    else:
        hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df           , kdims=[x,'y']))) for s in scales]))) for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
    print(i,time.time()-t0)

print("b",time.time()-t0)

I see the following with 'copy':

(hvdev) [170907 202555]~/code/ioam/holoviews2$ mprof run -T 0.2 testmem.py 5 1
mprof: Sampling memory every 0.2s
running as a Python program...
repeats=5 copy=True
df is around 218 MB
a 2.94058895111084
0 4.017941951751709
1 5.142473936080933
2 6.26465106010437
3 7.38116192817688
4 8.502644062042236
b 8.502682209014893

[mprof memory-usage plot for the copy=True run]

And without 'copy':

(hvdev) [170907 202636]~/code/ioam/holoviews2$ mprof run -T 0.2 testmem.py 5 0
mprof: Sampling memory every 0.2s
running as a Python program...
repeats=5 copy=False
df is around 218 MB
a 3.2328739166259766
0 6.466661214828491
1 9.706587076187134
2 12.81578016281128
3 15.873624086380005
4 19.001055002212524
b 19.001084089279175

[mprof memory-usage plot for the copy=False run]

Not that it's very exciting, but I've attached the script. mprof is memory_profiler 0.47 (https://pypi.python.org/pypi/memory_profiler). I'm using a mac with 16 GB ram.

Presumably in the first case python could at some point garbage collect the objects, but even then whether the memory would be "returned to the operating system" is not clear to me (I think it might depend on the platform).

ceball commented 7 years ago

Actually, here's the script, because I couldn't attach it as a .py file:

import time
t0 = time.time()

import sys
repeats, copy  = sys.argv[1::]
repeats = int(repeats)
copy = False if copy=='0' else True
print("repeats=%s"%repeats,"copy=%s"%copy)

############################################################
### code from issue

import pandas as pd
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import aggregate, shade, datashade, dynspread
import sys

hv.extension('bokeh')

n,k = 1_000_000,4
scales=np.linspace(1,10,k)

df = pd.concat([s * pd.DataFrame({
    'x1' : np.random.randn(n),
    'x2' : np.abs(np.random.randn(n)),
    'x3' : np.random.chisquare(1, n),
    'x4' : np.random.uniform(0,s,n),
    'y' : np.random.randn(n),
    's' : np.full(n, 1) 
}) for s in scales])

def extend_range(p, frac):
    a, b = np.min(p), np.max(p)
    m, l = (a + b) / 2, (b - a) / 2
    rv = (m - frac * l, m + frac * l)
    return rv

def pad_scatter(s: hv.Scatter, frac=1.05):
    df = s.dframe()
    r = {d.name: extend_range(df[d.name], frac) for d in (s.kdims + s.vdims)[0:2]}
    return s.redim.range(**r)

print(f'df is around {sys.getsizeof(df) // 1024_000} MB')

############################################################

print("a",time.time()-t0)

for i in range(repeats):
    if copy:
        hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df[df.s == s], kdims=[x,'y']))) for s in scales]))) for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
    else:
        hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df           , kdims=[x,'y']))) for s in scales]))) for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
    print(i,time.time()-t0)

print("b",time.time()-t0)
ceball commented 7 years ago

I also meant to add that I think this issue could be closed, because it doesn't seem to be a problem specific to holoviews. (The issue could always be reopened if new evidence is provided pointing at holoviews doing something wrong.)

jlstevens commented 7 years ago

@philippjfr If you can confirm that Chris is correct in saying that there is a memory leak, but it isn't HoloViews, then we should close this issue. I'm wondering if we should mention something about this in the docs, but I'm not quite sure where.

jbednar commented 7 years ago

I think the issue should only be closed with no further action if it was all just a confusion. Whether or not the problem is HoloViews' fault, if there is indeed a problem then either we need to mention it in the docs as a caveat, we need to explain how to avoid it in the docs, or we need to chase some other project to get it fixed. I can't tell from a quick scan of the above which one of these it is.

ceball commented 7 years ago

If you can confirm that Chris is correct in saying that there is a memory leak

I didn't actually say that :) I said there's a copy happening, and it's happening outside of holoviews (in "user-level pandas code").

I also said:

Presumably in the first case [where pandas copy is happening] python could at some point garbage collect the objects, but even then whether the memory would be "returned to the operating system" is not clear to me (I think it might depend on the platform).

As in, I was speculating that maybe hv creates a cycle (involving the dataframe), so the memory won't be freed until garbage collection happens? (Do you know of such a cycle in hv?) And also that even if the memory is "freed" (i.e. the copies of the dataframe get garbage collected), I'm not sure whether or not you will see it by looking at the operating system's report of memory usage by python.

However, beyond the dataframe copy by pandas, all that's just speculation: you may be right to refer to what's happening as a leak, but I guess we'd need to demonstrate that the memory just keeps growing forever in the loop/is never available again for python to use (even after gc runs). So I probably did stop too early to say there's no problem in holoviews.

philippjfr commented 6 years ago

This seems to be largely down to the fact that the user code made copies of the underlying data, so I'm going to close. I don't think it is particularly surprising that copies of a dataframe made in a non-lazy way using df[df.s == s] keep increasing memory usage when there is a persistent handle on the object, so I don't think documenting this would be particularly helpful either.

NumesSanguis commented 5 years ago

This issue might be related to Bokeh issue https://github.com/bokeh/bokeh/issues/8626. Just leaving this here for findability.

banderlog commented 3 years ago

I have a similar problem with tf2, pandas and matplotlib, and %reset out actually helped when gc.collect() did not. Thx, @philippjfr

Links for understanding how it works:

github-actions[bot] commented 1 week ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.