Closed jordansamuels closed 6 years ago
Thanks for reporting this and thanks for the example - this is important and we need to get to the bottom of the problem!
After you've moved the slider around for a bit and the memory usage has increased, what happens to memory usage if you then run `import gc; gc.collect()` - does it change?
@ceball The answer to your question is: it doesn't help. To be clear, the issue I'm reporting is about the increase on re-evaluating the notebook cell, not on moving the slider. That said, your comment prompted me to test the slider - that also leaks memory. I can slide 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, etc. and the memory keeps growing.
In either case, the slider or the cell, calling `gc.collect()` seems to have no effect on the memory.
Also, to be totally clear, I'm looking at `VIRT` and `RES` in htop for my measure of memory usage. I'm on Fedora release 20 (Heisenbug).
the issue I'm reporting is about increase on re-evaluating the notebook cell, not on moving the slider
Ah, sorry, I got distracted by the image loop :)
I then wondered if this could be related to the notebook storing all outputs in `Out`? Although presumably storing Layouts shouldn't increase memory usage much, so seeing a never-ending increase during multiple executions of the cell and during slider dragging does sound wrong!
calling gc.collect() seems to have no effect on the memory.
After I suggested that, I then began to remember that python doesn't necessarily return freed memory back to the operating system anyway. I'm out of touch with what current python versions on various operating systems do under different circumstances, though.
I should have added that I think someone might have to profile memory usage inside as well as outside python to find out what's going on; speculation like mine probably doesn't help at all! Although speculation from a holoviews developer might be more useful ;) My experience has been that e.g. pandas copies data when you might not expect it, and you don't know when python will free that memory, and then you don't know if that freed memory will ever make it back to the operating system (it does on some platforms, depending on version of python/c runtime, I think...).
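For the "inside python" half of that, the standard library's `tracemalloc` module can attribute allocations to Python code regardless of whether the OS ever reports the memory as returned. A minimal sketch (the list-of-lists allocation is just a stand-in for an unexpected dataframe copy):

```python
import tracemalloc

# Measure allocations from inside the interpreter, independent of htop/VIRT/RES.
tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()   # (current, peak) in bytes

payload = [list(range(1000)) for _ in range(100)]  # stand-in for an unexpected copy

after, _ = tracemalloc.get_traced_memory()
print(f"allocated ~{(after - before) // 1024} KiB inside the interpreter")
tracemalloc.stop()
```

Comparing `tracemalloc` numbers against the OS-level figures would show whether Python is still holding the memory or has freed it without the allocator returning it to the OS.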
(Although you mentioned holoviews 1.8.3 specifically - does that mean it's something that happens for you now, but didn't happen with previous versions?)
I said speculation isn't a good idea, but I can't help myself...
What happens if you replace `hv.Scatter(df[df.s == s], ...)` with just `hv.Scatter(df, ...)`? Does memory usage still keep going up?
In the notebook, all inputs and outputs are saved. For example, the first time a cell containing `a = 3; a` is executed, the input is stored as `In[1]` and the result as `Out[1]`, and both are then available in the same cell or in other cells. The integer index for `In[n]`/`Out[n]` increases on each execution of the cell. So, how big is `Out[n]` in your case?
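The accumulation this causes can be simulated outside the notebook with a plain dict standing in for IPython's `Out` cache (a sketch of the retention policy only - IPython's real cache is implemented via a displayhook):

```python
import sys

Out = {}             # stand-in for IPython's output cache
execution_count = 0

def run_cell(result):
    """Mimic the notebook: every displayed result is stored in Out[n]."""
    global execution_count
    execution_count += 1
    Out[execution_count] = result
    return result

# Re-running "the same cell" five times keeps five results alive at once:
for _ in range(5):
    run_cell(bytearray(10_000_000))      # each run produces a ~10 MB result

retained = sum(sys.getsizeof(v) for v in Out.values())
print(f"{len(Out)} cached outputs, ~{retained // 1_000_000} MB retained")

Out.clear()          # roughly the effect of %reset out
```

The point is that the kernel's memory grows linearly with the number of executions even though the user only ever "sees" the latest output.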
Here is an example:
Here the second cell was re-executed a bunch of times, but the notebook holds onto every intermediate output, even if the user doesn't define their own handle on it!
This isn't new notebook behavior and has nothing to do with HoloViews. That said, this keeps surprising me as the policy of holding every output ever shown in a session just doesn't seem sensible to me: it might have made sense back in the days when everything was simply a string repr, but it doesn't make sense in a world where people are visualizing large datasets.
A sensible policy would be to hold onto the last n outputs, up to some memory limit instead of expanding endlessly - maybe there is some such setting somewhere, I haven't looked into it yet.
One way to check whether it's the Jupyter caching that's causing this is to reset the `Out` variable using `%reset out`, which should delete all references to the outputs.
Yes, I agree, demonstrating whether or not it's actually the output caching seems like a good idea. However, even if that is the cause, you may not see memory returned to your operating system after clearing `Out` anyway (does anyone here know the status of that for CPython on Linux these days?). I'm just saying that it might need to be checked carefully (e.g. within Python as well as without).
The memory usage going up even just dragging the slider indicates it might not be output caching, maybe? (Or does dragging the slider cause a cell execution that jupyter could cache?)
In any case, output caching ought not to be an issue: holoviews is not creating new data, right? Each execution should not add more than a small amount to the memory being used (compared to the original data size). That's why I'm wondering if e.g. a pandas copy is happening.
I think this is using a `DynamicMap`, as `dynspread` and `datashade` are being used. `DynamicMap`s have a cache that will fill as you move the slider around, but you can easily control it with the `cache_size` parameter:
cache_size = param.Integer(default=500, doc="""
The number of entries to cache for fast access. This is an LRU
cache where the least recently used item is overwritten once
the cache is full.""")
As you can see, the default is 500 - try setting it to `1` to see if that helps...
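The eviction policy that docstring describes can be sketched in a few lines with `collections.OrderedDict`. This is only an illustration of LRU behavior and of what `cache_size = 1` implies, not HoloViews' actual implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache matching the policy the cache_size docstring describes."""
    def __init__(self, cache_size=500):
        self.cache_size = cache_size
        self._data = OrderedDict()

    def __setitem__(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)      # refresh recency on overwrite
        self._data[key] = value
        if len(self._data) > self.cache_size:
            self._data.popitem(last=False)   # evict the least recently used entry

    def __getitem__(self, key):
        self._data.move_to_end(key)          # a hit also refreshes recency
        return self._data[key]

    def __len__(self):
        return len(self._data)

    def __contains__(self, key):
        return key in self._data

cache = LRUCache(cache_size=1)          # like setting cache_size = 1
cache['frame_0'] = 'rendered output 0'
cache['frame_1'] = 'rendered output 1'  # evicts frame_0
print(len(cache), 'frame_0' in cache)
```

With `cache_size = 1`, only the most recently shown frame stays in memory, at the cost of recomputing any slider position you revisit.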
Showing my ignorance of holoviews...but what is it caching? Or to put it another way, when I use holoviews, when would I need to worry that holoviews will generate and/or store large amounts of data?
when would I need to worry that holoviews will generate and/or store large amounts of data?
In short, when using a `HoloMap`, everything is static and can be exported to a standalone, offline HTML file, which means everything is in memory.

For `DynamicMap`, the output is assumed to be a function of the arguments (a fixed output for a given set of arguments), which means that holoviews can look the item up in the cache if you revisit a particular slider position (instead of recomputing the same thing again). This cache is as big as `cache_size`, and you can set it to a low value if necessary (e.g. `1`); naturally, if the item isn't available in the cache, the value shown will need to be recomputed.
Hope that makes some sense!
Thanks for answering that. Sorry to be slightly hijacking this issue.
In short, when using a HoloMap everything is static and can be exported to a standalone, offline HTML file which means everything is in memory.
Say I have a huge dataset in "a dataframe" (which might be distributed via dask). If I'm using datashader with holoviews so that any individual plot is a reasonable size, then would I ever really need to worry about memory usage by holoviews?
You shouldn't need to worry, as HoloViews' datashader support uses `DynamicMap` and datashader returns RGB images that shouldn't be too big (i.e. sized sensibly for your screen resolution). I suppose it is true that a cache of 500 such RGB images could get quite memory intensive...
@philippjfr Maybe we should reduce the `cache_size` for the `DynamicMap`s returned by the datashader operations? Do you think the datashader output RGBs might be taking up lots of space in the cache?
Say I have a huge dataset in "a dataframe" (which might be distributed via dask). If I'm using datashader with holoviews so that any individual plot is a reasonable size, then would I ever really need to worry about memory usage by holoviews?
In theory that's accurate, although for interactive use an out-of-core dataframe can be a bit slow. I'm not sure caching the RGB outputs of the datashade operation is a real issue, since it only caches on key dimensions and datashading works via streams, so you should generally only ever have one value in the cache.
In general, working with an in-memory dask dataframe is quite efficient, and I'd recommend it whenever you think making in-memory copies of something might be a concern. Even a groupby on a large dataset shouldn't cause issues, because something like `nyc_taxi_dataset.groupby('dropoff_hour', dynamic=True/False)` would create a HoloMap/DynamicMap of Elements where the data are simply dask graphs selecting the subset of data for a particular hour, which means the subsets are only in memory while the data is actually being accessed for datashading or plotting. Doing the same with a large pandas dataframe would be quite wasteful, though, since it will make actual copies of each subset/group and insert them into the DynamicMap cache.
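Whether a selection like `df[df.s == s]` really copies can be checked directly with numpy's `shares_memory`. A small sketch using a synthetic dataframe (not the original data), assuming pandas and numpy are available:

```python
import numpy as np
import pandas as pd

# A dataframe with a grouping column, shaped loosely like the example's df.
df = pd.DataFrame({'s': np.repeat([1.0, 2.0], 50_000),
                   'y': np.random.randn(100_000)})

subset = df[df.s == 1.0]    # boolean indexing, as in the original example

# Boolean indexing materializes fresh buffers: the subset does not share
# memory with df, so every such selection adds its full size to the process.
shares = np.shares_memory(subset['y'].to_numpy(), df['y'].to_numpy())
size_mb = subset.memory_usage(deep=True).sum() / 1e6
print(f"shares memory with df: {shares}; subset is ~{size_mb:.1f} MB")
```

A dask dataframe defers the same selection as a task graph, so nothing is materialized until the data is actually needed.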
So my recommendation for large datasets is to use dask even if they still fit in memory. I'd also be open to decreasing the cache size since in practice I never use the cache in any meaningful way.
@jlstevens, ok, that's great - that's what I was expecting to hear :)
Since the original issue above says that as the data size is increased, the problem becomes worse, that to me implies it should not be related to hv caching of datashaded images (those won't change in size based on the size of the data, right?), and should also not be related to notebook caching of outputs (for the same reason: no large data being created/stored by hv).
that to me implies it should not be related to hv caching of datashaded images (those won't change in size based on the size of the data, right?), and should also not be related to notebook caching of outputs (for the same reason: no large data being created/stored by hv).
Not sure that's accurate: both the input and the output of the datashade operation are cached, so whenever you drag the slider it will cache both the raw dataframe and the aggregated image in memory. My bet is that it is indeed the combination of the DynamicMap cache and Jupyter output caching that's causing this.
Cell execution: In Jean-Luc's example, the input data is being created every time, so I understand that the notebook cache would grow every time running the hv.Image cell. However, in the original example, the input data is created before hv is involved: hv will just point to it, not copy it...right?! (My suggestion was that maybe the pandas indexing operation was causing a copy, or something like that. Yes, the images will be cached in memory, but they aren't large compared to the data, right?)
Dragging slider: isn't that the same in the original example? The input data is already created, so hv would just point to it, not copy it? (Again, except if the data's being copied for some reason.)
I'm definitely not going to bet against you. I should have taken my own advice and stopped speculating ;)
I should have read his example more closely - I thought the dataframe was being created inside a function, but as you point out it isn't. The other thing is that the example actually uses a HoloMap, so dragging the slider should make no real difference, since everything should be pre-allocated. I do wonder if `df[df.s == s]` is making copies for each subset though, in which case using a dask dataframe along with a dynamic groupby would be more efficient. Either way though, this:
hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df[df.s == s], kdims=[x,'y']))) for s in scales])))
for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
is better expressed as:
ds = hv.Dataset(df)
hv.Layout([dynspread(datashade(ds.to(hv.Scatter, x, 'y', 's').map(pad_scatter, hv.Scatter)))
for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
... datashading works via streams, so you should generally only ever have one value in the cache.
Good point! I forgot about that important detail - the `DynamicMap` cache shouldn't be the issue here.
Update: it appears that this is largely a system install issue! After some more testing, it appears that the only box this occurs on is a Linux box we use at work that has a more ad hoc installation of holoviews, etc. When I run on OSX or on a Linux setup created via a self-contained `environment.yml` and a fresh conda install, I don't get the same issues. I'll work with our sysadmins to dig in.
I wanted to back up my suggestion that there's a copy happening...
Outside of the notebook, if I run the OP's "Evaluation cell" in a loop like this:
print("a", time.time() - t0)
for i in range(repeats):
    if copy:
        hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df[df.s == s], kdims=[x, 'y']))) for s in scales]))) for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
    else:
        hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df, kdims=[x, 'y']))) for s in scales]))) for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
    print(i, time.time() - t0)
print("b", time.time() - t0)
I see the following with 'copy':
(hvdev) [170907 202555]~/code/ioam/holoviews2$ mprof run -T 0.2 testmem.py 5 1
mprof: Sampling memory every 0.2s
running as a Python program...
repeats=5 copy=True
df is around 218 MB
a 2.94058895111084
0 4.017941951751709
1 5.142473936080933
2 6.26465106010437
3 7.38116192817688
4 8.502644062042236
b 8.502682209014893
And without 'copy':
(hvdev) [170907 202636]~/code/ioam/holoviews2$ mprof run -T 0.2 testmem.py 5 0
mprof: Sampling memory every 0.2s
running as a Python program...
repeats=5 copy=False
df is around 218 MB
a 3.2328739166259766
0 6.466661214828491
1 9.706587076187134
2 12.81578016281128
3 15.873624086380005
4 19.001055002212524
b 19.001084089279175
Not that it's very exciting, but I've attached the script. mprof is memory_profiler 0.47 (https://pypi.python.org/pypi/memory_profiler). I'm using a mac with 16 GB ram.
Presumably in the first case python could at some point garbage collect the objects, but even then whether the memory would be "returned to the operating system" is not clear to me (I think it might depend on the platform).
Actually, here's the script, because I couldn't attach it as a .py file:
import time
t0 = time.time()

import sys
repeats, copy = sys.argv[1:]
repeats = int(repeats)
copy = False if copy == '0' else True
print("repeats=%s" % repeats, "copy=%s" % copy)

############################################################
### code from issue
import pandas as pd
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import aggregate, shade, datashade, dynspread

hv.extension('bokeh')

n, k = 1_000_000, 4
scales = np.linspace(1, 10, k)
df = pd.concat([s * pd.DataFrame({
    'x1': np.random.randn(n),
    'x2': np.abs(np.random.randn(n)),
    'x3': np.random.chisquare(1, n),
    'x4': np.random.uniform(0, s, n),
    'y':  np.random.randn(n),
    's':  np.full(n, 1)
}) for s in scales])

def extend_range(p, frac):
    a, b = np.min(p), np.max(p)
    m, l = (a + b) / 2, (b - a) / 2
    rv = (m - frac * l, m + frac * l)
    return rv

def pad_scatter(s: hv.Scatter, frac=1.05):
    df = s.dframe()
    r = {d.name: extend_range(df[d.name], frac) for d in (s.kdims + s.vdims)[0:2]}
    return s.redim.range(**r)

print(f'df is around {sys.getsizeof(df) // 1024_000} MB')
############################################################

print("a", time.time() - t0)
for i in range(repeats):
    if copy:
        hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df[df.s == s], kdims=[x, 'y']))) for s in scales]))) for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
    else:
        hv.Layout([dynspread(datashade(hv.HoloMap([(s, pad_scatter(hv.Scatter(df, kdims=[x, 'y']))) for s in scales]))) for x in ['x1', 'x2', 'x3', 'x4']]).cols(2)
    print(i, time.time() - t0)
print("b", time.time() - t0)
I also meant to add that I think this issue could be closed, because it doesn't seem to be a problem specific to holoviews. (The issue could always be reopened if new evidence is provided pointing at holoviews doing something wrong.)
@philippjfr If you can confirm that Chris is correct in saying that there is a memory leak, but it isn't HoloViews, then we should close this issue. I'm wondering if we should mention something about this in the docs, but I'm not quite sure where.
I think the issue should only be closed with no further action if it was all just a confusion. Whether or not the problem is HoloViews' fault, if there is indeed a problem, then either we need to mention it in the docs as a caveat, we need to explain how to avoid it in the docs, or we need to chase some other project to get it fixed. I can't tell from a quick scan of the above which one of these it is.
If you can confirm that Chris is correct in saying that there is a memory leak
I didn't actually say that :) I said there's a copy happening, and it's happening outside of holoviews (in "user-level pandas code").
I also said:
Presumably in the first case [where pandas copy is happening] python could at some point garbage collect the objects, but even then whether the memory would be "returned to the operating system" is not clear to me (I think it might depend on the platform).
As in, I was speculating that maybe hv creates a cycle (involving the dataframe), so the memory won't be freed until garbage collection happens? (Do you know of such a cycle in hv?) And also that even if the memory is "freed" (i.e. the copies of the dataframe get garbage collected), I'm not sure whether or not you will see it by looking at the operating system's report of memory usage by python.
However, beyond the dataframe copy by pandas, all that's just speculation: you may be right to refer to what's happening as a leak, but I guess we'd need to demonstrate that the memory just keeps growing forever in the loop/is never available again for python to use (even after gc runs). So I probably did stop too early to say there's no problem in holoviews.
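The reference-cycle half of that speculation is at least easy to demonstrate in isolation: an object in a cycle is not freed by reference counting when its last external handle goes away, only when the cyclic collector runs. This is a generic CPython illustration, not a claim about HoloViews internals:

```python
import gc
import weakref

class Holder:
    """Stand-in for an object holding a large payload (e.g. a dataframe)."""
    def __init__(self):
        self.payload = bytearray(1_000_000)
        self.cycle = self          # self-reference creates a cycle

gc.disable()                       # isolate refcounting from the cyclic GC
obj = Holder()
alive = weakref.ref(obj)

del obj                            # last external handle dropped...
print(alive() is not None)         # ...but the cycle keeps the object alive

gc.collect()                       # the cyclic collector frees it
print(alive() is None)
gc.enable()
```

So "gc.collect() has no effect" argues against a simple cycle like this one, but it still says nothing about whether freed memory is handed back to the operating system.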
This seems to be largely down to the fact that the user code made copies of the underlying data, so I'm going to close. I don't think it is particularly surprising that copies of a dataframe made in a non-lazy way using `df[df.s == s]` keep increasing memory usage when there is a persistent handle on the object, so I don't think documenting this would be particularly helpful either.
This issue might be related to the Bokeh issue: https://github.com/bokeh/bokeh/issues/8626 Just leaving this here for findability.
I have a similar problem with tf2, pandas and matplotlib, and `%reset out` actually helped when `gc.collect()` did not.
Thx, @philippjfr
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I believe that with HoloViews 1.8.3 on Jupyter there is a non-trivial memory leak when repeatedly executing a cell. This creates an inconvenience when working/refining iteratively with large data sets.
I'm reporting this issue based on a real-world, although admittedly somewhat complex, use case, and I'll admit that I'm not sure I'm using HoloViews correctly. I'm seeing that as I repeatedly execute cells in a Jupyter notebook, the memory usage for the kernel grows without bound. This issue exists whether or not I'm using datashading and large datasets; since the memory increase is proportional to the data size, it's a lot more noticeable/problematic when there is a lot of data, so I'll focus on that case.
I'm combining several techniques in order to create a rich, user-friendly interface for reviewing my data. (Kudos to HoloViews for being able to do this at all!) The techniques are:
- `redim` and `framewise` to ensure that all displays are scaled properly

I've supplied some code below that is a non-proprietary repro of my use case. It definitely shows the same pattern of increased kernel memory for each cell invocation. Again, I'll say that I wrote it through trial and error, and I am by no means sure that I'm not abusing something and/or that there is a better way to accomplish the same things with HoloViews.
Initialization cell
Running this cell, I get
and my Jupyter kernel is around 1831M.
Evaluation cell
This gives me a very beautiful scaled layout of shaded scatters.
However, the memory usage as I repeatedly evaluate that cell in the notebook is: 2717M, 3455M, 4441M, 5307M etc.
In reality I'm working with much more data (dataframes of around 10-30GB), and even though I'm on a pretty beefy machine, it starts to become a fairly big problem as I poke around and do trial-and-error exploration. In reality I find myself having to restart the kernel pretty often.
I'm not using dask - maybe I should be - but I'm not sure that would fix the issue.
This issue does not appear to be related to datashader or the large size of the data. If I run something similar with much smaller `n` and using only a `HoloMap` instead of datashading, I see a similar increase in memory - just with an obviously much smaller slope because `n` is smaller.