Closed lgray closed 9 months ago
This is true, and the memory use in `uproot.dask` is over and above `uproot.open` (and getting the TTree metadata).
```python
import gc
import psutil
import uproot

this_process = psutil.Process()

def memory_diff(task):
    gc.disable()
    gc.collect()
    start_memory = this_process.memory_full_info().uss
    task()
    gc.collect()
    stop_memory = this_process.memory_full_info().uss
    gc.enable()
    return stop_memory - start_memory

def task():
    with uproot.open(
        {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
    ) as tree:
        pass

for _ in range(200):
    print(f"{memory_diff(task) * 1e-6:.3f} MB")
```
reports
28.156 MB
1.483 MB
0.000 MB
0.008 MB
0.012 MB
0.004 MB
3.932 MB
0.262 MB
0.000 MB
0.000 MB
0.000 MB
-1.040 MB
0.000 MB
0.803 MB
0.246 MB
0.000 MB
0.000 MB
0.000 MB
0.000 MB
-1.040 MB
0.807 MB
0.242 MB
0.004 MB
0.000 MB
0.000 MB
0.000 MB
...
Change the `task` to

```python
def task():
    tree = uproot.open(
        {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
    )
```
so that it leaks file handles, and it's
26.059 MB
1.499 MB
0.004 MB
0.012 MB
0.033 MB
0.004 MB
0.000 MB
-0.012 MB
6.046 MB
0.258 MB
0.004 MB
0.000 MB
-1.049 MB
0.008 MB
-1.049 MB
0.000 MB
0.000 MB
...
We'd eventually run out of file handles this way, but apparently not memory (on the MB scale).
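The handle-leak difference between the two `task` variants comes down to the context manager. A minimal stdlib illustration of the same pattern (using the built-in `open` on a throwaway temp file, not uproot):

```python
import tempfile

# A throwaway file to open (stdlib only, no uproot involved).
path = tempfile.NamedTemporaryFile(delete=False).name

# Leaky pattern: the handle stays open until the object is garbage-collected.
leaked = open(path)
print(leaked.closed)  # False

# Context-manager pattern: the handle is closed deterministically on exit,
# which is what the `with uproot.open(...)` version above guarantees.
with open(path) as f:
    pass
print(f.closed)  # True

leaked.close()  # clean up the deliberately leaked handle
```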
Now
```python
def task():
    lazy = uproot.dask(
        {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
    )
```
87.364 MB
27.988 MB
25.252 MB
28.987 MB
25.227 MB
26.903 MB
25.219 MB
30.024 MB
23.155 MB
27.898 MB
26.325 MB
28.975 MB
24.158 MB
28.991 MB
...
This is a problem. (Also, it's noticeably slower, though there might be good reasons for that.)
Using Pympler,
>>> import gc
>>> import pympler.tracker
>>> import uproot
>>>
>>> summary_tracker = pympler.tracker.SummaryTracker()
>>>
>>> # run it once to get past the necessary first-time things (filling uproot.classes, etc.)
>>> lazy = uproot.dask(
... {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> del lazy
>>> gc.collect()
0
>>> # run print_diff enough times to get to the quiescent state
>>> summary_tracker.print_diff()
...
>>> summary_tracker.print_diff()
types | # objects | total size
======= | =========== | ============
>>>
>>> # what does an Uproot Dask array bring in?
>>> lazy = uproot.dask(
... {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> summary_tracker.print_diff()
types | # objects | total size
================================================= | =========== | ============
dict | 72059 | 11.54 MB
bytes | 3 | 5.66 MB
list | 24007 | 1.73 MB
numpy.int64 | 37491 | 1.14 MB
uproot.source.cursor.Cursor | 21000 | 984.38 KB
numpy.ndarray | 4501 | 492.30 KB
str | 4886 | 436.10 KB
uproot.models.TObject.Model_TObject | 5999 | 281.20 KB
tuple | 3385 | 147.00 KB
uproot.models.TObjArray.Model_TObjArray | 3000 | 140.62 KB
uproot.models.TNamed.Model_TNamed | 2999 | 140.58 KB
frozenset | 1 | 128.21 KB
int | 3492 | 95.51 KB
awkward._nplikes.typetracer.TypeTracerArray | 1878 | 88.03 KB
uproot.models.TTree.Model_ROOT_3a3a_TIOFeatures | 1500 | 70.31 KB
>>>
>>> # what goes away when we delete it?
>>> del lazy
>>> gc.collect()
14
>>> gc.collect()
0
>>> summary_tracker.print_diff()
types | # objects | total size
============================================== | =========== | ============
code | 0 | 37 B
aiohttp.helpers.TimerContext | -1 | -48 B
awkward.contents.recordarray.RecordArray | -1 | -48 B
dask.highlevelgraph.HighLevelGraph | -1 | -48 B
dask_awkward.utils.LazyInputsDict | -1 | -48 B
dask.blockwise.BlockwiseDepDict | -1 | -48 B
awkward.highlevel.Array | -1 | -48 B
dask_awkward.layers.layers.AwkwardInputLayer | -1 | -48 B
dask_awkward.lib.core.Array | -1 | -48 B
asyncio.trsock.TransportSocket | -2 | -80 B
ssl.SSLObject | -2 | -96 B
aiohttp.streams.StreamReader | -2 | -96 B
asyncio.sslproto.SSLProtocol | -2 | -96 B
asyncio.sslproto._SSLPipe | -2 | -96 B
bytearray | -2 | -112 B
Hardly anything goes away when `lazy` is deleted! That's not good!
This TTree has 1499 TBranches. So having approximately that many TIOFeatures and TypeTracerArrays, twice that many Model_TNamed and Model_TObjArray (TBranch and TLeaf), and four times as many Model_TObject makes sense.
There are only 3 bytes objects, but they comprise 5.66 MB. I don't know, offhand, what they could be, but I think they're more likely Uproot than Dask. There are a lot of big dicts, which is not too surprising, and I can't say offhand whether I expect more in Uproot or more in Dask.
The one major problem is that `del lazy` followed by `gc.collect()` does not get rid of as many objects as were brought in. It may be reasonable for the dask-awkward array of a large TTree to be 30 MB, but it's not reasonable for it to still be around after deleting.
Who gets a reference to it and doesn't let go? It might be possible to find out with `gc.get_referrers`, but it might not be at the level of `lazy` (the `ak.Array` and `dak.Array` are listed as objects that go away). Let me think about that...
Okay, setting up to follow this object with `gc.get_referrers`,
>>> import uproot
>>> import gc
>>> lazy = uproot.dask(
... {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> type(lazy)
<class 'dask_awkward.lib.core.Array'>
>>> type(lazy._meta)
<class 'awkward.highlevel.Array'>
I'll be looking at lists and one reference will be the list I'm using to look at it, so I make that a special class that's easier to ignore in a print-out of type names.
>>> class IgnoreMeList(list):
... pass
...
>>> def show(follow):
... print("\n".join(f"{i:2d} {type(x).__module__}.{type(x).__name__}" for i, x in enumerate(follow)))
...
In the Pympler output, we saw that the TypeTracerArrays were not deleted when `lazy` went out of scope (and `gc.collect()` was called). So this is a good starting point to walk outward and find out who's holding a reference to it.
>>> follow = IgnoreMeList([lazy._meta.layout.content("Muon_pt").content.data])
>>> show(follow)
0 awkward._nplikes.typetracer.TypeTracerArray
Now I'll just walk along the graph of its referrers, ignoring the `IgnoreMeList` because that's the `follow` list itself, and seeing what else is in the list.
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
0 __main__.IgnoreMeList
1 builtins.dict
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
0 __main__.IgnoreMeList
1 awkward.contents.numpyarray.NumpyArray
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
0 __main__.IgnoreMeList
1 builtins.dict
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
0 __main__.IgnoreMeList
1 awkward.contents.listoffsetarray.ListOffsetArray
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
0 __main__.IgnoreMeList
1 builtins.list
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
0 builtins.dict
1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
0 awkward.contents.recordarray.RecordArray
1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
0 builtins.dict
1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
0 awkward.highlevel.Array
1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
0 builtins.dict
1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
0 dask_awkward.lib.core.Array
1 __main__.IgnoreMeList
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> show(follow)
0 __main__.IgnoreMeList
1 builtins.dict
>>>
>>> follow = IgnoreMeList(gc.get_referrers(follow[1]))
>>> show(follow)
0 builtins.function
1 builtins.dict
2 __main__.IgnoreMeList
3 builtins.module
Okay! The dict and the module are just `__main__`:
>>> follow[3]
<module '__main__' (built-in)>
>>> follow[1].keys()
dict_keys(['use_main_ns', 'namespace', 'matches'])
>>> follow[1]["use_main_ns"]
1
>>> follow[1]["matches"]
['follow']
>>> type(follow[1]["namespace"])
<class 'dict'>
>>> follow[1]["namespace"].keys()
dict_keys(['__name__', '__doc__', '__package__', '__loader__', '__spec__', '__annotations__', '__builtins__', 'uproot', 'gc', 'lazy', 'IgnoreMeList', 'show', 'follow'])
So what about the function?
>>> follow[0]
<function show at 0x77a70b5889d0>
Nope. All of this is either Python's infrastructure or the infrastructure I set up in the `__main__` namespace.
So what gives? I didn't see any other referrers along the way. Who's holding a reference to this object? If nobody is, why isn't it deleted (why is it not negative in the Pympler list) when the `ak.Array` that holds it is deleted?
Does anyone have any ideas?
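For what it's worth, the manual walk above can be automated with a small helper (a sketch: it skips stack frames, drops its own bookkeeping reference between steps, and stops as soon as the chain of referrers ends or branches):

```python
import gc
import types

def walk_referrers(obj, max_depth=12):
    # Follow the chain of referrers outward from obj, recording each type
    # name, the way the repeated gc.get_referrers calls above do by hand.
    chain = [type(obj).__name__]
    current = obj
    for _ in range(max_depth):
        refs = [
            r for r in gc.get_referrers(current)
            if not isinstance(r, types.FrameType)  # ignore our own frames
        ]
        if len(refs) != 1:
            break  # chain ended or branched; stop here
        nxt = refs[0]
        del refs  # drop our reference before the next get_referrers call
        current = nxt
        chain.append(type(current).__name__)
    return chain
```

For example, a list held inside a dict held inside another list, all built inside a function so that module globals don't refer to them, reports the chain `['list', 'dict', 'list']`.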
Even more to the point, following the advice of https://stackoverflow.com/a/28406001/1623645
>>> import uproot
>>> import gc
>>> lazy = uproot.dask(
... {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> class IgnoreMeList(list):
... pass
...
>>> def show(follow):
... print("\n".join(f"{i:2d} {type(x).__module__}.{type(x).__name__}" for i, x in enumerate(follow)))
...
>>> follow = IgnoreMeList([lazy._meta.layout.content("Muon_pt").content.data])
>>> del lazy
>>> gc.collect()
17
>>> gc.collect()
0
>>> show(follow)
0 awkward._nplikes.typetracer.TypeTracerArray
>>> follow = IgnoreMeList(gc.get_referrers(follow[0]))
>>> gc.collect()
0
>>> show(follow)
0 __main__.IgnoreMeList
The TypeTracerArray goes away. I don't know why this disagrees with Pympler (and the fact that 30 MB of USS doesn't go away).
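As a sanity check independent of both Pympler and `gc.get_referrers`, a weak reference can confirm whether an object is truly collected (a generic sketch with a stand-in class, since not every type supports weak references):

```python
import gc
import weakref

class Probe:
    # stand-in for any object whose lifetime we want to track
    pass

obj = Probe()
alive = weakref.ref(obj)  # does not keep obj alive

del obj
gc.collect()

print(alive() is None)  # True: no hidden referrer kept the object alive
```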
It looks to me like this might be in dask. Add the following to the loop body:

```python
import gc
import dask.base

dask.base.function_cache.clear()
gc.collect()
```
I notice that the total memory usage remains fairly stable.
If it's referenced in `dask.base.function_cache`, I would have thought that `gc.get_referrers` would have shown us that. Also, clearing this function cache doesn't show the allocated data getting removed in Pympler:
>>> import gc
>>> import pympler.tracker
>>> import dask.base
>>> import uproot
>>>
>>> summary_tracker = pympler.tracker.SummaryTracker()
>>>
>>> lazy = uproot.dask(
... {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> del lazy
>>> gc.collect()
3
>>> gc.collect()
0
>>> summary_tracker.print_diff()
# ... several times ... #
>>> summary_tracker.print_diff()
types | # objects | total size
======= | =========== | ============
>>> lazy = uproot.dask(
... {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> summary_tracker.print_diff()
types | # objects | total size
============================================= | =========== | ============
dict | 72060 | 11.54 MB
bytes | 3 | 5.66 MB
list | 24007 | 1.73 MB
numpy.int64 | 37491 | 1.14 MB
uproot.source.cursor.Cursor | 21000 | 984.38 KB
numpy.ndarray | 4501 | 492.30 KB
str | 4886 | 436.10 KB
uproot.models.TObject.Model_TObject | 5999 | 281.20 KB
tuple | 3385 | 147.00 KB
uproot.models.TObjArray.Model_TObjArray | 3000 | 140.62 KB
uproot.models.TNamed.Model_TNamed | 2999 | 140.58 KB
frozenset | 1 | 128.21 KB
int | 3490 | 95.45 KB
awkward._nplikes.typetracer.TypeTracerArray | 1878 | 88.03 KB
uproot.models.TAtt.Model_TAttFill_v2 | 1500 | 70.31 KB
>>> del lazy
>>> dask.base.function_cache.clear() # clearing Dask's function cache
>>> gc.collect()
221296
>>> gc.collect()
0
>>> summary_tracker.print_diff()
types | # objects | total size
============================================== | =========== | ============
code | 0 | 37 B
awkward.highlevel.Array | -1 | -48 B
dask.blockwise.BlockwiseDepDict | -1 | -48 B
aiohttp.helpers.TimerContext | -1 | -48 B
awkward.contents.recordarray.RecordArray | -1 | -48 B
dask_awkward.layers.layers.AwkwardInputLayer | -1 | -48 B
dask.highlevelgraph.HighLevelGraph | -1 | -48 B
dask_awkward.lib.core.Array | -1 | -48 B
dask_awkward.utils.LazyInputsDict | -1 | -48 B
asyncio.trsock.TransportSocket | -2 | -80 B
asyncio.sslproto._SSLPipe | -2 | -96 B
fsspec.caching.BytesCache | -2 | -96 B
ssl.SSLObject | -2 | -96 B
uproot._dask.TrivialFormMappingInfo | -2 | -96 B
uproot.models.TTree.Model_TTree_v20 | -2 | -96 B
It's still the case that creating `lazy` adds 30 MB of TTree metadata objects, and deleting it and the functions from the function cache doesn't make them appear with a minus sign in Pympler.
Oh, but the total USS memory usage does go down:
```python
import gc
import psutil
import dask.base
import uproot

this_process = psutil.Process()

def memory_diff(task):
    gc.disable()
    gc.collect()
    start_memory = this_process.memory_full_info().uss
    task()
    gc.collect()
    stop_memory = this_process.memory_full_info().uss
    gc.enable()
    return stop_memory - start_memory

def task():
    lazy = uproot.dask(
        {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
    )
    del lazy
    dask.base.function_cache.clear()

for _ in range(200):
    print(f"{memory_diff(task) * 1e-6:.3f} MB")
```
results in
62.751 MB
18.416 MB
1.073 MB
-3.138 MB
2.105 MB
0.053 MB
-2.064 MB
2.077 MB
-1.032 MB
0.000 MB
2.109 MB
-4.170 MB
4.174 MB
-2.077 MB
-0.020 MB
0.020 MB
1.016 MB
-1.008 MB
-0.016 MB
0.016 MB
2.089 MB
-2.073 MB
2.085 MB
...
whereas removing the `function_cache.clear()` results in
79.806 MB
27.992 MB
25.338 MB
30.044 MB
21.045 MB
30.056 MB
23.114 MB
31.076 MB
24.187 MB
27.918 MB
24.207 MB
26.882 MB
26.259 MB
...
So that is what's holding all of the memory. It must be some connection that Python doesn't see—maybe it goes through a reference in an extension module? (Maybe it goes through a NumPy object array? numpy/numpy#6581)
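On the numpy/numpy#6581 hypothesis: object arrays have not always participated in the cyclic GC, so a reference held inside one may simply not show up in `gc.get_referrers`. A quick probe (whether the ndarray appears in the referrer list depends on the NumPy version; only the last line is guaranteed):

```python
import gc
import numpy as np

class Probe:
    pass

obj = Probe()
holder = np.empty(1, dtype=object)  # a C-level container for Python objects
holder[0] = obj

# Depending on the NumPy version, "ndarray" may be absent from this set,
# even though holder genuinely references obj (numpy/numpy#6581).
referrer_types = {type(r).__name__ for r in gc.get_referrers(obj)}
print("ndarray" in referrer_types)

print(holder[0] is obj)  # True: the reference is real either way
```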
Since this is a Dask feature, what do we want to do about it? @lgray, would it be sufficient to have Coffea clear the Dask function cache?
That'll certainly fix it for coffea.
One thing I noticed in `function_cache` is that it's holding the uncompressed pickle of the function, and the `dask.base.tokenize` for the function key changes with every `uproot.dask` open. That's probably why we don't see a connection to `lazy` above.
That's it then! That's why it costs memory, but can't be seen as objects of the expected types.
So, in the end, the recommendation for everyone is to check their Dask function cache. I'll convert this into a Discussion as a way for others to find this conclusion.
Actually, it could—possibly—be fixed in Uproot by replacing the TTree metadata data structure with a streamlined one containing only what is necessary to fetch arrays (mostly `fBasketSeek`, etc.).
>>> import sys
>>> import uproot
>>> tree = uproot.open(
... {"https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"}
... )
>>> minimal = {}
>>> for k, v in tree.items():
... minimal[k, "seek"] = v.member("fBasketSeek")[:v.num_baskets]
... minimal[k, "bytes"] = v.member("fBasketBytes")[:v.num_baskets]
... minimal[k, "entry"] = v.member("fBasketEntry")[:v.num_baskets + 1]
...
>>> sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in minimal.items()) / 1024**2
0.7204971313476562
i.e. something like 0.7 MiB for this file, but larger if it had more baskets. It's likely that I'm forgetting some other essential metadata, which would bring this figure up.
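A rough cross-check on that estimate with hypothetical stand-in arrays (the dtypes below—64-bit for the seek and entry arrays, 32-bit for the byte counts—are an assumption about the ROOT layout). Summing `ndarray.nbytes` counts the buffers directly, whereas `sys.getsizeof` does not follow references:

```python
import sys
import numpy as np

# Hypothetical stand-in for the `minimal` dict above: three small arrays
# per branch for a TTree with 1499 TBranches and few baskets per branch.
minimal = {}
for i in range(1499):
    name = f"branch_{i}"
    minimal[name, "seek"] = np.zeros(2, dtype=np.int64)
    minimal[name, "bytes"] = np.zeros(2, dtype=np.int32)
    minimal[name, "entry"] = np.zeros(3, dtype=np.int64)

array_mib = sum(v.nbytes for v in minimal.values()) / 1024**2
key_mib = sum(sys.getsizeof(k) for k in minimal.keys()) / 1024**2
print(array_mib + key_mib < 1.0)  # True: well under a MiB at this scale
```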
Reproducer: this particular instance leaks ~30 MB per open. This adds up very quickly if you need to extract the form of hundreds of files in a remote process, as is evident from https://github.com/CoffeaTeam/coffea/issues/1007, where this bug manifested pretty nastily.