scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License

Value too large for lazyarray #499

Closed bdrum closed 4 years ago

bdrum commented 4 years ago

Hi.

First of all, many thanks for the nice tool!

I get an error when I try to use a lazy array instead of a plain array.

This code

events = uproot.open(UPCFiles.ccup9_2015_win)['events']
data = events.lazyarrays("*") 
data["HasPointOnITSLayer0"]

gives me this error:

ValueError                                Traceback (most recent call last)
C:\Python38\lib\site-packages\IPython\core\formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

C:\Python38\lib\site-packages\IPython\lib\pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

C:\Python38\lib\site-packages\IPython\lib\pretty.py in _repr_pprint(obj, p, cycle)
    698     """A pprint that just redirects to the normal repr function."""
    699     # Find newlines and replace them with p.break_()
--> 700     output = repr(obj)
    701     lines = output.splitlines()
    702     with p.group():

C:\Python38\lib\site-packages\awkward\array\base.py in __repr__(self)
    109 
    110     def __repr__(self):
--> 111         return "<{0} {1} at 0x{2:012x}>".format(self.__class__.__name__, str(self), id(self))
    112 
    113     @property

C:\Python38\lib\site-packages\awkward\array\chunked.py in __str__(self)
    271     def __str__(self):
    272         if self.chunksizesknown:
--> 273             return super(ChunkedArray, self).__str__()
    274         else:
    275             strs = [self._util_arraystr(x) for x in self[:7].__iter__(checkiter=False)]

C:\Python38\lib\site-packages\awkward\array\base.py in __str__(self)
    102             if isinstance(first, AwkwardArray):
    103                 first = first.__iter__(checkiter=False)
--> 104             last = self[-3:]
    105             if isinstance(first, AwkwardArray):
    106                 last = last.__iter__(checkiter=False)

C:\Python38\lib\site-packages\awkward\array\chunked.py in __getitem__(self, where)
    426 
    427                 # add a sliced chunk
--> 428                 chunk = self._chunks[chunkid][(slice(local_start, local_stop, step),)]
    429                 if len(chunk) > 0:
    430                     chunk = chunk[(slice(None),) + tail]

C:\Python38\lib\site-packages\awkward\array\virtual.py in __getitem__(self, where)
    367 
    368     def __getitem__(self, where):
--> 369         return self.array[where]
    370 
    371     def __setitem__(self, where, what):

C:\Python38\lib\site-packages\awkward\array\virtual.py in array(self)
    293         if self._array is None:
    294             # states (1) and (3)
--> 295             return self.materialize()
    296 
    297         elif self._cache is None:

C:\Python38\lib\site-packages\awkward\array\virtual.py in materialize(self)
    324 
    325     def materialize(self):
--> 326         array = self._util_toarray(self._generator(*self._args, **self._kwargs), self.DEFAULTTYPE)
    327         if self._setitem is not None:
    328             for n, x in self._setitem.items():

C:\Python38\lib\site-packages\uproot\tree.py in __call__(self, branch, entrystart, entrystop)
   1916 
   1917     def __call__(self, branch, entrystart, entrystop):
-> 1918         return self.tree[branch].array(interpretation=self.interpretation[branch], entrystart=entrystart, entrystop=entrystop, flatten=self.flatten, awkwardlib=self.awkwardlib, cache=None, basketcache=self.basketcache, keycache=self.keycache, executor=self.executor)
   1919 
   1920 class _LazyBranch(object):

C:\Python38\lib\site-packages\uproot\tree.py in array(self, interpretation, entrystart, entrystop, flatten, awkwardlib, cache, basketcache, keycache, executor, blocking)
   1432         if executor is None:
   1433             for j in range(basketstop - basketstart):
-> 1434                 _delayedraise(fill(j))
   1435             excinfos = ()
   1436         else:

C:\Python38\lib\site-packages\uproot\tree.py in _delayedraise(excinfo)
     57             exec("raise cls, err, trc")
     58         else:
---> 59             raise err.with_traceback(trc)
     60 
     61 def _filename_explode(x):

C:\Python38\lib\site-packages\uproot\tree.py in fill(j)
   1400                 i = j + basketstart
   1401                 local_entrystart, local_entrystop = self._localentries(i, entrystart, entrystop)
-> 1402                 source = self._basket(i, interpretation, local_entrystart, local_entrystop, awkward, basketcache, keycache)
   1403 
   1404                 expecteditems = basket_itemoffset[j + 1] - basket_itemoffset[j]

C:\Python38\lib\site-packages\uproot\tree.py in _basket(self, i, interpretation, local_entrystart, local_entrystop, awkward, basketcache, keycache)
   1186 
   1187         if basketcache is not None:
-> 1188             basketcache[basketcachekey] = basketdata
   1189 
   1190         if key._fObjlen == key.border:

C:\Python38\lib\site-packages\uproot\cache.py in __setitem__(self, where, what)
     65     def __setitem__(self, where, what):
     66         with self._lock:
---> 67             self._cache[where] = what
     68 
     69     def __delitem__(self, where):

C:\Python38\lib\site-packages\cachetools\lru.py in __setitem__(self, key, value, cache_setitem)
     17 
     18     def __setitem__(self, key, value, cache_setitem=Cache.__setitem__):
---> 19         cache_setitem(self, key, value)
     20         self.__update(key)
     21 

C:\Python38\lib\site-packages\cachetools\cache.py in __setitem__(self, key, value)
     45         size = self.getsizeof(value)
     46         if size > maxsize:
---> 47             raise ValueError('value too large')
     48         if key not in self.__data or self.__size[key] < size:
     49             while self.__currsize + size > maxsize:

ValueError: value too large

I've found that it works if I add an index, like this:

data["HasPointOnITSLayer0"][0]
array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    1, 1717, 1717, 1717, 1717, 1717,
       1717, 1717, 1717, 1717, 1717, 1717, 1717, 1717, 1717, 1717, 1717,
       1717, 1717, 1717,  ....

That works, but I would like to work without explicit indexes, NumPy-style, using a boolean mask. Something like this:

pt = data["Pt"][data["HasPointOnITSLayer0"] == 1]

For reference, I have 1,135,259 events, and this branch (HasPointOnITSLayer0) has the interpretation asdtype("('>i4', (177,))").
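As a rough back-of-the-envelope check, the numbers quoted above give the uncompressed size of this one branch. This is only a sketch: the variable names are illustrative, and how the branch is split into baskets (which is what the cache actually stores) depends on the file and is not stated in this issue.

```python
# Figures taken from the issue text above.
n_events = 1_135_259      # number of entries in the tree
values_per_event = 177    # from the dtype ('>i4', (177,))
bytes_per_value = 4       # '>i4' is a 4-byte big-endian integer

# Uncompressed size of the whole branch, not of a single basket.
total_bytes = n_events * values_per_event * bytes_per_value
total_mib = total_bytes / 1024**2

print(f"{total_bytes} bytes = {total_mib:.1f} MiB")
# prints: 803763372 bytes = 766.5 MiB
```

So this single branch alone is several hundred MiB once fully materialized.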

I will also use this issue to ask a question about performance:

This tree works fine without any special features (e.g. dask, lazy arrays, and so on), but I have another one that is 8 times larger.

This is why I started using lazy arrays, but perhaps someone could advise me on a best-practice scheme for working with such data volumes, because plain arrays

  1. take all my memory
  2. are slow. E.g. with ROOT's RDataFrame I am able to process 10 GB files in 1 minute, but I don't have a background in Python and, as I understand it, I would have to use the multiprocessing module.

tamasgal commented 4 years ago

The default cache is too small in your case. You can define your own cache and pass it to lazyarrays() like this (in this example, a 23 MiB basket cache):

data = events.lazyarrays("*", basketcache=uproot.cache.ThreadSafeArrayCache(23 * 1024**2))

jpivarski commented 4 years ago

That's right—this error message comes from the cachetools library, which we don't control, and it's a pretty mysterious message. (The cache is too small to add this one new item to it. In cases like that, it should probably accept that the cache will exceed the limit or just not add the item.) This is getting fixed.
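To illustrate the behavior described here, below is a minimal standard-library sketch of a size-bounded LRU cache. It is not the cachetools or uproot source: `TinyLRUCache` and its `strict` flag are hypothetical names invented for this example. With `strict=True` it rejects an item larger than the whole cache, as in the traceback above; with `strict=False` it simply does not cache such an item, which is one of the two alternatives mentioned.

```python
from collections import OrderedDict


class TinyLRUCache:
    """Illustrative size-bounded LRU cache (hypothetical, not cachetools)."""

    def __init__(self, maxsize, getsizeof=len, strict=True):
        self.maxsize = maxsize
        self.getsizeof = getsizeof
        self.strict = strict  # True: raise on oversized items; False: skip them
        self._data = OrderedDict()
        self._sizes = {}
        self.currsize = 0

    def __setitem__(self, key, value):
        size = self.getsizeof(value)
        if size > self.maxsize:
            if self.strict:
                raise ValueError("value too large")
            return  # lenient mode: do not cache the item at all
        if key in self._data:
            # Replacing an entry: release its old size first.
            self.currsize -= self._sizes.pop(key)
            del self._data[key]
        # Evict least recently used entries until the new item fits.
        while self.currsize + size > self.maxsize:
            old_key, _ = self._data.popitem(last=False)
            self.currsize -= self._sizes.pop(old_key)
        self._data[key] = value
        self._sizes[key] = size
        self.currsize += size

    def __getitem__(self, key):
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def __contains__(self, key):
        return key in self._data


basket = bytes(2 * 1024**2)  # stand-in for one 2 MiB basket

small = TinyLRUCache(maxsize=1024**2)  # 1 MiB cache, smaller than the basket
try:
    small["basket-0"] = basket
    error = None
except ValueError as err:
    error = str(err)

lenient = TinyLRUCache(maxsize=1024**2, strict=False)
lenient["basket-0"] = basket  # silently skipped, no exception

large = TinyLRUCache(maxsize=23 * 1024**2)  # 23 MiB cache
large["basket-0"] = basket  # fits
```

In this sketch, the strict 1 MiB cache raises the same message as in the traceback, while the 23 MiB cache accepts the basket, which mirrors why passing a larger basketcache makes the error go away.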

bdrum commented 4 years ago

Thank you! It works!

Actually, I had tried

mycache = uproot.cache.ArrayCache(100 * 1024 * 1024)  
itsl0 = events.lazyarray("HasPointOnITSLayer0", cache=mycache) 

but the result was the same...

Okay, it seems I have to study the full documentation instead of code snippets 😅