pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Example of High Memory Usage of MultiIndex #13904

Closed PeterKucirek closed 7 years ago

PeterKucirek commented 8 years ago

This is a dupe of #1752, in which @jreback recommended a new issue be opened. I have a (hopefully) reproducible example which shows how using a MultiIndex can explode RAM usage.

At the top I define a function to get the memory of the current Python process. This will be called as I create arrays to get the "true" space of each object.

import numpy as np
import pandas as pd
import os
import psutil
import numba as nb

def print_mem():
    # Report the resident memory (RSS) of the current process in MB.
    proc = psutil.Process(os.getpid())
    print proc.memory_info()[0] / float(2**20), "MB"
print_mem()
>> 74.20703125 MB

I need some functions to construct a plausible MultiIndex which roughly matches the actual data I'm using.

@nb.jit(nopython=True, nogil=True)
def nbf_number_new_level(repetitions):
    # For each parent row j, emit the sequence 0 .. repetitions[j]-1;
    # concatenated, these number the new innermost level.
    assert len(repetitions.shape) == 1

    n = repetitions.sum()
    ret = np.zeros(shape=n, dtype=np.uint64)

    i = 0
    for j in range(len(repetitions)):
        n_repetitions = repetitions[j]
        c = 0
        for k in range(n_repetitions):
            ret[i] = c
            c += 1; i += 1
    return ret

def expand_index(idx, repetitions, new_name=None):
    # Repeat each label of `idx` and append a new innermost level
    # numbering the repetitions.
    assert len(idx) == len(repetitions), "Index and repetitions must be the same length"
    repetitions = np.array(repetitions)
    assert np.all(repetitions > 0), "All repetitions must be > 0"

    names = list(idx.names) if isinstance(idx, pd.MultiIndex) else [idx.name]
    names.append(new_name)

    new_level_value = nbf_number_new_level(repetitions)

    if isinstance(idx, pd.MultiIndex):
        value_arrays = [idx.get_level_values(i).repeat(repetitions) for i in range(idx.nlevels)]
    else:
        value_arrays = [idx.values.repeat(repetitions)]

    value_arrays.append(new_level_value)

    ret = pd.MultiIndex.from_arrays(value_arrays)
    ret.names = names
    return ret

Constructing the actual MultiIndex:

randomizer = np.random.RandomState(12345)

nhouseholds = 200000
hhids = pd.Index(np.arange(nhouseholds), name='hhid')
print nhouseholds, "households"
print_mem()
>>200000 households
>>83.95703125 MB

persons_per_hh = randomizer.randint(1, 9, size=nhouseholds)
person_ids = expand_index(hhids, persons_per_hh, new_name='pid')
print len(person_ids), "persons"
print_mem()
>>898254 persons
>>105.6953125 MB

tours_per_person = randomizer.choice([1, 2, 3], size=len(person_ids), replace=True, p=[0.85, 0.1, 0.05])
tour_ids = expand_index(person_ids, tours_per_person, new_name='tour_id')
print len(tour_ids), "tours"
print_mem()
>>1078155 tours
>>121.734375 MB

trips_per_tour = randomizer.choice([2,3,4,5], size=len(tour_ids), replace=True, p=[0.6, 0.25, 0.1, 0.05])
trip_ids = expand_index(tour_ids, trips_per_tour, new_name='trip_id')
print len(trip_ids), "trips"
print_mem()
>>2803731 trips
>>150.17578125 MB

trips = pd.DataFrame(index=trip_ids)
trips['origin'] = randomizer.randint(101, 200, size=len(trips))
trips['destination'] = randomizer.randint(101, 200, size=len(trips))
print_mem()
>>171.70703125 MB

Based on this, my fake trip table takes up about 171.7 - 121.7 = 50 MB of RAM. I haven't done any fancy indexing, just initialized the table. But when I call trips.info() (which appears to instantiate the array of PyTuples), this happens:

print trips.info()
print_mem()
>><class 'pandas.core.frame.DataFrame'>
MultiIndex: 2803731 entries, (0, 0, 0, 0) to (199999, 0, 0, 1)
Data columns (total 2 columns):
origin         int32
destination    int32
dtypes: int32(2)
memory usage: 42.8+ MB
None
723.64453125 MB

My process's memory usage balloons to 723 MB! Doing the math, the cached indexer takes up 723.6 - 171.7 = 552 MB, more than ten times the size of the actual DataFrame!
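A back-of-envelope estimate is consistent with that. Assuming CPython 2.7 on 64-bit (where sys.getsizeof reports 56 bytes for a 4-tuple and 24 bytes for a boxed int), each row costs roughly one tuple, four boxed ints, and one pointer slot in the object array:

n_rows = 2803731
bytes_per_row = 56 + 4 * 24 + 8  # tuple + 4 boxed ints + object-array slot
print n_rows * bytes_per_row / float(2**20), "MB"
# -> ~428 MB; allocator overhead plausibly accounts for the rest of the observed ~552 MB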

For this fake dataset this is not so much of a problem, but my production data is 20x the size, and I soak up 27 GB of RAM when I so much as look at my trips table.

Any performance tips would be appreciated; the only thing I can think of right now is "don't use a MultiIndex". It sounds like a 'proper' fix for this lies deep in the indexing internals, which is far above my pay grade. But I wanted to at least log this issue with the community.
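One low-tech way to follow that tip (a sketch, not an official recommendation): keep the hierarchy as ordinary integer columns on the default integer index, and call set_index() only for the few operations that genuinely need label alignment.

trips_flat = trips.reset_index()  # hhid/pid/tour_id/trip_id become plain int columns
trips_flat.groupby('tour_id')[['origin', 'destination']].sum()  # no tuples boxed
subset = trips_flat[(trips_flat.hhid == 0) & (trips_flat.pid == 1)]  # boolean masks instead of .loc slices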

Output of pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.4
pip: 8.1.1
setuptools: 0.6c11
Cython: 0.22
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.1.1
numexpr: 2.5.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
jinja2: 2.7.3
boto: 2.36.0
pandas_datareader: None

PeterKucirek commented 8 years ago

I don't know much about the indexing internals, but is it possible to design a "fast-path" MultiIndex when, say, level arrays are all integers? And fall back on the original behaviour when the MultiIndex gets more complicated?

chris-b1 commented 8 years ago

The problem here is that an array of tuples is created and cached by some operations, but the tuples aren't needed for many things you'd do with a MultiIndex. So an ugly hack to prevent this from happening is:

def fail(self):
    raise RuntimeError("Not allowing boxed MultiIndex")

# Replace the .values property so that any attempt to materialize the
# tuple array raises instead of silently allocating hundreds of MB.
pd.MultiIndex.values = property(fail)

Then, you can still do most things with the MultiIndex, but if you try something that would materialize the tuples, you'll get an error. Not really recommended ... but kind of helps, or at least will let you know what step in your program is creating the tuples.

In [9]: trips.head()
Out[9]: 
                          origin  destination
hhid pid tour_id trip_id                     
0    0   0       0           107          122
                 1           159          113
                 2           115          152
     1   0       0           178          188
                 1           195          184

In [10]: idx = pd.IndexSlice

In [11]: trips.loc[idx[0, 1, :, 2], :]
Out[11]: 
                          origin  destination
hhid pid tour_id trip_id                     
0    1   0       2           185          178

In [12]: trips.groupby(level='tour_id').sum()
Out[12]: 
            origin  destination
tour_id                        
0        350398162    350323339
1         52599924     52592203
2         17548605     17554048

In [13]: trips.info()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)

RuntimeError: Not allowing boxed MultiIndex
jreback commented 8 years ago

can a simpler example be created for this? (simpler in that it has no numba)

jreback commented 8 years ago

yes, you don't ever actually materialize the tuples unless .values is called (at which point they are cached; a MI is immutable so this is feasible)
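A minimal way to see this with the print_mem helper from the top of the thread:

print_mem()
_ = trips.index.values  # boxes one PyTuple per row and caches the array on the index
print_mem()             # jumps by hundreds of MB and stays there, because the result is cached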

jreback commented 8 years ago

the act of printing a frame is probably the culprit here

somewhere .values is called on the index when printing (and it's probably then sliced); this materializes it.

jreback commented 8 years ago

Actually one could try disabling the caching of .values altogether (or maybe only cache below a certain length, in order to not sacrifice perf on 'small/medium' sized things).

and see if perf suite suffers.
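A rough sketch of that idea (hypothetical names, not actual pandas internals; a real patch would live on the MultiIndex class itself):

_CACHE_MAX_LEN = 10 ** 6  # assumed cutoff; would need tuning against the perf suite

def _box_tuples(mi):
    # The expensive step: build the object array holding one PyTuple per row.
    out = np.empty(len(mi), dtype=object)
    for i, tup in enumerate(mi):
        out[i] = tup
    return out

def get_values(mi, _cache={}):
    # Small indexes keep the boxed array cached; large ones pay the
    # construction cost on every call but never pin the memory.
    if len(mi) <= _CACHE_MAX_LEN:
        key = id(mi)
        if key not in _cache:
            _cache[key] = _box_tuples(mi)
        return _cache[key]
    return _box_tuples(mi)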

PeterKucirek commented 8 years ago

@jreback If the dependence on numba is a problem, the @nb.jit() decorator can be removed and it should work fine as a regular Python function, albeit a lot slower.

@chris-b1 I tried your trick, but unfortunately it just breaks all sorts of things.

For starters, I cannot even reproduce your index-slice (it makes a call to values). A loc[] lookup breaks if a column indexer is not specified. A call to is_monotonic_increasing also breaks.

EDIT: Only is_monotonic_increasing breaks.

chris-b1 commented 8 years ago

I'm on 0.18.1 as well - the slice I wrote doesn't call .values, but if you slice with a tuple it will.

is_monotonic_increasing feels like it could/should be implemented without .values, but you're right, it does fail.
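For what it's worth, here is a sketch of how it could be computed from the integer labels alone, without boxing any tuples. It assumes each level's values are sorted, so that label order matches value order (labels was the 0.18-era attribute name; it was later renamed codes):

def mi_is_monotonic_increasing(mi):
    undecided = np.ones(max(len(mi) - 1, 0), dtype=bool)  # adjacent pairs still tied
    for level_codes in mi.labels:
        diffs = np.diff(np.asarray(level_codes))
        if (diffs[undecided] < 0).any():
            return False              # first differing level decreases somewhere
        undecided &= (diffs == 0)     # tied pairs defer to the next level
        if not undecided.any():
            return True               # every pair already decided as increasing
    return True                       # rows tied on all levels: non-strictly increasing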

PeterKucirek commented 8 years ago

I've also found that values gets called when using MultiIndex.equals(), which is another method that I feel could be implemented without the tuples.

chris-b1 commented 8 years ago

Yeah, this is actually more problematic than I realized - it's the same problem noted in the original issue, that the tuples are what's being used for the underlying hash table. It's just that some ops, like groupby and slicing, are smart enough to not need it.

PeterKucirek commented 8 years ago

It looks to me that, because of this problem, MultiIndex is fundamentally broken for large arrays; it just takes up too much RAM to be useful in production code. Either MultiIndex needs to be refactored so that PyTuples aren't used for values, OR a number of methods need to be refactored to not call values at all. The former is more 'pure', and it sounds like @njsmith was looking into it when the original issue was raised (and @wesm agreed).

I'm under a time crunch, so the most I can do for the moment is refactor my code to drop as many MultiIndexes as possible (of which I am using many). But I'm interested in learning about the indexing internals, so I may revisit this.

chris-b1 commented 8 years ago

I agree that the underlying implementation just doesn't scale, and probably was always intended to be replaced.

I'm sure a well-thought-out PR fixing the internals would be accepted, but given the discussion around internals refactoring, this might be the type of thing that is punted to "pandas 2.0"

jreback commented 8 years ago

actually this could / should be done independently of pandas 2.0. This is clearly an area that could be improved, and the API is not touched.

so @PeterKucirek @chris-b1, improvements here are welcome in the current environment.

PeterKucirek commented 8 years ago

The previous thread contains a link to a stale branch which appears to implement a fix, so anyone working on this may not have to start from zero.

wesm commented 8 years ago

Totally with you on this. The complicated part is hashing, so we would need to devise a custom hash table implementation that does not rely on materialized PyTuple objects.

jreback commented 7 years ago

so @PeterKucirek if you are interested in trying this out (still a WIP):

https://github.com/pandas-dev/pandas/compare/master...jreback:mi

this uses the new-ish hashing algos in pandas.tools.hashing to hash the values of the MI directly as uint64 (allowing the hash table to be built entirely on a compact representation and without the GIL), just like the other hash tables.
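A rough illustration of the idea through the public API (at the time it lived in pandas.tools.hashing; modern pandas exposes it as pd.util.hash_pandas_object): each row of the MultiIndex collapses to a single uint64, so the hash table works on a flat integer array instead of an object array of PyTuples.

mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(10), np.arange(3)])
row_hashes = pd.util.hash_pandas_object(mi)  # uint64 Series, one hash per row
print row_hashes.dtype, len(row_hashes)      # 8 bytes per row, no PyTuples anywhere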

jreback commented 7 years ago

Here's my simple test (all done in separate sessions). On 0.19.2:

In [1]: i = MultiIndex.from_product([np.arange(1000), np.arange(10), np.arange(10), np.arange(10), np.arange(3)])

In [3]: len(i)
Out[3]: 3000000

In [2]: %time i.get_loc((0,0,1,0,1))
CPU times: user 2.04 s, sys: 238 ms, total: 2.27 s
Wall time: 2.27 s
Out[2]: 31

In [2]: %memit i.get_loc((0,0,1,0,1))
peak memory: 583.14 MiB, increment: 486.66 MiB

With this PR:

In [2]: %time i.get_loc((0,0,1,0,1))
CPU times: user 632 ms, sys: 286 ms, total: 918 ms
Wall time: 924 ms
Out[2]: 31

In [3]: %memit i.get_loc((0,0,1,0,1))
peak memory: 423.96 MiB, increment: 318.61 MiB
PeterKucirek commented 7 years ago

Ok, I will try it out and report back.

mahsa-ebrahimian commented 4 years ago

@jreback @PeterKucirek I am having the same problem of RAM explosion when using a MultiIndex. I followed your discussion, but I can't find out how you finally solved the issue. Could you kindly let me know?

This is my pull request address: mahsa1991ebrahimian-netflix-OutOfMemory

jreback commented 4 years ago

you would have to show a reproducible example - pls open a new issue

PeterKucirek commented 4 years ago

> @jreback @PeterKucirek I am having the same problem of RAM explosion when using a MultiIndex. I followed your discussion, but I can't find out how you finally solved the issue. Could you kindly let me know?
>
> This is my pull request address: mahsa1991ebrahimian-netflix-OutOfMemory

I switched to a newer version of pandas, which is lighter on memory for large MultiIndexes. I don't recall the specific version, but I'm mostly using 0.23 for work, and that version definitely has better performance.

mahsa-ebrahimian commented 4 years ago

@jreback here is the new issue I created; could you please have a look: https://github.com/mahsa-ebrahimian/netflix_project/issues/1#issue-680536454

jreback commented 4 years ago

on the pandas tracker

and it should be reproducible with code

mahsa-ebrahimian commented 4 years ago

> on the pandas tracker
>
> and it should be reproducible with code

Sorry, but I don't get your point about the pandas tracker; I am a bit new to GitHub. Could you please clarify what you mean by "on the pandas tracker", and how is my code not reproducible?

jreback commented 4 years ago

this is the pandas tracker - open a new issue here

the post should have an example that can be copy-pasted to reproduce

mahsa-ebrahimian commented 4 years ago

> this is the pandas tracker - open a new issue here
>
> the post should have an example that can be copy-pasted to reproduce

Thank you, it is done now: https://github.com/pandas-dev/pandas/issues/36074#issue-691203711