Closed. PeterKucirek closed this issue 7 years ago.
I don't know much about the indexing internals, but is it possible to design a "fast-path" MultiIndex when, say, level arrays are all integers? And fall back on the original behaviour when the MultiIndex gets more complicated?
The problem here is that an array of tuples is created and cached by some operations, but the tuples aren't needed for many things you'd do with a MultiIndex. So an ugly hack you can do to prevent this from happening is:
def fail(self):
    raise RuntimeError("Not allowing boxed MultiIndex")

pd.MultiIndex.values = property(fail)
Then you can still do most things with the MultiIndex, but if you try something that would materialize the tuples, you'll get an error. Not really recommended, but it kind of helps, or at least tells you which step in your program is creating the tuples.
In [9]: trips.head()
Out[9]:
                          origin  destination
hhid pid tour_id trip_id
0    0   0       0           107          122
                 1           159          113
                 2           115          152
     1   0       0           178          188
                 1           195          184
In [10]: idx = pd.IndexSlice
In [11]: trips.loc[idx[0, 1, :, 2], :]
Out[11]:
                          origin  destination
hhid pid tour_id trip_id
0    1   0       2           185          178
In [12]: trips.groupby(level='tour_id').sum()
Out[12]:
origin destination
tour_id
0 350398162 350323339
1 52599924 52592203
2 17548605 17554048
In [13]: trips.info()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
RuntimeError: Not allowing boxed MultiIndex
can a simpler example be created for this? (simpler in that it doesn't need numba)
yes, you don't ever actually materialize the tuples unless .values is called (at which point they are cached; a MI is immutable so this is feasible). The act of printing a frame is probably the culprit here: somewhere .values is called on the index when printing (and it's probably then sliced), which materializes it. Actually, one could try disabling the caching of .values altogether (or maybe only caching below some length threshold, to avoid sacrificing perf on small/medium-sized indexes) and see if the perf suite suffers.
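As a rough illustration of the no-caching idea, here is a sketch (my own, not pandas internals) of a values accessor that rebuilds the tuple array on every call instead of caching it:

```python
import numpy as np
import pandas as pd

# Sketch only: rebuild the object array of tuples from the per-level label
# arrays on every call, instead of caching it on the MultiIndex. Trades CPU
# for memory; this is not how pandas actually implements .values.
def uncached_mi_values(mi: pd.MultiIndex) -> np.ndarray:
    levels = [mi.get_level_values(i) for i in range(mi.nlevels)]
    out = np.empty(len(mi), dtype=object)
    for j, tup in enumerate(zip(*levels)):
        out[j] = tup  # one PyTuple per row, but nothing is cached
    return out
```

One could then experiment with `pd.MultiIndex.values = property(uncached_mi_values)` and run the perf suite to measure the cost.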
@jreback If dependence on Numba is a problem, then the @nb.jit() decorator can be removed and it should work fine as a regular Python function, albeit a lot slower.
@chris-b1 I tried your trick, but unfortunately it just breaks all sorts of things. For starters, I cannot even reproduce your index-slice (it makes a call to values). A loc[] lookup breaks if a column indexer is not specified. A call to is_monotonic_increasing also breaks.
EDIT: Only is_monotonic_increasing breaks.
I'm on 0.18.1 as well - the slice I wrote doesn't call .values, but if you slice with a tuple it will. is_monotonic_increasing feels like it could/should be implemented without .values, but you're right, it does fail.
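For what it's worth, a monotonicity check that never touches .values seems doable. A hedged sketch using only the integer codes (assuming each level's values are stored sorted, and using the modern .codes attribute, which older versions exposed as .labels):

```python
import numpy as np
import pandas as pd

def mi_is_monotonic_increasing(mi: pd.MultiIndex) -> bool:
    # Assumes each level's values are stored sorted, so lexicographic order
    # of the integer codes matches lexicographic order of the labels.
    codes = [np.asarray(c) for c in mi.codes]
    # np.lexsort treats the LAST key as primary, so reverse the level order
    order = np.lexsort(codes[::-1])
    # lexsort is stable, so the identity permutation comes back iff the
    # rows were already in non-decreasing order
    return bool((order == np.arange(len(mi))).all())
```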
I've also found that values gets called when using MultiIndex.equals(), which is another method that I feel could be implemented without the tuples.
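An equals() that compares level by level instead of tuple by tuple does seem feasible. A rough sketch (mine, not the pandas implementation):

```python
import pandas as pd

def mi_equals_no_tuples(a: pd.MultiIndex, b: pd.MultiIndex) -> bool:
    # Compare two MultiIndexes without materializing row tuples: each
    # get_level_values call yields a flat Index of labels for one level.
    if len(a) != len(b) or a.nlevels != b.nlevels:
        return False
    return all(
        a.get_level_values(i).equals(b.get_level_values(i))
        for i in range(a.nlevels)
    )
```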
Yeah, this is actually more problematic than I realized - it's the same problem noted in the original issue: the tuples are what's being used as keys in the underlying hash table - it's just that some ops, like groupby and slicing, are smart enough not to need them.
It looks to me that, because of this problem, MultiIndex is fundamentally broken for large arrays: the tuples just take up too much RAM to be useful in production code. Either MultiIndex needs to be refactored so that PyTuples aren't used for values, OR a number of methods need to be refactored to not call values at all. The former is more 'pure', and it sounds like @njsmith was looking into it when the original issue was raised (and @wesm agreed).
I'm under a time crunch, so the most I can do for the moment is refactor my code to drop as many MultiIndexes as possible (of which I am using many). But I'm interested in learning about the indexing internals, so I may revisit this.
I agree that the underlying implementation just doesn't scale, and probably was always intended to be replaced.
I'm sure a well-thought-out PR fixing the internals would be accepted, but given the discussion around internals refactoring, this might be the type of thing that is punted to "pandas 2.0".
actually this could / should be done independently of pandas 2.0. This is clearly an area that could use improvement, but the API is not touched.
so @PeterKucirek @chris-b1 welcome improvements here in the current environment.
The previous thread contains a link to a stale branch which looks to implement a fix. So anyone working on this may not have to start from zero.
Totally with you on this. The complicated part is hashing, so we would need to devise a custom hash table implementation that does not rely on materialized PyTuple objects
so @PeterKucirek if you are interested in trying this out (still a WIP):
https://github.com/pandas-dev/pandas/compare/master...jreback:mi
this uses the new-ish hashing algos in pandas.tools.hashing to hash the values of the MI directly as uint64 (allowing the hashtable to be entirely done in a compact repr and w/o the GIL), just like the other hashtables.
.values can still be called (to instantiate tuples); it just isn't called when populating the hash table (it can still be called elsewhere, which is still problematic, but one step at a time).
Here's my simple test (all done in separate sessions):
0.19.2
In [1]: i = MultiIndex.from_product([np.arange(1000), np.arange(10), np.arange(10), np.arange(10), np.arange(3)])
In [3]: len(i)
Out[3]: 3000000
In [2]: %time i.get_loc((0,0,1,0,1))
CPU times: user 2.04 s, sys: 238 ms, total: 2.27 s
Wall time: 2.27 s
Out[2]: 31
In [2]: %memit i.get_loc((0,0,1,0,1))
peak memory: 583.14 MiB, increment: 486.66 MiB
This PR
In [2]: %time i.get_loc((0,0,1,0,1))
CPU times: user 632 ms, sys: 286 ms, total: 918 ms
Wall time: 924 ms
Out[2]: 31
In [3]: %memit i.get_loc((0,0,1,0,1))
peak memory: 423.96 MiB, increment: 318.61 MiB
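For readers on newer pandas, the row-hashing idea can be sketched with the public pd.util.hash_pandas_object helper. The per-level combiner below is made up for illustration; it is not the branch's actual algorithm:

```python
import numpy as np
import pandas as pd

def hash_mi_rows(mi: pd.MultiIndex) -> np.ndarray:
    # Hash each level's repeated label array to uint64 and mix the levels
    # together, so row keys never exist as PyTuples.
    out = np.zeros(len(mi), dtype=np.uint64)
    for i in range(mi.nlevels):
        level_vals = pd.Series(mi.get_level_values(i))
        h = pd.util.hash_pandas_object(level_vals, index=False).to_numpy()
        out = (out * np.uint64(1000003)) ^ h  # simple, illustrative mix
    return out
```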
Ok, I will try it out and report back.
@jreback @PeterKucirek I am having the same problem of RAM explosion when using a MultiIndex. I followed your discussion, but I can't find out how you finally solved the issue. Could you kindly let me know?
This is my pull request address: mahsa1991ebrahimian-netflix-OutOfMemory
you would have to show a reproducible example - pls open a new issue
I switched to a newer version of Pandas, which is lighter on memory for large MultiIndexes. I don't recall the specific version, but I'm mostly using 0.23 for work, and that version definitely has better performance.
@jreback here is the new issue I created, could you please have a look https://github.com/mahsa-ebrahimian/netflix_project/issues/1#issue-680536454
on the pandas tracker, and it should be reproducible with code
Sorry, but I don't get your point about the pandas tracker; I am a bit new to GitHub. Could you please clarify what you mean by "on the pandas tracker"? And how is my code not reproducible?
this is the pandas tracker: open a new issue. The post should have an example that can be copy-pasted to reproduce.
Thank you, it is done now: https://github.com/pandas-dev/pandas/issues/36074#issue-691203711
This is a dupe of #1752, in which @jreback recommended a new issue be opened. I have a (hopefully) reproducible example which shows how using a MultiIndex can explode RAM usage.
At the top I define a function to get the memory of the current Python process. This will be called as I create arrays to get the "true" space of each object.
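The original snippet didn't survive the copy, but it was presumably something like this (the psutil usage is my guess):

```python
import os

def process_memory_mb() -> float:
    # Resident memory of the current process, in MB. Uses psutil if
    # available, otherwise the Unix resource module (ru_maxrss is KB
    # on Linux, bytes on macOS).
    try:
        import psutil
        return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    except ImportError:
        import resource
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
```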
I need some functions to construct a plausible MultiIndex which matches roughly the actual data I'm using.
Constructing the actual MultiIndex:
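The code blocks were lost in this copy, so here is an illustrative reconstruction; the level sizes, names, and value ranges are invented to mimic the printed output above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Guessed level sizes: 10,000 households x 3 persons x 4 tours x 5 trips
idx = pd.MultiIndex.from_product(
    [np.arange(10_000), np.arange(3), np.arange(4), np.arange(5)],
    names=["hhid", "pid", "tour_id", "trip_id"],
)
trips = pd.DataFrame(
    {
        "origin": rng.integers(100, 200, size=len(idx)),
        "destination": rng.integers(100, 200, size=len(idx)),
    },
    index=idx,
)
```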
Based on this, my fake trip table is taking up about 171.7 - 121.7 = 50 MB of RAM. I haven't done any fancy indexing, just initialized the table. But when I call trips.info() (which appears to instantiate the array of PyTuples), my process's memory usage balloons to 723 MB! Doing the math, the cached indexer takes up 723.6 - 171.7 = 551 MB, a tenfold increase over the actual DataFrame!
For this fake dataset, this is not so much of a problem, but my production code is 20x the size and I soak up 27 GB of RAM when I as much as look at my trips table.
Any performance tips would be appreciated; the only thing I can think of right now is "don't use a MultiIndex". It sounds like a 'proper' fix for this lies deep in the indexing internals which is far above my pay grade. But I wanted to at least log this issue with the community.
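In the "don't use a MultiIndex" spirit, one workaround is to keep the keys as ordinary columns and filter with boolean masks, which never builds tuple keys (the data below is made up):

```python
import pandas as pd

trips = pd.DataFrame({
    "hhid":    [0, 0, 0, 1, 1],
    "pid":     [0, 0, 1, 0, 0],
    "trip_id": [0, 1, 2, 0, 1],
    "origin":  [107, 159, 185, 178, 195],
})
# rough flat-column equivalent of the earlier idx[0, 1, :, 2] slice
row = trips[(trips.hhid == 0) & (trips.pid == 1) & (trips.trip_id == 2)]
```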
Output of pd.show_versions():