pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Create limitation on HDF5 queries such that users do not go over RAM #14399

Closed: evanbiederstedt closed this issue 3 years ago

evanbiederstedt commented 7 years ago

A small, complete example of the issue

Create a "max_size" parameter for HDFStore.select() which stops the user from executing queries over RAM limit, or any limit set by user. At the moment, if users try to create a pandas df from this query larger than RAM, the script simply freezes.

This issue was discussed on StackOverflow today: http://stackoverflow.com/questions/39986786/how-to-limit-the-size-of-pandas-queries-on-hdf5-so-it-doesnt-go-over-ram-limit

To quote from the OP comments:

"" Let's say I try df = store.select('df',columns=['column1', 'column2'], where=['column1==5']) and it's larger than some limit in terms of RAM---if the limit is the limit set by the computer's hardward, the program will just freeze. Let's say I wanted to set an arbitrary limit, e.g. 4 GB. The HDF5 might be +TB or PB, so df could easily exceed RAM if a user were to query this object. What limitations could I put in place to stop "bad things" from happening? ""

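For illustration, a minimal sketch of the kind of query that can freeze the script (the store path is hypothetical; 'df' and 'column1' follow the StackOverflow example quoted above):

```python
import pandas as pd

# Hypothetical multi-TB HDF5 file written in table format.
store = pd.HDFStore("store.h5", mode="r")

# If the matching rows do not fit in RAM, this call tries to
# materialize the entire result at once and the script can freeze.
df = store.select("df", columns=["column1", "column2"],
                  where=["column1 == 5"])

store.close()
```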

Expected Output

Perhaps a warning could be raised? Better, a byte-size limit parameter could be added to HDFStore.select(); over this limit, an error would be thrown.
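For illustration only, the proposal amounts to a call like the one below; max_size is a hypothetical parameter that does not exist in pandas, shown here simply to make the suggested behaviour concrete:

```python
# Hypothetical API -- max_size is NOT an existing pandas parameter.
# The idea: estimate the result size before reading and raise an error
# (or warn) if it would exceed the limit, instead of exhausting RAM.
df = store.select("df", columns=["column1", "column2"],
                  where=["column1 == 5"],
                  max_size=4 * 1024**3)  # proposed cap of roughly 4 GB
```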

Output of pd.show_versions()

True for all versions.

jreback commented 7 years ago

and how exactly would you determine this a priori?

evanbiederstedt commented 7 years ago

@jreback The use case I have in mind is either (1) distributing third-party software or (2) setting a "check" to stop myself from making queries that are too large.

In the first case, I may want to set the max_size parameter in the source code, at say 2 GB. If a byte limit is too difficult to measure (and I think it might be), it could go by rows instead: max_size could default to 10**9 rows, adjustable larger or smaller via a user-set parameter.

In case (2), it's the same logic. I may not know a priori how large my query will be on a 200 TB HDF5 file, but I don't want it to freeze my Python script or overload my RAM. So I set max_size equal to 10**6 rows; this stops me from making a query that's too large, and I can then query by chunks instead.
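One way a user or a third-party wrapper can approximate such a cap today is to check the total size of the stored table before querying. A rough sketch, assuming a table-format store whose storer exposes nrows; the cap value is arbitrary:

```python
MAX_ROWS = 10**6  # user-chosen cap, as in case (2) above

# nrows is the total number of rows in the stored table; it is only an
# upper bound on what any query could return, but it costs no query.
nrows = store.get_storer("df").nrows
if nrows > MAX_ROWS:
    raise ValueError(
        f"table holds {nrows} rows, over the {MAX_ROWS}-row cap; "
        "query in chunks instead"
    )

df = store.select("df", where=["column1 == 5"])
```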

jreback commented 7 years ago

I don't see much value in trying to limit the query size a priori - this would be up to the user. You can simply catch MemoryError.
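A minimal sketch of that suggestion (with the caveat that on a machine with a lot of swap the process may grind to a halt before a MemoryError is ever raised):

```python
try:
    df = store.select("df", columns=["column1", "column2"],
                      where=["column1 == 5"])
except MemoryError:
    # Fall back to a smaller or chunked query rather than crashing.
    df = None
    print("result did not fit in memory; try a chunked select instead")
```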

OTOH, if you want to know how much you are going to select, you can use select_as_coordinates, which gives you an indexer: http://pandas.pydata.org/pandas-docs/stable/io.html#advanced-queries

though it would involve making multiple queries, so generally there is no need to do this
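Roughly, that two-step approach looks like the sketch below: first fetch only the coordinates of the matching rows, check how many there are, then pass the coordinates back to select():

```python
# First query: coordinates (row numbers) of the matching rows only.
coords = store.select_as_coordinates("df", where=["column1 == 5"])

MAX_ROWS = 10**6  # user-chosen cap
if len(coords) > MAX_ROWS:
    raise MemoryError(f"query would return {len(coords)} rows")

# Second query: read only those rows (hence "multiple queries").
df = store.select("df", where=coords, columns=["column1", "column2"])
```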

finally, you can do a selection and simply chunk the results, with the chunks sized to fit into memory
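That chunked selection is built into select() via the chunksize argument (or iterator=True); each chunk arrives as a DataFrame sized to fit in memory:

```python
# Iterate over the matching rows in fixed-size chunks instead of
# materializing the whole result at once.
for chunk in store.select("df", where=["column1 == 5"],
                          columns=["column1", "column2"],
                          chunksize=500000):
    process(chunk)  # hypothetical per-chunk handler
```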

so this is a bit out of scope - though if someone wanted to push a reasonable solution, we would probably take it

mroeschke commented 3 years ago

It appears there hasn't been much appetite for this feature from the core team or the community. Thanks for the suggestion, but I agree memory checking shouldn't be a pandas-specific task. Closing, but happy to reopen if there's further interest from the community.