Closed evanbiederstedt closed 3 years ago
and how exactly would you determine this a priori?
@jreback The use case I have in mind is either (1) distributing third-party software or (2) setting a "check" to stop myself from making queries that are too large.
In the first case, I may want to set the max_size
parameter in the source code to, say, 2 GB. If that's too difficult to measure (and I think it might be), let's try rows instead: max_size
could default to 10**9 rows, adjustable up or down by a user-set parameter.
In case (2), it's the same logic. I may not know a priori how large my query against a 200 TB HDF5 file will be, but I don't want it to freeze my Python script or overload my RAM. So, I set max_size
equal to 10**6 rows. This stops me from making a query that's too large. In that case, I can query in chunks, etc.
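The check described above can be sketched in user code today, without any pandas change: select_with_limit below is a hypothetical helper (not a pandas API) that uses the real HDFStore.select_as_coordinates method to count matching rows before materializing any data.

```python
def select_with_limit(store, key, max_rows, **kwargs):
    """Refuse to materialize a query whose result exceeds max_rows.

    select_as_coordinates evaluates the `where` filter but returns only
    an integer indexer, so it is cheap relative to loading the rows.
    `store` must hold `key` in table format with queryable data columns.
    """
    coords = store.select_as_coordinates(key, where=kwargs.get("where"))
    if len(coords) > max_rows:
        raise ValueError(
            f"query matches {len(coords)} rows, over the {max_rows}-row limit"
        )
    return store.select(key, **kwargs)
```

The helper name and the max_rows parameter are illustrative only; nothing like this exists in pandas itself.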
I don't see much value in trying to limit the query size a priori - this would be up to the user. you can simply catch MemoryError.
otoh if you want to know how much you are going to select, you can use select_as_coordinates (http://pandas.pydata.org/pandas-docs/stable/io.html#advanced-queries), which gives you an indexer
though it would involve making multiple queries so generally no need to do this
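A minimal sketch of that approach, where demo.h5, the key "df", and column1 are placeholder names standing in for the multi-TB file discussed in the thread:

```python
import os
import tempfile

import pandas as pd

# build a small demo store in table format with queryable data columns
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
df = pd.DataFrame({"column1": [5] * 8 + [6] * 4, "column2": range(12)})

with pd.HDFStore(path) as store:
    store.put("df", df, format="table", data_columns=True)

    # returns an integer indexer only -- no row data is materialized
    coords = store.select_as_coordinates("df", where="column1 == 5")
    print(f"query would return {len(coords)} rows")  # -> 8 rows here

    # a coordinate index is itself a valid `where`, so the read can be
    # capped, here to the first 5 matching rows
    capped = store.select("df", where=coords[:5])
```

Inspecting len(coords) first is what lets a user enforce their own size limit before any data is loaded.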
finally you can do a selection and simply chunk the results - which can be sized to fit into memory
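Chunked selection uses the chunksize argument to HDFStore.select; the file and column names below are placeholders:

```python
import os
import tempfile

import pandas as pd

# small demo store standing in for a file too large to load at once
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
pd.DataFrame({"column1": [5] * 10 + [6] * 3, "column2": range(13)}).to_hdf(
    path, key="df", format="table", data_columns=True
)

total = 0
with pd.HDFStore(path) as store:
    # chunksize turns select() into an iterator; each chunk is a DataFrame
    # of at most `chunksize` matching rows, so peak memory stays bounded
    for chunk in store.select("df", where="column1 == 5", chunksize=4):
        total += len(chunk)  # stand-in for real per-chunk processing
```

Sizing chunksize to available memory is what makes arbitrarily large query results safe to process.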
so this is a bit out of scope - though if someone wanted to push a reasonable solution would prob take it
It appears there hasn't been much appetite for this feature by the core team or community. Thanks for the suggestion but I agree memory checking shouldn't be a pandas specific task. Closing but happy to reopen if there's further interest from the community.
A small, complete example of the issue
Create a "max_size" parameter for
HDFStore.select()
which stops the user from executing queries over a RAM limit, or any limit set by the user. At the moment, if a user tries to create a pandas DataFrame
from a query larger than RAM, the script simply freezes. This issue was discussed on StackOverflow today: http://stackoverflow.com/questions/39986786/how-to-limit-the-size-of-pandas-queries-on-hdf5-so-it-doesnt-go-over-ram-limit
To quote from the OP comments:
"" Let's say I try df = store.select('df',columns=['column1', 'column2'], where=['column1==5']) and it's larger than some limit in terms of RAM---if the limit is the limit set by the computer's hardward, the program will just freeze. Let's say I wanted to set an arbitrary limit, e.g. 4 GB. The HDF5 might be +TB or PB, so df could easily exceed RAM if a user were to query this object. What limitations could I put in place to stop "bad things" from happening? ""
Expected Output
Perhaps a warning could be thrown? A parameter should be set in
HDFStore.select()
which is a byte-size limit; over this limit, an error is thrown.
Output of
pd.show_versions()
True for all versions.