DOC: HDF access limitations and Comparison of SQL/HDF #9446

rnelsonchem commented 9 years ago

I think the documentation for HDFStore could be more explicit about the limitations of accessing data from an HDF file. For example, it is not at all clear from the docs that selecting several columns from a saved table reads the entire table into memory. Specifically, one might be tempted to do the following to select 3 columns from a very large table:

h5 = pd.HDFStore('test.h5')  # an open store; the file name is assumed for illustration
df = h5.select('test', columns=['col2', 'col7', 'col128'])

However, that process will potentially read a very large portion (if not all) of the table into memory, which is obviously a problem if the table doesn't fit into memory. Instead, one should really do something like this:

columns = ['col2', 'col7', 'col128']
# stream the table in chunks so only `chunksize` rows are in memory at a time
df = pd.concat([tmp for tmp in h5.select('test', columns=columns, chunksize=50000)])

I think it also might be helpful to include some sort of comparison of HDF and SQL solutions for table storage. Maybe we could solicit use cases from the mailing list for successful applications of one or the other. I've been using HDF files for several years, though never before with tables containing many, many columns, which tripped me up in my example above. I almost turned to SQLite, because the equivalent SQL selection is very simple and has a reasonable memory footprint:

import sqlalchemy
import pandas as pd
engine = sqlalchemy.create_engine('sqlite:///test.db')
df = pd.read_sql_query('select col2,col7,col128 from test', engine)

However, I profiled this and it is several times slower, so I'll probably stick with HDF because I am much more familiar with the format and prefer the access speed for large tables.

Anyway, just a couple of thoughts. I'm happy to try to put some of this into the docs, if people think it is appropriate.

jreback commented 9 years ago

you should have a look here: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations

as to your first point - this is a known limitation of row-oriented tables and is explicit in the PyTables docs - if it's not in the caveats section you could certainly add it

rnelsonchem commented 9 years ago

@jreback I will look into adding something about this limitation in the caveats section and a couple of notes in other sections. However, I looked around in the PyTables documentation, and I couldn't find an explicit reference to this issue with HDF files. It would be nice to include a link if you know of one.

Regarding the comparisons issue... I had looked at that performance results page. However, the difference is not so striking if you only need to select a small subset of rows from a larger-than-memory data set. For example, on my machine I found that HDF outperforms SQL by a factor of ~3 when selecting a subset of rows from a large table: http://nbviewer.ipython.org/gist/rnelsonchem/e6fa80e71365a0981c59 Also, the SQL selection example is pretty straightforward, but I had to do a bit of tinkering to get the HDF example to work without overloading my memory. Still, a factor of three is a factor of three.
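For reference, a minimal sketch of the kind of row-subset timing run in that notebook (the file names, the table key 'test', and the column 'col0' are illustrative assumptions; for the where= clause to work, 'col0' must have been stored as a data column or be the index):

import time
import pandas as pd
import sqlalchemy

# HDF: filter rows on disk via the table's query machinery
start = time.time()
with pd.HDFStore('test.h5') as h5:
    hdf_df = h5.select('test', where='col0 > 0.99')
print('HDF: %.2f s' % (time.time() - start))

# SQL: the equivalent filtered read through SQLAlchemy
engine = sqlalchemy.create_engine('sqlite:///test.db')
start = time.time()
sql_df = pd.read_sql_query('select * from test where col0 > 0.99', engine)
print('SQL: %.2f s' % (time.time() - start))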

I guess I wasn't thinking of a read/write timing comparison, exactly, but I thought there might be use cases where one solution has advantages over the other. Pandas has nice support for both options, but from the docs/performance results, I get the impression that HDF is preferred in every case. I haven't worked with SQL much, so I wasn't sure whether there are cases where that isn't true. (For example, it seems like things like merging/joining require a bit of extra work with HDF.) If so, it might be useful for new users to have some guidance on successful use cases for the two solutions. Perhaps as another section in the cookbook alongside the HDF/SQL performance comparison, or even a link to a preexisting page. I did see this discussion, which is a start: https://groups.google.com/forum/#!topic/pydata/S3kLxyrizkI

Just a thought.

jreback commented 9 years ago

see the read method here: http://pytables.github.io/usersguide/libref/structured_storage.html#table-methods-reading

You are dealing with a row-oriented structure, so by definition it will read in all columns (with the exception of reading a single column via the special method).
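To make that concrete, this is roughly what the single-column path looks like at the PyTables level (the file name and node path are assumptions; note that pandas only stores a column as its own field if it was declared a data_column, otherwise it is packed into a shared values_block_N field):

import tables

with tables.open_file('test.h5', mode='r') as f:
    tbl = f.root.test.table        # pandas stores a table-format frame under '<key>/table'
    col2 = tbl.col('col2')         # reads just this field as a NumPy array
    # equivalently: tbl.read(field='col2')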

You may want to look at using bcolz, a columnar store. blaze and into provide easy access to these types of structures (as well as conversion between many other types).
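As a hedged sketch of that columnar route (the in-memory frame df and the rootdir are assumptions): each column lives in its own on-disk carray, so pulling three columns touches only those three:

import bcolz
import pandas as pd

# write an on-disk columnar store from an in-memory frame
ct = bcolz.ctable.fromdataframe(df, rootdir='test.bcolz')

# read back only the needed columns; everything else stays on disk
sub = pd.DataFrame({name: ct[name][:] for name in ['col2', 'col7', 'col128']})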

Generally HDF5 is vastly superior to anything SQL-like when modify/update is not a concern and the data is mostly read/append-only. You have to be careful about the limitations inherent in the data storage formats, though. Pandas provides access to these features, but users are in charge of the data arrangements.
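A short sketch of that read/append-only pattern (the CSV source, chunk size, and column names are assumptions; data_columns is what makes those columns individually queryable on disk):

import pandas as pd

with pd.HDFStore('test.h5') as h5:
    # append-only ingestion in bounded chunks
    for chunk in pd.read_csv('big.csv', chunksize=500000):
        h5.append('test', chunk, data_columns=['col2', 'col7'])

    # later reads can filter rows on disk via the data columns
    df = h5.select('test', where='col2 > 0')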

So, additional docs are always helpful, e.g. as an update to the SQL comparison. However, pandas can't (and shouldn't try to) provide guidance beyond the common cases.