spencerkclark / aospy-db

Adding a database to store output metadata to a hollow version of aospy

Would a pure pandas (or xarray) backend be feasible/potentially useful? #9

Open · spencerahill opened this issue 8 years ago

spencerahill commented 8 years ago

Occurred to me that this could be a right-sized solution for relatively small projects: something like a single DataFrame with one column for each unique Model/Run/Var combination and one index entry for each unique output datatype, with each cell either empty or holding the path to the location of that file serialized on disk (as netCDF).

Or we could potentially implement the same thing within xarray using a Dataset. It could also be split into multiple DataFrames or Datasets, e.g. by Model or Run.
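For concreteness, a minimal sketch of that layout (the model, run, variable, and path names here are all hypothetical):

```python
import pandas as pd

# One column per Model/Run/Var combination, one row per output datatype;
# each cell is either empty or the path to the serialized file on disk.
df = pd.DataFrame(
    index=['av', 'ts', 'reg.av'],                     # output datatypes
    columns=['am2.control.mse', 'am2.control.temp'],  # Model/Run/Var combos
)
df.loc['av', 'am2.control.mse'] = '/path/to/am2.control.mse.av.nc'
print(df)
```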

spencerkclark commented 8 years ago

Have you thought about using a DataFrame with multiple column levels? My gut says this would be the way to go, because it would allow us to query at different levels of the hierarchy (rather than needing to know exactly which Model/Run/Var you wanted to look at). I think you could have Model, Run, Var, and DataType levels; this would lead to a Cartesian-product number of columns (kind of messy, I know...), each with a single cell below it (either empty or containing a filename). One could then arrive at the structure you are envisioning by using df.stack(level=['DataType']), but with additional flexibility in how you view the data. With a multi-level DataFrame, I think it would be cleaner to code up queries for things like "all computations involving MSE for a given Run" or "all annual-average computations for a given Run", etc. A sketch follows below.
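Roughly what I have in mind (a sketch only; the model, run, and variable names are made up):

```python
import pandas as pd

# Columns carry a four-level hierarchy: Model, Run, Var, DataType.
columns = pd.MultiIndex.from_product(
    [['am2', 'cm3'],          # Model
     ['control', 'warming'],  # Run
     ['mse', 'temp'],         # Var
     ['av', 'ts']],           # DataType
    names=['Model', 'Run', 'Var', 'DataType'],
)
df = pd.DataFrame([[''] * len(columns)], columns=columns)

# Record the path to one serialized result.
df.loc[0, ('am2', 'control', 'mse', 'av')] = '/path/to/am2.control.mse.av.nc'

# Query at a coarser level of the hierarchy, e.g. all computations
# involving mse for a given run:
subset = df.xs(('am2', 'control', 'mse'), axis=1,
               level=['Model', 'Run', 'Var'])

# Recover the one-column-per-Model/Run/Var structure, with one row
# per output datatype:
stacked = df.stack(level='DataType')
```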

I think this is a solid lightweight solution. It gets complicated if we want to store additional metadata (beyond names) of the aospy core objects, but as a one-way¹ structure to track things that have been computed, I think it works and would be useful for things like interactive data retrieval and plotting. It's also great that it leverages existing dependencies of aospy, so it could be a good fallback if users do not want to install sqlalchemy.


¹ If we had all metadata from the core objects (not yet true even for the sqlalchemy version, but I think it could be eventually), the backend could potentially be two-way: aospy core objects could be recovered completely from the database, and so could even be deleted from a user's aospy_user directory and reconstituted later if needed. This is in contrast to one-way, where one cannot start from the backend and drive a computation in aospy.

spencerkclark commented 8 years ago

For serialization, maybe we could think about using feather? It's possible netCDF could work, but I'm not sure how it would handle multi-level DataFrames. If it does work, though, I think netCDF would be ideal.

spencerahill commented 8 years ago

> It's also great that it leverages existing dependencies of aospy, so it could be a good fallback if users do not want to install sqlalchemy.

Yes, this is partly what made me think of it.

I wasn't even familiar with the multi-level DataFrame capability, which sounds like a great solution. Thanks!

> For serialization, maybe we could think about using feather? It's possible netCDF could work, but I'm not sure how it would handle multi-level DataFrames. If it does work, though, I think netCDF would be ideal.

Feather is one option, although then we're not achieving the no-extra-dependencies goal. I was actually thinking of just pickling via pandas.DataFrame.to_pickle. To go the netCDF route, the most straightforward pathway would seem to be pandas.DataFrame.to_xarray followed by conversion to netCDF. Or there's the DataFrame.to_hdf method, but I don't have any experience with (non-netCDF) HDF data.
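For what it's worth, a minimal sketch of the pickle round-trip (the filename and contents are hypothetical). The netCDF pathway via to_xarray is included but left commented out, since how it handles multi-level DataFrames is exactly the open question:

```python
import pandas as pd

# Pickle round-trip: needs nothing beyond pandas itself.
df = pd.DataFrame({'path': ['/path/to/am2.control.mse.av.nc']},
                  index=['av'])
df.to_pickle('aospy_tracker.pkl')
restored = pd.read_pickle('aospy_tracker.pkl')
assert restored.equals(df)

# The netCDF route would go through xarray; writing string data to
# netCDF depends on the backend engine, so this step is left untested.
ds = df.to_xarray()  # DataFrame -> xarray.Dataset
# ds.to_netcdf('aospy_tracker.nc')
```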

So basically, I'm not sure, and this isn't just a side issue: the backend is not viable unless serialization works well.

spencerkclark commented 8 years ago

I hadn't thought about pickle; I think that might be the only option that fits all our specifications. I did some experimentation, and so far it's the only serialization method I've tried that both round-trips a multi-level DataFrame faithfully and requires no dependencies beyond pandas.

The only drawback of pickle is that it is not human-readable and cannot be read by other software, but I don't think that's a huge deal in this case.

spencerahill commented 8 years ago

Thanks for digging into that. Yes, pickling is definitely suboptimal, but based on your notebook it seems to be the only viable option within the no-extra-dependencies constraint. So we'll call that decision made.

I'd say let's get the SQLAlchemy backend relatively polished, and then we can return to this, no?