spencerahill / aospy

Python package for automated analysis and management of gridded climate data
Apache License 2.0

Var.vars should accept Vars that are themselves functions of multiple Vars #3

Closed by spencerahill 6 years ago

spencerahill commented 9 years ago

Currently, if Var is passed a value for func when instantiated, it populates the vars attribute with the Var objects that are passed into the given function. But this is only supported one level deep: the Var objects passed to the function must be stand-alone, i.e. they must not require a function of their own.

The result is an unnecessary proliferation of functions. For example, my flux divergence calculations currently have MSE, DSE, c_pT, gz, and general versions. Each one takes as arguments everything required for the pre-calculation (e.g. temp, hght, and sphum for MSE), computes that quantity by a call to its function, then calls the general function with the result as the argument.

Possible solution: when a Calc object encounters a Var with nested functions like this, it should, from the bottom up, create Calc objects that return the full timeseries of each specified function. At the next level up, i.e. the one taking in this computed data as one of its Var objects, the code that loads data from netCDF will have to be bypassed, since the desired timeseries is already in hand.
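As a rough illustration of the intended behavior, here is a self-contained toy sketch; Var, load_from_disk, resolve, and the example variables are hypothetical stand-ins, not aospy's actual classes or API:

```python
# Toy sketch of the proposed bottom-up resolution; everything here is a
# hypothetical stand-in, not aospy code.

class Var:
    def __init__(self, name, func=None, vars=()):
        self.name, self.func, self.vars = name, func, tuple(vars)

def load_from_disk(var):
    # Stand-in for the code path that loads model output from netCDF.
    return {'temp': 250.0, 'hght': 5000.0, 'sphum': 0.005}[var.name]

def resolve(var):
    """Compute a Var's full timeseries, recursing from the bottom up."""
    if var.func is None:
        return load_from_disk(var)           # stand-alone Var: read from disk
    inputs = [resolve(v) for v in var.vars]  # compute parent Vars first
    return var.func(*inputs)                 # data in hand: no disk read needed

temp, hght, sphum = Var('temp'), Var('hght'), Var('sphum')
# MSE = c_p*T + g*z + L_v*q, a function of three stand-alone Vars.
mse = Var('mse', func=lambda t, z, q: 1004.6*t + 9.8*z + 2.5e6*q,
          vars=(temp, hght, sphum))
# A Var whose input is itself a function of Vars -- the case at issue here.
mse_scaled = Var('mse_scaled', func=lambda m: 0.01 * m, vars=(mse,))  # illustrative
print(resolve(mse_scaled))
```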

spencerahill commented 8 years ago

Data for any Var should be searched for in the following order:

1) As aospy-calculated ts data, saved to disk, matching the specified time/data type/etc.
2) As model-outputted data saved to disk.
3) If not 1) or 2), compute it by loading each of the Vars it depends on and executing the function.

This would lead to a recursive creation of Vars and their associated data until the data for the top-level Var is ultimately created.
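In rough pseudocode (all helper names below are hypothetical, not aospy functions), that lookup order might look like:

```python
# Hypothetical sketch of the proposed three-step lookup order.

def find_saved_calc(var, run):
    return None   # stand-in: pretend no aospy-computed data is on disk

def find_model_output(var, run):
    # Stand-in: pretend the model natively outputs 'temp' only.
    return 300.0 if var.name == 'temp' else None

def get_data(var, run):
    data = find_saved_calc(var, run)          # 1) aospy-calculated ts data
    if data is None:
        data = find_model_output(var, run)    # 2) model-outputted data
    if data is None:                          # 3) recurse, then apply the func
        data = var.func(*(get_data(v, run) for v in var.vars))
    return data
```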

spencerahill commented 8 years ago

@spencerkclark's comment on #44:

> An issue I've encountered, particularly for more involved calculations which involve many variables, is that the way the calculation needs to be carried out varies depending on the model used. For a concrete example take computing the net column heating due to radiative, sensible, and latent heat fluxes. In a comprehensive GCM (like AM2) there is a term due to reflected shortwave radiation (swup_toa); this does not exist in the gray-atmosphere idealized moist model. In addition there are differences in how to handle the evaporative heat flux from the surface in each Model. Currently I address this by creating two separate calc functions and two separate variables linked with each calc function. This is rather inefficient, and I end up with two differently named variables representing the same physical quantity.
>
> Another somewhat related situation that I encounter is that in some cases a variable is already computed in the model, while in others a variable needs to be computed from other variables. This is especially tricky to deal with when that variable is involved in another function; again I need to create separate functions and variables for each Model.
>
> I'm not aware of any current (or numpy-legacy) infrastructure that is (was) in place that works to address this issue. I don't have a particularly good solution to either of these problems in mind at the moment -- I'll add one here if I think of one, but I just wanted to raise this issue while we are thinking about the next refactor of the object library.

There is no such existing infrastructure to deal with any of these issues, but I definitely want to implement it. One potential (partial) route would be to give each model (or run or project) a list of the variables it has natively, and then somehow use that to determine which function is used to compute a given variable.
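One hypothetical shape for that dispatch: register several candidate functions per derived variable, each tagged with the input variables it requires, and pick the first candidate whose inputs are all native to the model. All names (and the toy formulas) below are invented for illustration:

```python
# Hypothetical sketch: choose a compute function based on which input
# variables a given model has natively.  Formulas are illustrative only.

def heating_comprehensive(swdn_toa, swup_toa, olr, shflx, lhflx):
    return swdn_toa - swup_toa - olr + shflx + lhflx

def heating_gray(swdn_toa, olr, shflx, lhflx):
    return swdn_toa - olr + shflx + lhflx   # no reflected-shortwave term

# Candidate (function, required-inputs) pairs for each derived variable.
REGISTRY = {
    'column_heating': [
        (heating_comprehensive,
         {'swdn_toa', 'swup_toa', 'olr', 'shflx', 'lhflx'}),
        (heating_gray, {'swdn_toa', 'olr', 'shflx', 'lhflx'}),
    ],
}

def pick_func(var_name, native_vars):
    """Return the first candidate whose required inputs the model provides."""
    for func, required in REGISTRY[var_name]:
        if required <= native_vars:   # set inclusion: all inputs are native
            return func
    raise KeyError('no way to compute {} from {}'.format(var_name,
                                                         sorted(native_vars)))

am2 = pick_func('column_heating',
                {'swdn_toa', 'swup_toa', 'olr', 'shflx', 'lhflx', 'temp'})
gray = pick_func('column_heating', {'swdn_toa', 'olr', 'shflx', 'lhflx'})
```

This would let a single variable ('column_heating' here) map to different functions per Model, rather than requiring two differently named variables.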

spencerkclark commented 8 years ago

Keeping track of what has been computed (and easily accessing those computed variables) could be a useful front-end feature as well. For instance, could opening or loading a Run be made as smooth as opening an xarray.Dataset (which, analogously, can contain many different variables on different combinations of coordinates)? Printing the repr of a Dataset is handy, and it is really nice to be able to access variables using the dot notation (e.g. ds.temp).
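For reference, a minimal example of the xarray behavior being invoked here (written with the package's current name, xarray):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'temp': ('time', 280.0 + np.arange(3)),
                 'sphum': ('time', 0.01 * np.ones(3))},
                coords={'time': np.arange(3)})
print(ds)        # the repr lists every variable and coordinate at a glance
print(ds.temp)   # dot-notation access to an individual variable
```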

Right now it is a little difficult to parse the directory structure that's created when a Calc is completed (e.g. {$Project}/{$Model}/{$Run}/{$Variable}/*.nc). If you know what you are looking for already it is OK (it would be relatively straightforward to check if a particular nc file existed or not), but if you wanted a broad overview of what was there (at the Run level) you would need to traverse each variable folder and parse out what each filename meant (which is cumbersome).

I feel like you've given this problem (of how to store computed values) some thought in the past (if I recall correctly even considered trying to keep track of additional metadata such as when a variable was computed and which version of a function was used). Have you thought about this more since the inclusion of xarray? Do you think it might be worth re-thinking how we store computed variables before trying to implement the more complicated back-end we envision here?

spencerahill commented 8 years ago

Thanks for bringing this thread/whole project back to life!

> Keeping track of what has been computed (and easily accessing those computed variables) could be a useful front-end feature as well. For instance, could opening or loading a Run be made as smooth as opening an xarray.Dataset (which, analogously, can contain many different variables on different combinations of coordinates)? Printing the repr of a Dataset is handy, and it is really nice to be able to access variables using the dot notation (e.g. ds.temp).

I really like this idea.

> Right now it is a little difficult to parse the directory structure that's created when a Calc is completed (e.g. {$Project}/{$Model}/{$Run}/{$Variable}/*.nc). If you know what you are looking for already it is OK (it would be relatively straightforward to check if a particular nc file existed or not), but if you wanted a broad overview of what was there (at the Run level) you would need to traverse each variable folder and parse out what each filename meant (which is cumbersome).

I agree. That directory structure is also ultimately arbitrary: where does one stop making new directories? E.g. why not have sub-directories of {$Variable} for the years, months, input data type, etc.? Ultimately, I think this should be replaced with a better serialization model.

In the past, I had been thinking about using a formal database, and that still may be part of the mix, but I think you're onto something with the xarray model. Emulating it would require some thought, though. Does the Run object provide dot-notation access to every single saved computation? This could easily lead to 1000s of attributes, which intuitively seems not good. Or does the Run object have each physical Var as one attribute (e.g. am2_cont.t_surf), and within those objects are what's actually been outputted or computed (e.g. am2_cont.t_surf.djf1989-2012)?

> I feel like you've given this problem (of how to store computed values) some thought in the past (if I recall correctly even considered trying to keep track of additional metadata such as when a variable was computed and which version of a function was used). Have you thought about this more since the inclusion of xarray? Do you think it might be worth re-thinking how we store computed variables before trying to implement the more complicated back-end we envision here?

The computation metadata ideas never got beyond "that would be cool someday" status (although that status remains), nor have I thought much about them since the switch to xarray. Yes, a re-assessment is sorely needed.

Does that all make sense? Let me know your thoughts.

spencerkclark commented 8 years ago

> Emulating it would require some thought, though. Does the Run object provide dot-notation access to every single saved computation? This could easily lead to 1000s of attributes, which intuitively seems not good. Or does the Run object have each physical Var as one attribute (e.g. am2_cont.t_surf), and within those objects are what's actually been outputted or computed (e.g. am2_cont.t_surf.djf1989-2012)?

Very good point; a single attribute for every combination of variable / averaging-type would be very messy. I quite like your suggestion of keeping one physical Var per attribute at the Run level. In that form, a Run (among other things) could serve as a container of Datasets (which could be stored in separate files in a single folder associated with a given Run); in other words, keep the current directory structure, but replace each Var directory with a Dataset.
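To make that shape concrete, a toy sketch (this Run class is a stand-in, not aospy's):

```python
import xarray as xr

class Run:
    """Toy stand-in for aospy's Run: one Dataset attribute per physical Var."""
    def __init__(self, name, **var_datasets):
        self.name = name
        for var_name, ds in var_datasets.items():
            setattr(self, var_name, ds)   # e.g. am2_cont.t_surf

# Each saved computation becomes one variable within the Var's Dataset.
t_surf = xr.Dataset({'ann_av': xr.DataArray(288.0),
                     'djf_av': xr.DataArray(285.0)})
am2_cont = Run('am2_cont', t_surf=t_surf)
print(am2_cont.t_surf)   # a single Dataset holding all t_surf computations
```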

A question that emerges for me is how far we can / should go in simplifying the names of the attributes at the variable level (am2_cont.t_surf.???).

I've tried thinking through how we could do this below, but having worked through it, I'm not super enthusiastic about what I came up with (see the drawbacks I list at the end). I'll leave it in as a source for discussion, because I think we should consider making the attribute names simpler, but at the moment it's not clear to me what the easiest way to do so would be (i.e. use xarray as much as we can -- don't reinvent the wheel -- but use it for the right reasons).

Let me know what you think! Thanks.


Regarding specifying the properties of a particular computed DataArray, it might be nicer if we could use the DataArray.sel() syntax (or something like it) for those properties, rather than having to encode them in the attribute name.

For instance, one alternative that xarray provides is to add a coordinate that takes string values:

In [1]: import xray
In [2]: import numpy as np
In [3]: example = xray.DataArray(np.arange(2), coords={'intvl_in': ['3hr', 'monthly']})
In [4]: example.sel(intvl_in='3hr')
Out[4]:
<xray.DataArray ()>
array(0)
Coordinates:
    intvl_in  object '3hr'

In this manner, the list of variable names in the Dataset returned by am2_cont.t_surf could be as simple as ['ann_av', 'ann_ts', 'ann_std', 'djf_av', 'djf_ts', 'djf_std'], with the other metadata kept under the hood.
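For instance (again using the current package name, xarray), the extra metadata could ride along as scalar coordinates and attributes; the field names below are just examples:

```python
import xarray as xr

# The variable name stays simple ('ann_av'); the remaining metadata lives
# in scalar coordinates and attributes under the hood.
ann_av = xr.DataArray(288.0, name='ann_av')
ann_av = ann_av.assign_coords(intvl_in='monthly', dtype_in='ts')
ann_av.attrs['years'] = '1989-2012'
print(ann_av.coords['intvl_in'].item())   # 'monthly'
```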

A challenge associated with this method is that we likely wouldn't want to have multiple coordinates in a Dataset for a single attribute type. In other words, if we had one variable computed using '3hr' and 'monthly' data, but wanted to add another variable computed using '3hr' and '6hr' data, we wouldn't want a coordinate intvl_in_a with values ['3hr', 'monthly'] and a coordinate intvl_in_b with values ['3hr', '6hr'] (one for each variable). We'd instead want a single coordinate intvl_in holding all three values; variable A would return something like NaN for ds.A.sel(intvl_in='6hr'), and variable B would return something like NaN for ds.B.sel(intvl_in='monthly'). xarray handles this automatically by doing an outer join (but this is not necessarily a good thing -- see drawbacks):

In [1]: import xray
In [2]: import numpy as np
In [3]: example = xray.DataArray(np.arange(2), coords={'intvl_in': ['3hr', 'monthly']})
In [4]: example2 = xray.DataArray(np.arange(2, 4), coords={'intvl_in': ['3hr', '6hr']})
In [5]: ds = xray.Dataset({'A': example, 'B': example2})
In [6]: ds
Out[6]:
<xray.Dataset>
Dimensions:   (intvl_in: 3)
Coordinates:
  * intvl_in  (intvl_in) object '3hr' '6hr' 'monthly'
Data variables:
    A         (intvl_in) float64 0.0 nan 1.0
    B         (intvl_in) float64 2.0 3.0 nan

There may be some drawbacks with this method of doing things:

- This method requires that some value be placed in every available slot. While this is OK in the one-dimensional case listed above (where we just fill in one nan value where the variable is undefined for a particular intvl_in) it could get ugly for multidimensional data (particularly time series). E.g. if there were 2 intvl_in options, 2 pressure options, and 2 year range options, yet for the time series we only needed computed data from one set of those, then we could have an array of shape (2, 2, 2, 96, 144, 30, 100), 7/8 of it filled with NaN's. This would be very inefficient disk-space-wise (and memory-wise when reading it into python).

spencerahill commented 8 years ago

I agree: the current method of just appending strings to the file name is not a good way of storing the metadata. I think what matters most, and what is currently lacking, is having this metadata embedded within the file itself, as opposed to only in its file name. Right now the only information saved within the file, other than the coordinate arrays, is the variable name itself. So your proposal is definitely heading in the right direction. Getting this working seems more important than (and orthogonal to) the way the files are saved to disk in terms of directory structure etc.

> This method requires that some value be placed in every available slot. While this is OK in the one-dimensional case listed above (where we just fill in one nan value where the variable is undefined for a particular intvl_in) it could get ugly for multidimensional data (particularly time series). E.g. if there were 2 intvl_in options, 2 pressure options, and 2 year range options, yet for the time series we only needed computed data from one set of those, then we could have an array of shape (2, 2, 2, 96, 144, 30, 100), 7/8 of it filled with NaN's. This would be very inefficient disk-space-wise (and memory-wise when reading it into python).

Yes, this is untenable. Those parameter combinations for which data hasn't been generated should just be a single NaN (or None or False or something analogous), and those parameter combinations that do have data should have an array whose only dimensions are physically meaningful. This doesn't sound possible within the framework of xarray coords.

So I'm back to wondering about using a formal database? As you noted to me once before, it would get unwieldy to store the data within the DB itself, and as such the DB would instead hold paths/pointers of some kind to the actual data on disk. So then the problem of the directory structure, file names, etc. still exists.

But effectively what we're doing is querying various categories of data, and it's all simply AND: get data with variable name == x AND year range == y AND dtype_in == z AND ... We don't necessarily need to expose the DB internals to the user -- i.e. no SQL knowledge necessary (or would we?). I'm already out of my depth here, though; I have effectively zero meaningful experience working with databases.
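To make the pure-AND idea concrete, here is a toy, in-memory version of the kind of query being described (the record fields are hypothetical):

```python
# Toy illustration of the pure-AND query pattern over calculation records.

def find_calcs(records, **conditions):
    """Return the records matching every keyword condition (AND semantics)."""
    return [rec for rec in records
            if all(rec.get(key) == val for key, val in conditions.items())]

records = [
    {'var': 't_surf', 'years': '1989-2012', 'dtype_in': 'ts', 'file': 'a.nc'},
    {'var': 't_surf', 'years': '1983-1998', 'dtype_in': 'ts', 'file': 'b.nc'},
]
print(find_calcs(records, var='t_surf', years='1989-2012'))  # first record only
```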

spencerkclark commented 8 years ago

I think you may be right. Storing this textual metadata in a SQL database is likely the most straightforward way to serialize it in a query-able form (which is basically what I was struggling with in the above post). And, like you mention,

> Getting this working seems more important than (and orthogonal to) the way the files are saved to disk in terms of directory structure etc.

if we have some way of mapping metadata to filenames, it doesn't matter what we name the files at that point as long as the names are unique.

So, running with this idea for the moment:

In your experience working with databases, have you worked with an ORM (Object Relational Mapper) like SQLAlchemy in Python before? I think that may be the way we want to go. SQLAlchemy basically allows you to interface with a database purely within Python, so even within the codebase we have the option of not writing any SQL (at least directly). Without an ORM you'd have to pass all the SQL commands as strings to the database, which is a bit messy (especially if you are making queries with many (4+) conditions, which I suspect we will).

To move forward we would just need to decide on a data model. I don't think we'll have to worry about performance at all (since we likely won't have databases with more than a few thousand rows, and we won't be making many (100+) queries at once), but from a querying perspective one data model might make things simpler than another.

If we wanted each user to have a single database some very basic options would be:

  1. One large table: each row points to a file containing a specific data array. The table has columns that describe the Proj, Model, Run, Variable, intvl_in, pressure, time interval, filename etc. associated with the particular computation
  2. Hierarchical: a table of Proj's that has rows which point to tables of Models, which have rows that point to tables of Runs, which have rows that point to tables of Vars, which point to tables of Calcs

Option 1 (although it would be the simplest to implement) might be off the table if we want to store more metadata about Proj's, Models, and Runs other than just their names. Option 2 would handle that very well. I'd say I'm leaning towards option 2 at the moment, because it basically mimics aospy's current data model within python.
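For what it's worth, a minimal sketch of what option 2 could look like with SQLAlchemy's declarative ORM. Every table, class, and column name below is hypothetical, and the Run, Var, and Calc levels are elided:

```python
# Sketch of the hierarchical data model (option 2); only two levels shown.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, sessionmaker

Base = declarative_base()

class Proj(Base):
    __tablename__ = 'projs'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    models = relationship('Model', back_populates='proj')

class Model(Base):
    __tablename__ = 'models'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    proj_id = Column(Integer, ForeignKey('projs.id'))
    proj = relationship('Proj', back_populates='models')
    # Run, Var, and Calc tables would chain on the same way, with Calc rows
    # holding the output filename plus metadata columns (intvl_in, etc.).

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Proj(name='example_proj', models=[Model(name='AM2')]))
session.commit()
print(session.query(Model).filter_by(name='AM2').one().proj.name)
```

The relationship() calls are what would let a query walk up and down the hierarchy without any hand-written SQL.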

Pending your thoughts, I might try and re-familiarize myself with SQL and SQLAlchemy this weekend by implementing a basic version of option 2 (but have it unlinked from aospy for now and just generate some synthetic Proj's to store). Depending on how far I get I might try and create a separate repo so you can see how it might work.

> We don't necessarily need to expose the DB internals to the user -- i.e. no SQL knowledge necessary (or would we?).

Yes, I think we should be able to accomplish this by writing some methods that wrap database queries (and writes) such that no SQL or ORM knowledge would be required.

How does all that sound to you?

spencerahill commented 8 years ago

This is great overall. I wasn't being modest before -- I have effectively zero meaningful experience working with DBs in Python -- so I will happily defer to you on those details. That being said, I have heard about SQLAlchemy in that Python podcast, and from that and your description it sounds like the right choice, especially given the pure Python interfacing you mentioned.

> Option 2 would handle that very well. I'd say I'm leaning towards option 2 at the moment, because it basically mimics aospy's current data model within python.

Yes I agree.

> Pending your thoughts, I might try and re-familiarize myself with SQL and SQLAlchemy this weekend by implementing a basic version of option 2 (but have it unlinked from aospy for now and just generate some synthetic Proj's to store). Depending on how far I get I might try and create a separate repo so you can see how it might work.

Yes, full speed ahead! I'll try to (for the first time) familiarize myself with SQLAlchemy as well at some point. And yes, best to get the basic mechanics down before trying to integrate into aospy.

spencerkclark commented 8 years ago

> This is great overall. I wasn't being modest before -- I have effectively zero meaningful experience working with DBs in Python -- so I will happily defer to you on those details.

No worries, while I've used SQLAlchemy before, I still have a lot to learn as well. It will be a challenge to integrate it into aospy in a robust way, but it could be well worth it.

> Yes, full speed ahead! I'll try to (for the first time) familiarize myself with SQLAlchemy as well at some point. And yes, best to get the basic mechanics down before trying to integrate into aospy.

Awesome! I put together a new repo with some work on doing this. I've tried to set things up so that it is fully self-contained, meaning that you can check out the repository and run the code locally without changing your $PYTHONPATH [1]. Perhaps we should move further conversations on this issue over there.

I started by creating a "synthetic" version of aospy, which basically operates just as normal aospy would with regard to creating objects and metadata, but does not do any computations [2]. I then created an example Proj, some example Models, example Runs, example Vars, and example Calcs. The main.py script is then used to set up and execute the "computations."

When the main script is run, the Proj's, Models, and Runs are added to the database if needed. Then the "computations" are done; these Calc's are added to the database one by one. Each entry contains what its filename would have been if the computation were actually completed. Var entries are created as needed as Calcs are created.

Overall I'm encouraged. I think this should be doable, but there will be many kinks to work out along the way (and many decisions we'll have to make, particularly about the API). On that note I've added an IPython notebook (which you can view on GitHub) to the repo, where we can share examples of how the API could work using the example objects and database.

Let me know your thoughts and if you think I should modify the setup in the experimental repo. I'm still a bit of a novice when it comes to setting up packages.


[1] Please let me know if you have any issues getting it up and running.

[2] As a side note, I think this would actually be a useful mode to run aospy in for testing purposes down the road. Unfortunately the way I accomplished it here was by deleting large amounts of code, rather than anything formal that would preserve aospy's original functionality, but I figured I would bring this up as something to keep in mind as we refactor calc.py in particular.

spencerahill commented 8 years ago

Wow, thanks for putting this together so quickly. A great first step. I forked it, cloned it, and ran main.py without issue. Yes, let's move all further discussion to that repo. (In fact, this thread veered from the nominal issue topic well before this point!)

Thread on aospy-db basics: https://github.com/spencerkclark/aospy-db/issues/4

spencerahill commented 6 years ago

#263 accomplishes the recursive functionality but does not implement saving/stashing data at intermediate steps for use by other Calcs and/or serialization. Those would be cool someday, but for now I think it's fine to leave this closed.