spencerkclark / aospy-db

Adding a database to store output metadata to a hollow version of aospy

Clean up main.py #2

Closed spencerkclark closed 8 years ago

spencerkclark commented 8 years ago

Similar to @spencerahill's comments in a pull request in aospy, I think it would be valuable to restructure the main.py script here to contain as little of the parsing logic as possible and to create all the computations fairly explicitly. We'd still want a fair amount of iteration to take place to populate the database with a fairly large number of rows, but the way main.py is currently set up is a bit opaque, and more importantly it is not possible to run multiple distinct combinations of parameters with one execution of the script:

E.g. we might want to fill the database with JAS-average data for three variables, but then do DJF averages for only one of those variables. It is important to be able to do this in order to set up interesting tests for the queries. Right now it can be accomplished by running main.py once with one set of parameters, editing it, and running it again, but it would be helpful if we could do this all in one go.

spencerahill commented 8 years ago

I am definitely on board with this.

We'd still want a fair amount of iteration to take place to populate the database with a fairly large number of rows

What's the motivation here? Performance? Even with all of the parameter combinations we're iterating over in its current form, is the number of rows even large enough to meaningfully test that? I ask because my intuition is to do the opposite: create the minimal number of aospy objects and parameter combinations necessary to test all of the desired functionality.

e.g. if we want to fill the database with jas average data for three variables, but then only do djf averages for one of those variables.

I agree. aospy.plotting tries to do this, but in a very confusing/hard-to-maintain way (not to mention poorly documented) that I wouldn't necessarily want to emulate here. Basically it uses nested lists to signify at what level within the plotting hierarchy (figure, axes within the figure, plot within one axes set, data needed for a plot) to apply the given parameter.

Perhaps the best entry point is to ask: how would the user specify a set of calculations like this, wherein it is not just a permutation of all the specified parameter options? I think we could create some simple container classes that could be used to specify how a parameter option is meant to be interpreted. E.g. `yr_range=PerRun(1983-1998, 1-10)` would cause the two year ranges to be matched up with the two Runs specified.
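
To make that concrete, here is a minimal sketch of what such a container class could look like (all names here are hypothetical, not existing aospy API):

```python
class PerRun(object):
    """Mark a parameter as taking one value per Run, matched in order."""

    def __init__(self, *options):
        self.options = options

    def pair_with(self, runs):
        """Return a mapping from each Run to its corresponding option."""
        if len(runs) != len(self.options):
            raise ValueError("Expected exactly one option per Run.")
        return dict(zip(runs, self.options))


# E.g. two year ranges matched to two Runs:
# PerRun((1983, 1998), (1, 10)).pair_with([run_a, run_b])
# -> {run_a: (1983, 1998), run_b: (1, 10)}
```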

spencerkclark commented 8 years ago

What's the motivation here? Performance? Even with all of the parameter combinations we're iterating over in its current form, is the number of rows even large enough to meaningfully test that?

Testing performance wasn't really what I had in mind here. What I was more interested in was doing the same calculation(s) for a number of Runs within a Model. My bad for phrasing this as a "fair amount of iteration."

I ask because my intuition is to do the opposite: create the minimal number of aospy objects and parameter combinations necessary to test all of the desired functionality.

For sure, even now I probably have too many rows in the database for what we really need for testing purposes. I think before I add any more functionality I'll work to clean up the main script and set up some simple formal tests.

Perhaps the best entry point is to ask: how would the user specify a set of calculations like this, wherein it is not just a permutation of all the specified parameter options? I think we could create some simple container classes that could be used to specify how a parameter option is meant to be interpreted. E.g. `yr_range=PerRun(1983-1998, 1-10)` would cause the two year ranges to be matched up with the two Runs specified.

This is a big question from an API standpoint with many sub-questions (e.g. spencerahill/aospy#9; spencerahill/aospy#30; what about inter-Run calculations?). At the moment it's hard for me to say what could work best. I think you're on to something with using some container classes, but once I go through the process of distilling the current main script to its minimal elements, maybe I'll have a clearer perspective.

spencerahill commented 8 years ago

Sounds good. Strictly speaking, in terms of testing that iteration is working, even just two of each object type is sufficient. (But that might not be enough to test the more involved queries you have mentioned.)

Sorry I can't help more right now with the main script cleanup -- frankly you're stuck cleaning up the mess I created there (and it is definitely a mess).

spencerahill commented 8 years ago

what about inter-Run calculations?

This too could potentially work through the container classes, e.g. `runs=PerCalc(run1, run2)` for a calc that requires two pieces of data to compute. Or it could be dict-like, so that the input variables are explicitly mapped to the objects from which their data will come. Just thinking out loud.

(And it should be generalized beyond runs: data should be allowed to come from different Models and potentially even Projs.)
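
A rough sketch of both forms (hypothetical names again, just to illustrate the idea):

```python
class PerCalc(object):
    """Mark that a calc's inputs are drawn from these sources, in order."""

    def __init__(self, *sources):
        self.sources = sources


# Container form: a two-input calc whose data come from run1 and run2.
# runs = PerCalc(run1, run2)

# Dict-like form: each input variable is mapped explicitly to its source,
# which generalizes naturally beyond Runs to different Models or Projs.
# inputs = {'temp': model_a_run, 'sphum': model_b_run}
```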

spencerkclark commented 8 years ago

Starting a little further back: how tied are you to the obj_from_name.py methods used by main.py? The reason I ask is that I'm thinking about exploring the possibility of merging the database objects defined in aospy_db.py (e.g. dbProj, dbModel, etc.) with the actual aospy objects (e.g. Proj, Model, etc.). Within the database framework we can natively query for the objects by name, and (as far as I know) there is nothing that says the objects returned couldn't implement the same methods that are already implemented in each of the aospy classes.

I want to say that we could do this in such a way that it wouldn't change much from the user's perspective, while changing some of these back-end processes (and possibly reducing the amount of code), but I'd need to try it first.

Is there any reason you think this would be a bad idea? I might give it a shot in a separate branch.

spencerahill commented 8 years ago

how tied are you to the obj_from_name.py methods used by main.py?

Not at all -- they feel crude to me, and I've kinda always wanted to do something better there.

I'm thinking about exploring the possibility of merging the database objects defined in aospy_db.py (e.g. dbProj, dbModel etc.) with the actual aospy objects (e.g. Proj, Model etc.).

I actually think there is a danger here of too tightly coupling the DB and rest of the functionality. I think the aospy classes are already doing too much (Calc especially) and need to be made more modular.

For example, from a quick glance, it looks like you've included some interaction with the DB within the __init__ method of Model, which couples Model tightly to the DB. What if someone wants to create a standalone Model, e.g. for exploratory or testing purposes, without dealing with a DB? And if we want to potentially support different data-management backends based on user preference down the line, this is also problematic.

So that is to say, we need to come up with a clean interface between the aospy objects and their database counterparts. At the very least, the DB-related code that is now inside the aospy objects should be sectioned off into separate methods, with those methods called, e.g., by __init__ based on a kwarg.
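
Something like this minimal sketch (attribute and method names are hypothetical; the point is only the separation):

```python
class Model(object):
    def __init__(self, name, runs=None, db=None):
        self.name = name
        self.runs = runs if runs is not None else []
        # Standalone Models (e.g. for exploratory or testing use) simply
        # omit the kwarg and never touch any DB code.
        if db is not None:
            self._add_to_db(db)

    def _add_to_db(self, db):
        # All DB-specific logic is confined here, so a different backend
        # could be swapped in without touching the rest of the class.
        db.add(self)
```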

Does that make sense? This is still extremely useful; I'm just trying to avoid getting ahead of ourselves and inadvertently creating problems down the line -- many of the difficulties you're dealing with now are because of ill-advised design decisions I made previously. Thanks!

spencerkclark commented 8 years ago

I actually think there is a danger here of too tightly coupling the DB and rest of the functionality. I think the aospy classes are already doing too much (Calc especially) and need to be made more modular.

I hear you. I'll think about how to abstract things a little more. It was just bugging me that we need to provide the same attributes twice in two classes, but I guess it's something we'll have to live with. I was hoping there might be some sort of object inheritance functionality we could leverage,

What if someone wants to create a standalone Model, e.g. for exploratory or testing purposes, without dealing with a DB?

but if we want to support objects disconnected from a database (or simply not running with a DB at all, which I guess should also remain an option), I'm not sure that would be possible.

spencerkclark commented 8 years ago

Also, if we want to potentially support different data management backends based on user preference down the line, this is also problematic.

This requires that the SQLAlchemy objects returned by queries be transformed into some generic format before being handed back to aospy, which ostensibly we could accomplish by transforming them into aospy objects. That said, at least for Calcs there would not be a one-to-one correspondence between DB rows and Calc objects (since a single Calc object can produce more than one type of time reduction or regional calculation per instance). For the sake of modularity, does it make sense to consider this in the broader refactor of Calc (i.e. require one Calc object per computation, not just per Var)? Then again, for performance that would require some sort of intelligent "scheduler" so that the same variables wouldn't need to be read in again and again (which could be a challenge to implement).
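
For the first part, the translation step could be as simple as this sketch, which assumes the DB classes are declarative SQLAlchemy mappings and that the aospy constructors accept the column names as keyword arguments (both assumptions, not how things currently work):

```python
def to_aospy(db_obj, aospy_cls):
    """Rebuild a plain aospy object from its DB counterpart's columns."""
    attrs = {col.name: getattr(db_obj, col.name)
             for col in db_obj.__table__.columns}
    return aospy_cls(**attrs)


# e.g.
# db_proj = session.query(dbProj).filter_by(name='example').one()
# proj = to_aospy(db_proj, Proj)
```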

spencerkclark commented 8 years ago

Getting back to the primary motivation of this issue, pending slight changes to the import statements due to package restructuring (which I think I'll tackle next), here is a minimal example of a main.py script that should be extendable to do everything we need for the time being:

```python
from aospy_synthetic.calc import Calc, CalcInterface

# Object libraries defining the Proj, Model, Run, and Var objects used below.
import cases
import models
import example_projects
import variables

# Explicitly specify every parameter of a single computation.
test = CalcInterface(proj=example_projects.example,
                     model=models.am2,
                     run=cases.am2_control,
                     var=variables.mse,
                     date_range=('0021-01-01', '0080-12-31'),
                     intvl_in='monthly',
                     intvl_out='son',
                     dtype_in_time='ts',
                     dtype_in_vert='sigma',
                     dtype_out_time='avg',
                     dtype_out_vert=False,
                     level=False)

# Run the computation.
calc = Calc(test)
calc.compute()
```

Let me know if you have any comments on this new main.py script. Ultimately this is just a stepping stone toward implementing formal tests (which I'll work on in the package restructuring). If you're cool with it, I'll consider this issue closed and recommend pushing further discussion about the API of creating Calcs back to aospy proper.

spencerahill commented 8 years ago

This requires that the SQLAlchemy objects returned by queries be transformed into some generic format before being returned to aospy, which ostensibly we'd be able to accomplish by transforming to aospy objects.

Yes. I don't think it's necessary to add an additional layer of abstraction in terms of objects between the DB and aospy. Instead, the additional abstraction needs to be in terms of classes that handle the interaction between the DB and aospy. In the direction from aospy to the DB, this would involve properly storing the aospy object within the DB. In the opposite direction, it would involve converting the specification for desired data (however it is provided) into a query that ultimately returns that data. (I'm not sure whether each direction should be its own class, or whether one class should handle both directions.)
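
For instance, a single class handling both directions might look roughly like this (names and structure hypothetical; assumes a SQLAlchemy session and mapped counterpart classes):

```python
class SQLBackend(object):
    """Mediates between aospy objects and their DB counterparts."""

    def __init__(self, session):
        self.session = session

    def add(self, db_obj):
        # aospy -> DB: store the object's metadata as a row.
        self.session.add(db_obj)
        self.session.commit()

    def query(self, db_cls, **spec):
        # DB -> aospy: turn a data specification into a query, whose
        # results would then be translated back into aospy objects.
        return self.session.query(db_cls).filter_by(**spec).all()
```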

That said, at least for Calc's there would not be a one to one correspondence between DB rows and Calc objects (since single Calc objects can produce more than one type of time reduction or regional calculation per instance).

Good point. Right now, the datatypes aren't symmetric: we save the results of Calcs as netCDF files and then read them back simply as xarray objects. Granted, we save some additional metadata within the xarray objects (part of what makes them so useful), but there is no conversion back into a Calc (which, as you note, isn't necessarily 1-to-1).
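
A small illustration of that asymmetry (file and attribute names made up):

```python
import numpy as np
import xarray as xr

# Saving: a Calc's result goes out as netCDF, with metadata in attrs.
da = xr.DataArray(np.zeros(12), dims=['time'], name='mse',
                  attrs={'intvl_out': 'son', 'dtype_out_time': 'avg'})
da.to_dataset().to_netcdf('mse.av.son.am2_control.nc')

# Loading: it comes back as a plain xarray Dataset, not a Calc.
ds = xr.open_dataset('mse.av.son.am2_control.nc')
```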

For the sake of modularity does it make sense to consider this in the broader refactor of Calc (i.e. require one Calc object per computation, not just per Var)? Then again, for performance that would require some sort of intelligent "scheduler" so that the same variables wouldn't need to be read in again and again (which could be a challenge to implement).

I like this and had thought a little about it in the past (although I had forgotten it until just now): https://github.com/spencerahill/aospy/issues/4 and https://github.com/spencerahill/aospy/issues/33. Take a look at those, the latter especially, and see what you think.

spencerahill commented 8 years ago

Let me know if you have any comments on this new main.py script. Ultimately this is just a stepping stone toward implementing formal tests (which I'll work on in the package restructuring). If you're cool with it, I'll consider this issue closed and recommend pushing further discussion about the API of creating Calcs back to aospy proper.

Yes, this looks really clean -- feel free to close and we can continue the discussion within aospy...maybe spencerahill/aospy#33 is the most logical place.

spencerkclark commented 8 years ago

Great! Yes, I had forgotten about spencerahill/aospy#33 as well. I agree, let's move further discussions over there.