spencerkclark / aospy-db

Adding a database to store output metadata to a hollow version of aospy
1 stars 1 forks source link

Modularize DB Interactions #5

Open spencerkclark opened 8 years ago

spencerkclark commented 8 years ago

From @spencerahill in #2:

I actually think there is a danger here of too tightly coupling the DB and rest of the functionality. I think the aospy classes are already doing too much (Calc especially) and need to be made more modular.

For example, from a quick glance, it looks like you've included some interaction with the DB within the init method of Model, which causes Model to be tightly coupled to the DB. What if someone wants to create a standalone Model, e.g. for exploratory or testing purposes, without dealing with a DB? Also, if we want to potentially support different data management backends based on user preference down the line, this is also problematic.

So that is to say, we need to come up with a clean interface between the aospy objects and their database counterparts. At the very least, the DB-related code that is now inside the aospy objects should be sectioned off into separate methods, with those methods being called e.g. by init based on a kwarg.

This is a good goal to have from the start. I think it will require a few things to be addressed fully (please let me know if you have any thoughts or recommendations on these steps):

spencerahill commented 8 years ago

This sounds solid to me from a cursory reading. Let me know if our discussion in #2 and spencerahill/aospy#33 would substantially alter any of your thinking here.

spencerkclark commented 8 years ago

6 provides a start at addressing this issue, but some additional questions came up:

The API still needs work. Some things that should be addressed are:

  1. How should the user specify the backend used? Ideally they shouldn't need to provide an argument to every object they create (as is the case now). In general I think this should only need to be done at the project level (and then have all child objects inherit recursively).
  2. Giving the user the option of specifying via a keyword argument whether they want to track an object in the DB is partially set up via this pull-request, but I don't think is working completely. This again should be addressed with some sort of inheritance; any object that is a child (or lower) of an object that is specified not to be tracked, should also not be included in the DB.
  3. How will we handle querying in an abstract manner? A kitchen-sink type method is probably not wise (what I currently have in abstract_db is more of a placeholder). What are the most important queries that a user could make?

In terms of 3., I think we should recall our original motivation from spencerahill/aospy#3

Keeping track of what has been computed (and easily accessing those computed variables) could be a useful front-end feature as well. For instance, could the process of opening or loading a Run be operated as smoothly as opening an xarray.Dataset (which analogously can contain many different variables on different combinations of coordinates)? Printing the repr of a Dataset is handy, and it is really nice to be able to access variables using the dot notation (e.g. ds.temp).

I will need to think more about how we might be able to make this (or something somewhat like it) possible in a clean manner.

spencerahill commented 8 years ago

How should the user specify the backend used? Ideally they shouldn't need to provide an argument to every object they create (as is the case now). In general I think this should only need to be done at the project level (and then have all child objects inherit recursively).

Similar issues have come up for me a few times before. I think we should eventually look into using formal config files, e.g. the builtin configparser library. But in the meantime, yes I agree objects should inherit this from parent objects, and most often Proj would be the level at which this is defined.

Giving the user the option of specifying via a keyword argument whether they want to track an object in the DB is partially set up via this pull-request, but I don't think is working completely. This again should be addressed with some sort of inheritance; any object that is a child (or lower) of an object that is specified not to be tracked, should also not be included in the DB.

Sounds like you need some unit tests to be sure whether or not its working :)

How will we handle querying in an abstract manner? A kitchen-sink type method is probably not wise (what I currently have in abstract_db is more of a placeholder). What are the most important queries that a user could make?

This does seem tricky. Would it make sense to have the querying be implemented at the backend level, if it really is highly platform-specific? Or does that go against the whole point of the exercise? I don't know much here, so I'm merely just providing some food for thought.

spencerahill commented 8 years ago

BTW :+1: for this and #6

spencerkclark commented 8 years ago

Awesome, thanks for the feedback. I'm glad I'm on the right track here.


Similar issues have come up for me a few times before. I think we should eventually look into using formal config files, e.g. the builtin configparser library. But in the meantime, yes I agree objects should inherit this from parent objects, and most often Proj would be the level at which this is defined.

Cool, I wasn't aware of the configparser library -- that looks like an interesting option, but I agree, not super high priority at this stage.

Sounds like you need some unit tests to be sure whether or not its working :)

For sure, I think that should be next on my to-do list after addressing #1. It doesn't really make sense to get bogged down in querying until we're certain we can add and remove from the DB reliably.

Would it make sense to have the querying be implemented at the backend level, if it really is highly platform-specific?

This is a good question. What I'm leaning towards at the moment is, at the backend level, trying to frame things in terms of "what do we want to return based on some user's input?"

For example, one valuable use case of the DB would be to return all Calcs associated with computations computed for a particular run. Another might be a little more specific; maybe return all Calcs associated with computations for a particular run for a particular variable; you could go more specific still (say specify a particular time reduction etc.). Fundamentally, though these are all just the same query, but with different amounts of "and" conditions.

Maybe within the backend they could be implemented along the lines of backend.getCalcs(Run, **kwargs). For each of the above examples (which would be happening at the Run level) you would then just make calls like:

from runs import example
all_calcs = backend.getCalcs(example)
temp_calcs = backend.getCalcs(example, var='temp')
temp_djf_calcs = backend.getCalcs(example, var='temp', reduction='djf')

Even implementing a simple function like this would give us a lot to work with. E.g. you could easily extend this to a query at the model level, for instance, by just iterating over the child Runs; you could also look for a specific Calc (say some pre-computed data to use in another calculation).

So, as you suggest, this would not be trying to abstract generic querying across all DB platforms (which I'm sure you would agree is outside the scope of aospy; e.g. SQLAlchemy does this for different types of SQL databases). Instead the backend would only implement very aospy-specific queries (like the case listed above). As we think of more query-types that would be valuable for aospy, we can implement more, but I don't see a huge reason to be really general from the start.

Does that sound reasonable to you?

spencerahill commented 8 years ago

Yes, I think this is a great way of approaching it!

I have no idea, since I haven't had time to dig into the internals yet, but I'm wondering if it would be useful / interesting / fun to implement this via chained calls? E.g. in your last example temp_djf_calcs = backend.getCalcs(example).getCalcs(var='temp').getCalcs(reduction='djf'). Totally just thinking out loud.

But yes I think for aospy by far the most important use case is just a series of ANDs.

spencerkclark commented 8 years ago

I have no idea, since I haven't had time to dig into the internals yet, but I'm wondering if it would be useful / interesting / fun to implement this via chained calls? E.g. in your last example temp_djf_calcs = backend.getCalcs(example).getCalcs(var='temp').getCalcs(reduction='djf'). Totally just thinking out loud.

I like this idea; I'll see what I can do.