Cleaner abstraction of uniqueness checking and DB row instantiation

spencerkclark commented 8 years ago

This pull request offers some significant improvements:

Basic fix to #1 -- I haven't added all attributes of the Region object yet, but that's pretty straightforward with this PR.
Substantial work on #5 -- I think I've come across a solid solution for uniqueness checking that generalizes easily to all aospy objects (present and future). It requires that each aospy core object implements a hash function and is one-to-one with a row in a database (not currently the way all aospy core objects are implemented), but I think these are good design decisions, and things that ultimately should be implemented in aospy proper eventually. More importantly (although I've yet to do comprehensive testing beyond running the test suite twice), the uniqueness checking works for all objects unlike the code currently in the develop branch. Anything that comes up in comprehensive testing I think will be a simple fix around what's now there.

Some things that still need to be done are:

[x] As mentioned above, some comprehensive testing of the uniqueness checking
[x] A delete function for removing objects from the database
[x] Track the filename attribute in Calc
[x] Comprehensive testing of SQLAlchemyDB.delete function
[x] Ability to turn on and off database tracking for select aospy objects other than Calc objects
[x] Remove unused code from aospy_synthetic core objects
[x] Ability to track Units objects in a separate table in the database referenced by Var
[x] Testing of __hash__ functions for each aospy core object (make sure they provide unique results when you expect them to)
[x] Docstrings for important methods in db submodule
[x] Comprehensive testing of tracking flag in aospy_synthetic core objects

Once these outstanding issues are addressed, I think we can move on to working on the front-end (i.e. querying and returning data) features.

@spencerahill you are welcome to comment on any of these changes if you have time -- otherwise I will go ahead and merge.

spencerahill commented 8 years ago

@spencerkclark I will definitely take a closer look within the next week or so, but on initial glance I am very excited about this. Thanks!

spencerahill commented 8 years ago

I'm a little rusty on all this, so these are questions not necessarily about the most recent changes but about the overall structure/organization. I'm going to email you about going over the more substantive aspects offline...frankly most of this is too far outside my expertise to assess without some hand-holding!

The package structure is confusing to me. Why is abstract_db defined at the aospy level, but the SQLAlchemy backend defined in ./test/test_objs? I think the former makes sense and the latter doesn't. Actually, probably there should be a db sub-package, where both the abstract module and all concrete implementations go (with those also potentially being their own subpackages, if they require more than a ~couple modules.

What is the significance of the sqlite_config module name? Its contents look like they all use SQLAlchemy. Where does the sqlite come in other than DB_PATH? My impression is that

Is there any significance to the _synthetic aospy? Or was it just to avoid name clashes with the main package while you work with this stripped-down version?

PEP8 nitpick: modules should all be lowercase_with_underscores.py, which SQLAlchemyDB.py violates. Probably good (and maybe also PEP8?) to do the same with package and sub-package names, e.g. change SQLBackend also.

spencerkclark commented 8 years ago

I really appreciate you taking a detailed look at this.

The package structure is confusing to me. Why is abstract_db defined at the aospy level, but the SQLAlchemy backend defined in ./test/test_objs? I think the former makes sense and the latter doesn't. Actually, probably there should be a db sub-package, where both the abstract module and all concrete implementations go (with those also potentially being their own subpackages, if they require more than a ~couple modules.

My motivation for putting abstract_db at the aospy level was to make a template for potential future DB backends beyond SQLAlchemy (in other words for them to play nicely with what we implement in the core aospy objects to interact with the backend they would have to implement methods listed in abstract_db). When we talk offline we can discuss whether we think this is ultimately necessary. The abstract_db class does nothing more than strictly enforce that the backend implements the specified methods.

What is the significance of the sqlite_config module name? Its contents look like they all use SQLAlchemy. Where does the sqlite come in other than DB_PATH? My impression is that

I agree this is confusing and am all for changing these names. I guess I would lean towards something like sqlalchemy_config or sqlalchemy_setup for that particular module at the moment, but am open to suggestions.

Is there any significance to the _synthetic aospy? Or was it just to avoid name clashes with the main package while you work with this stripped-down version?

I guess the main distinction is that I've (rather sloppily) disconnected all file input and output methods. Thus all the core objects in aospy_synthetic simply store metadata and interact with the DB. For instance when calc.compute() is called no files are read in and no actual computation is done. But yes, it's mainly just to avoid namespace conflicts.

PEP8 nitpick: modules should all be lowercase_with_underscores.py, which SQLAlchemyDB.py violates. Probably good (and maybe also PEP8?) to do the same with package and sub-package names, e.g. change SQLBackend also.

Good call, SQLAlchemyDB is a rather arbitrary name (and in violation of PEP8 as you mention). I'll try and think of a better option. SQLBackend is an object (not a submodule) so in that case is it OK to leave it the way it is?

spencerahill commented 8 years ago

My motivation for putting abstract_db at the aospy level was to make a template for potential future DB backends beyond SQLAlchemy (in other words for them to play nicely with what we implement in the core aospy objects to interact with the backend they would have to implement methods listed in abstract_db). When we talk offline we can discuss whether we think this is ultimately necessary. The abstract_db class does nothing more than strictly enforce that the backend implements the specified methods.

Got it. I think the most logical structure will be to have a db or backends subpackage (i.e. aospy.db) that houses abstract_db as well as all concrete implementations. The modules/subpackages within that subpackage can then access the aospy-level objects via .. notation, e.g. from ..var import Var.

I agree this is confusing and am all for changing these names. I guess I would lean towards something like sqlalchemy_config or sqlalchemy_setup for that particular module at the moment, but am open to suggestions.

Cool, either sounds fine to me; I guess slight prefrence for _config.

But yes, it's mainly just to avoid namespace conflicts.

Ok -- just wanted to make sure I understood your motivations.

SQLBackend is an object (not a submodule) so in that case is it OK to leave it the way it is?

I was referring to the aospy-db/aospy_synthetic/test/test_objs/SQLBackend directory (which based on the above ultimately belongs in aospy.db. E.g. aospy.db.sqlalchemy_backend?

Thanks!

spencerahill commented 8 years ago

Thanks @spencerkclark, this is rapidly moving towards something really great.

One thing in addition to my latest batch of in-line comments: before the final merge, all of the new classes, functions, methods, packages, and modules need proper docstrings.

spencerkclark commented 8 years ago

👍 for doc strings -- looking forward to the day we have formal documentation!

spencerahill commented 8 years ago

Excited to see you making such rapid progress on this. I'll just hold off on reviewing this all until you let me know it's in a good place for it.

spencerkclark commented 8 years ago

@spencerahill when you get a chance, I think this is ready for another look. I have implemented all the things I listed in the checklist above.

I'll just reiterate here that I've made a few simplifications to the Calc core object

Make it one-to-one with a particular filename (the filename does not need to be unique, however; regional calculations can still be written to a single file).
(For now) have it depend on a single Run core object. Eventually we'll probably want it to be possible to involve multiple runs.

From a code structure perspective I'm open to any suggestions you have. In particular I wonder if you have any ideas regarding how to specify which attributes are tracked in the database and which are not. Currently I use the _metadata_attrs and _db_attrs dictionaries to do this within the DB row objects in sqlalchemy_config.py, but perhaps there is a cleaner way.

But anyway, take your time and please let me know if you have any questions. I realize this PR has gotten pretty big. Thanks!

spencerahill commented 8 years ago

Thanks, will do, hopefully Friday or over the weekend...good task for when all the GFDL computing infrastructure is down.

spencerahill commented 8 years ago

Ok, mostly just style/clarification stuff. This was a less thorough review than last time, but given the solid test suite I trust it's largely good to move forward. I'd say let's merge once these last bits are squared away.

Make it one-to-one with a particular filename (the filename does not need to be unique, however; regional calculations can still be written to a single file).

Just to clarify, you're saying that, unlike currently wherein a single Calc can write multiple output types, e.g. av, and std, in this case there's only one. Right? Was this just for ease of coding or is there an important conceptual/design issue?

(For now) have it depend on a single Run core object. Eventually we'll probably want it to be possible to involve multiple runs.

Ok. Was this just for ease of coding or is there an important conceptual/design issue?

I wonder if you have any ideas regarding how to specify which attributes are tracked in the database and which are not. Currently I use the _metadata_attrs and _db_attrs dictionaries to do this within the DB row objects in sqlalchemy_config.py, but perhaps there is a cleaner way.

I'm not sure I can provide much useful insight here. Maybe once I get more time to play around with the DB I'll think more about this.

spencerkclark commented 8 years ago

Sounds good. Thanks for having a look. I'll address these in my next commit.

To answer your questions:

Just to clarify, you're saying that, unlike currently wherein a single Calc can write multiple output types, e.g. av, and std, in this case there's only one. Right? Was this just for ease of coding or is there an important conceptual/design issue?

Yes, that is currently the way things are done. In particular the setup is "one filename to one Calc object" (that is the way that the hash function is currently set up to test for uniqueness). That's mainly just for convenience -- uniqueness of Calc objects could be determined based on a smaller subset of attributes (a filename as currently constituted contains a lot of information). However, we'd then require another database table of children of Calc objects (which would contain the filenames instead of the Calc objects themselves). This would be a simple one to many relationship (Calc to filenames), which is implemented numerous times (and tested) already.

Ok. Was this just for ease of coding or is there an important conceptual/design issue?

This is ease of coding as well. This requires a many-to-many relationship, but I think that should be OK. I haven't implemented such a relationship before, but I that seems like a fairly standard data model, so at the moment I don't see any conceptual / design roadblocks there either.

In the end I think the DB code can be adapted to however the core of aospy is designed. I haven't completely thought through the pros and cons of Calc being one to one with a file within the aospy core versus not, but that's probably outside the scope of this PR.

spencerahill commented 8 years ago

the setup is "one filename to one Calc object"

In the end I think the DB code can be adapted to however the core of aospy is designed. I haven't completely thought through the pros and cons of Calc being one to one with a file within the aospy core versus not, but that's probably outside the scope of this PR.

I agree. The plan for the Calc refactor was for there to be a unique object for each output file, c.f. https://github.com/spencerahill/aospy/issues/33. So potentially this logic could just be put at that level.

In fact, I'm wondering if we'll get diminishing returns working on the DB for now when Calc is soon to be refactored? Your call though. There's just a danger it seems of the actual aospy objects and the mocked ones here diverging and that requiring you to re-wire a lot of logic.

At any rate, massive thanks for this PR -- I'd say definitely merge once after this batch of commits. This is a huge step forward for the project.

spencerkclark commented 8 years ago

In fact, I'm wondering if we'll get diminishing returns working on the DB for now when Calc is soon to be refactored? Your call though. There's just a danger it seems of the actual aospy objects and the mocked ones here diverging and that requiring you to re-wire a lot of logic.

I agree -- my current plan is not to worry about experimenting with connecting the DB and the fully-featured aospy core until we do the refactor. The primary takeaway with this PR is the design pattern of the unified uniqueness checking and construction of DB objects from aospy core objects; this should be abstract enough to handle whichever paradigm we choose to go with in Calc.

In my mind, there are still some additional steps before connecting with the fully-featured aospy core that are fairly independent of what happens with Calc (so I think if I have time between now and the refactor, these would still be useful to pursue):

Addressing #8 (figuring out how we'd like to make abstract DB counterparts of the aospy core objects)
Experimenting with recovering aospy core objects seamlessly from the database (i.e. aospy objects in, aospy objects out)
Implementing a many-to-many relationship between RunDB and CalcDB.

I'll probably make separate issues for the last two bullet points eventually. Complex querying and beyond should probably wait until after we connect to aospy proper.

I really appreciate your help guiding me through the package structure and test-writing portions of this PR; those aspects are now significantly better than what I had before.

spencerahill commented 8 years ago

All sounds good. Will stay in touch throughout this month, and I'll be back in Princeton 26 July-2 August, so let's definitely meet up then. Traveling through ~14 August, so I will take on the refactor in earnest beginning that following week.

spencerkclark / aospy-db

Cleaner abstraction of uniqueness checking and DB row instantiation #7