iiasa / ixmp4

A data warehouse for high-powered scenario analysis in the domain of integrated assessment of climate change and energy systems modeling
https://docs.ece.iiasa.ac.at/ixmp4
MIT License
10 stars 6 forks source link

Make common DB-table components Mixins #82

Closed glatterf42 closed 2 months ago

glatterf42 commented 5 months ago

As noted by @danielhuppmann in https://github.com/iiasa/ixmp4/pull/79#discussion_r1565829200, several table-classes in our DB layer have common columns/functionality. This will become especially obvious with optimization.Table and optimization.Parameter. However, I think it's better not to have Parameter inherit from Table because later changes to Table might then cause non-obvious changes to Parameter. Instead, I've opted for a mixin approach with this PR: I have extracted several common columns to ...Mixin classes in data/db/base.py and whichever subclass needs them can simply inherit from these mixins. This even works for @validates and relationships, at least in theory :)

In practice, I couldn't quite extract the relationships between Table and Column elegantly. Please see the TODO marker and if you have a good idea for that, let me know. Also, these changes changed the order in which some columns in the result of list() operations appear. See e.g. core/test_table but also api/test_model. However, the former is not an elegant solution either way and the second is hardcoded with magic numbers, which we probably wouldn't recommend for production code, either. So I think these changes are fine.

meksor commented 4 months ago

Hm so if we really want to do this then I would have a couple of nitpicks: The "Name" Mixins should probably be called "HasName" and "HasUniqueName". The timestamp mixin might soon include a "updated_at/by" field so I would call it "HasAuditInfo" and include the functionality required to set those fields in the mixin. I'm not sure if it is best to just add a method "set_audit_info" or if we should use the sqlalchemy signal system for that, depends on what works best.

The "HasSomething" naming convention is straight from the sqlalchemy docs page on mixins: https://docs.sqlalchemy.org/en/20/orm/declarative_mixins.html

"OptimizationDataMixin" should probably be in optimization/base not top level base.

It also seems like at least "IndexSet" (also "Scalar"?) has now lost it's audit info, and that is kind of what I see as the main problem with this design pattern. In this case (other than with the repositories) there is no abstract base class or protocol for the model preventing you from forgetting to implement things. So while the code now might be "easier to read" it is only so when you are not trying to read code hidden in some mixin in some other file somewhere...

meksor commented 4 months ago

I also generally feel that all the mixins comprised only of a single added column are better represented by customizing the type_annotation_map (https://docs.sqlalchemy.org/en/20/orm/declarative_tables.html#customizing-the-type-map) or by just declaring a global variable in the db module. For example:

class AClass(...):
   run__id: types.RunId

or

class AClass(...):
   run__id: types.Integer = db.run__id_fk

But then again, I would not do this at all except for the OptimizationDataMixin.

glatterf42 commented 4 months ago

I see your point about potentially making the code harder to read because one has to visit the location of mixin definitions to see what they actually contain. That's why I kept all of them in data/db/base.py. If you prefer, we can call every Mixin "Has...Mixin" and I can try to use the type map annotations as you suggest, though this would still require us to write name: types.string (or so) everywhere explicitly, right? With the audit information, especially if we intend to expand it in a way that adds these columns to every table definition that already has creation information, I think making this a Mixin becomes increasingly useful. What makes you say that IndexSet and Scalar are missing this information? For them, I added the TimestampMixin to the optimization/BaseModel, which both inherit from, so these columns should be inherited, too, if I'm not mistaken.

meksor commented 3 months ago

I see your point about potentially making the code harder to read because one has to visit the location of mixin definitions to see what they actually contain. That's why I kept all of them in data/db/base.py.

Yeah, doesnt really matter if they are in the same file or not you have to jump between files in any case...

If you prefer, we can call every Mixin "Has...Mixin" and I can try to use the type map annotations as you suggest, though this would still require us to write name: types.string (or so) everywhere explicitly, right?

Yes, imo thats a good thing. I would omit "Mixin" at the end for the things that are simply data-related, but a matter of taste i guess.

With the audit information, especially if we intend to expand it in a way that adds these columns to every table definition that already has creation information, I think making this a Mixin becomes increasingly useful.

I guess so, yeah. But I still think the functionality to set these fields should be in the Mixin as well. Not only are we seperating functionality from its data but right now we are even associating it with /unrelated/ data.

What makes you say that IndexSet and Scalar are missing this information? For them, I added the TimestampMixin to the optimization/BaseModel, which both inherit from, so these columns should be inherited, too, if I'm not mistaken.

Sorry I was mistaken I overlooked the inheritance in base.py. But this is also exactly my problem with this change... I now have to jump between THREE files to figure out whats going on...

glatterf42 commented 3 months ago

Okay, please help me understand what's going on here. I've been trying to adhere to the DRY principle with this PR and I think the proposed solution works well. True, you have to look at multiple files to fully understand what's going on, but I'm not sure this is a problem or that I know a similarly DRY solution without this. After all, my goal here is to make some common attributes more abstract so that they can elegantly be included in several locations.

In our current implementation, you already have to jump between two files to fully understand our table definitions: the table's file and data/db/base.py. The latter includes information such as the id Column, arguably the most important column of a table. Wherever possible, I tried to expand this same file, so that the number of files you have to study to understand things wouldn't increase by much. In one case, this wasn't the most elegant solution, I thought, since all optimization tables share attributes just between them (such as table_prefix, which is already contained in another file, data/db/optimization/base.py). By changing the inheritance in this common file, I wanted to make the code even more DRY while still not really increasing the number of files you'd have to look at. But if this file is one too many, we can of course change the inheritance of HasOptimizationName and HasAuditInformation so that every optimization_ table directly inherits from them. I think this discussion also points to a deeper question: how easy should it be for outsiders to understand (and possibly change) our code? From my experience, I am used to having to look at various places to understand the structure of some code completely. That's not even a bad thing, I would argue, as long as the code is not unnecessarily complex: we probably want people who make changes to our code to work thoroughly and understand what is going on before making changes rather than enabling them to make quick changes that break or duplicate stuff without realizing so. So from my perspective, I would rename the mixins as you suggest and possibly adapt how the otimization_ tables inherit them, but essentially keep this PR as it is. We could also include some kind of docstring for the tables that lists all attributes or so (maybe we can use the existing abstract classes for that) so that all of them are immediately visible. We could even state where they are inherited from, if that's desired.

As for the audit information, I'll have to look at the code more closely again. Your latest comment sounds like we already have the functionality for setting these fields, in which case it should of course be included with the Mixin. If we already have it, I'll try to make it happen; if we don't already have it, please clarify: are you saying we can't have the functionality in the Mixin?

meksor commented 3 months ago

Okay, please help me understand what's going on here. I've been trying to adhere to the DRY principle with this PR and I think the proposed solution works well. True, you have to look at multiple files to fully understand what's going on, but I'm not sure this is a problem or that I know a similarly DRY solution without this. After all, my goal here is to make some common attributes more abstract so that they can elegantly be included in several locations.

Yes and I have replied by suggesting other ways to obtain the same result. Is there really so much of a difference from a DRY perspective between always inheriting some class or always having a declaration in the class body? (Other than that the former requires more refactoring for a change?)

In our current implementation, you already have to jump between two files to fully understand our table definitions: the table's file and data/db/base.py. The latter includes information such as the id Column, arguably the most important column of a table. Wherever possible, I tried to expand this same file, so that the number of files you have to study to understand things wouldn't increase by much.

Yes thats true, but that one field is required for all models so the Repository functionalties/api paradigms work. It also literally applies to all of the models so in essence its a ground truth and not something like a mixin. I imagine a world where all models inherit from "HasId" and I dont like it.

I think this discussion also points to a deeper question: how easy should it be for outsiders to understand (and possibly change) our code? From my experience, I am used to having to look at various places to understand the structure of some code completely. That's not even a bad thing, I would argue, as long as the code is not unnecessarily complex: we probably want people who make changes to our code to work thoroughly and understand what is going on before making changes rather than enabling them to make quick changes that break or duplicate stuff without realizing so.

Hmm, maybe... It does kind of seem like the approach super market planners use to get people to visit every single aisle even though they only need a few things ^^ I think the goal should be to make the code as simple as possible, i think the mean of "simple" however is not easily defined. I think this mixin business makes the code mostly more complex actually. From my perspective:

Yes the base.Creator class has a function called get_creation_info which is used here and there. I think for the mixin, the best approach would be to have methods that set these fields. Unfortunately you will have to pass these methods "backend.auth_context" or a username which is not super elegant, but not so bad i think.

glatterf42 commented 2 months ago

Thank you @meksor, I'm afraid I will need a bit more clarification. I see your point about NameMixin, OptimizationNameMixin and RunIDMixin and have converted them to Annotated objects in db/__init__.py so that they can more explicitly be mentioned in our table classes without having to repeat the details. Please note that for the optimization tables, I still kept name and run__id only in the common base class BaseModel so as to avoid repetition.

However, I think I don't fully understand what you mean by making OptimizationDataMixin (and potentially UniqueNameRunIDMixin) common base classes. I moved them from their previous location to data/db/optimization/base.py, where one inherits from the other as they become more specialised. Essentially, as they are now, the optimization BaseModels are named after the smallest common denominator. This already seems to go against everything I've read in https://python-patterns.guide/ that you once recommended, so I'm not sure this is what you were talking about. Something in this configuration is still wrong, too, causing all tests to fail. The error message I receive locally says that I need to provide an inheriting condition between 'optimization_uniquerunidbasemodel' and inheriting table optimization_optimizationdatabasemodel', which I have not looked into further at the moment because I think I'm misunderstanding you. I have also thought about making these just classes, without inheritance from each other. Then we would have an OptimizationDataBaseModel, which e.g. optimization_table could inherit from in addition to its usual BaseModel, but this would simply be a Mixin without calling it such. So please tell me: what exactly did you have in mind for making these common base models?

At the moment, I have left the TimestampMixin untouched. For what we currently have -two columns that should always have the same type etc-, sqlalchemy's docs seem to recommend the solution we now have for Name and UniqueName. Alternatively, we could follow this approach -called 'Augmenting the Base'-, which does again look like a Mixin. We could also make the columns type annotation map entries and decide later (when we gain additional functionality) how to handle that, but this feels as good a time as any. Please let me know if you have any preference here.

meksor commented 2 months ago

Of course.

However, I think I don't fully understand what you mean by making OptimizationDataMixin (and potentially UniqueNameRunIDMixin) common base classes. I moved them from their previous location to data/db/optimization/base.py, where one inherits from the other as they become more specialised. Essentially, as they are now, the optimization BaseModels are named after the smallest common denominator. This already seems to go against everything I've read in https://python-patterns.guide/ that you once recommended, so I'm not sure this is what you were talking about.

Yes, I agree, I think this is not a good idea and we should not abstract this in that way, the code you proposed has the exact same problem ("Essentially, as they are now, the optimization BaseModels are named after the smallest common denominator."). I think this is a problem with where you are trying to abstract and not the abstraction pattern. And additionally I would argue the proposed code is also actually preferring inheritance over composition since you are getting rid of one attribute "an object composition" and substituting it with a literal inheritance. The first bold appeal in this guide is "Favor object composition over class inheritance."

I sent you this site under the explicit pretense that all these things are not to be done without compromise. I think there is no /programming bible/ and there never will be. We always have to keep the pragmatics in mind imo. I also think that's what's going on here: we are doing this because we want to DRY under any circumstance, are we not? I read these things to train my intuition about building software and diversify my approaches, not to use them as rule-books. Specifically as far as I remember we were talking about the repository classes:


class RegionRepository(
    base.Creator[Region],
    base.Deleter[Region],
    base.Retriever[Region],
    base.Enumerator[Region],
    abstract.RegionRepository,
):
    ...

Here this approach makes sense in my opinion. Though it is still not what people usually mean by composition, and if you remember: we identified the mixin pattern as described as a "dodge" in that text. Why does this "composing classes together on the top level" make sense here? Well, because the functionality in these classes (they are technically not even mixins, since they all inherit from BaseRepository) works for /all/ repostiories and this approach makes it way easier to implement new functionality without enabling it by default for /every/ model.

Want to add bulk insert functionality? Add the BulkInserter and for explicity and type hinting: add the methods to the class body (although im not sure if that last part really makes sense or is required, i dont remember). Want to remove deletes? Remove the Deleter and the delete method. Want custom behaviour? Just add some methods to the class.

So why not have CreatorDeleterRetrieverEnumeratorBase for RegionRepository and UnitRepository and where ever else it might fit? Well obviously, as soon as one of the child classes of that abomination needs to add or remove some functionality, you need to make sure you don't accidentally change something else at the same time and in the worst case you will end up with a file of boiler plate code that just stinks of spaghetti and is helping noone. It's also just way easier to read imo.

So what would composition look like? Well, i guess it would mean that the Repositorys __init__ function would instantiate a few sub-objects so that we can call:

We also use actual composition already sometimes: I would argue that the Backend class is composed of all the repositories. In the previous ixmp Backend was a huge class that defined all the methods from all the current repositories which looked very confusing and pollution-prone to me. I would argue the filter_class attr on the repository is composition, and it works well. If you set it, it's used to select the model, otherwise you have to define the select() method.

Alright, I felt like it needed this tangent to get us on the same page.

In the former proposed code we can see that all of the sudden, functionality from the part of the program that was explicitly put in the optimization module ended up in the top-level base module. We can also see that you had to use this declared_attr.directive thingy because now that you have put all these fields in different classes they are no longer accessible from other class body scopes. (and maybe because sqla needs you to use it for something in the inheritance mechanism in sqlalchemys BaseModel)

class UniqueRunIDBaseModel/Mixin(BaseModel):
    @declared_attr.directive
    def __table_args__(cls) -> tuple:
        return (
            # Mypy recognizes run__id as int, while it should be a MappedColumn[int]
            db.UniqueConstraint(cls.name, cls.run__id),  # type: ignore
        )

We also see that now the normal BaseModel is used once, the UniqueRunIDBaseModel twice and OptimizationDataBaseModel once (arguably, better just called OptimizationData). I understand OptimizationData will be used by more classes, and it is quite extensive. But for UniqueRunIDBaseModel is it really worth it?

And now that I notice it: Please keep the current casing conventions for abbreviations RunID should be RunId.

I have also thought about making these just classes, without inheritance from each other. Then we would have an OptimizationDataBaseModel, which e.g. optimization_table could inherit from in addition to its usual BaseModel, but this would simply be a Mixin without calling it such. So please tell me: what exactly did you have in mind for making these common base models?

I honestly don't know how to specify this further here. May be best to talk in person if there is still open questions after this?

Something in this configuration is still wrong, too, causing all tests to fail. The error message I receive locally says that I need to provide an inheriting condition between 'optimization_uniquerunidbasemodel' and inheriting table optimization_optimizationdatabasemodel', which I have not looked into further at the moment because I think I'm misunderstanding you.

We've had this before. I think __abstract__ = True is missing. Sqlachlemy is trying to make actual tables out of your models since they now inherit BaseModel.

At the moment, I have left the TimestampMixin untouched. For what we currently have -two columns that should always have the same type etc-, sqlalchemy's docs seem to recommend the solution we now have for Name and UniqueName.

I'm not sure if that link is wrong or I'm not seeing what you mean? This only tells us how to do what I proposed above with the type_annotation_map with just type hints. No mention of multiple columns? It also doesnt recommend anything, just tells us that it is possible. No preference from my side on which approach - type_annotation_map or this one, both are ok.

Alternatively, we could follow this approach -called 'Augmenting the Base'-, which does again look like a Mixin. We could also make the columns type annotation map entries and decide later (when we gain additional functionality) how to handle that, but this feels as good a time as any. Please let me know if you have any preference here.

Yes, as said, I think the audit/creation info is the only thing that kind of makes sense as a mixin. It is used everywhere in the code base, so it can go on the top level. It has some potential for encapsulation, these fields need to always be set together and that can be done with a method in the mixin, seperating concerns further from the repository class that set them before:


class RegionRepository(...):
    ...
    def add(self, name: str, hierarchy: str) -> Region:
        region = Region(name=name, hierarchy=hierarchy, **self.get_creation_info())
        self.session.add(region)
        return region

becomes:


class RegionRepository(...):
    ...
    def add(self, name: str, hierarchy: str) -> Region:
        region = Region(name=name, hierarchy=hierarchy)
        region.set_creation_info(self.backend.auth_context)
        self.session.add(region)
        return region

hth Alt Text

meksor commented 2 months ago

PS: More typing and type system concepts that might be interesting or relevant at your leisure with some grains of salt. https://en.wikipedia.org/wiki/Liskov_substitution_principle https://en.wikipedia.org/wiki/Algebraic_data_type

glatterf42 commented 2 months ago

Thanks for your patience and sharing your knowledge! I tried to address all your recommendations now and the tests are passing locally. Maybe to highlight some changes: