openfisca / openfisca-core

OpenFisca core engine. See other repositories for countries-specific code & data.
https://openfisca.org
GNU Affero General Public License v3.0
171 stars 75 forks source link

test(entities): improve docs, doctests, and typing #1034

Closed bonjourmauko closed 1 month ago

bonjourmauko commented 3 years ago

Part of #1061 Superseded by #1220 Superseded by #1252 Superseded by #1255

Documentation

Doctests

Typing

Deprecations

New features

Technical changes

Notes

  1. This PR makes an architectural choice of strong domain model / logic separation, following what has already begun with for example the split of Entity and Population, Simulation and SimulationBuilder, and so on. Although the changset is backward-compatible, future contributions enforcing this choice may introduce breaking changes.
  2. This PR introduces the domain-driven design notion of adapter, proxy, port, or repo: beyond the pure relationship between models, the behavioural logic between them is extracted to specialised models —in this case, querying variables is extracted from Entity to a _VariableProxy. This is just a subset of 1.
  3. This PR consolidates the notion of protocols, or duck-typing, or structural sub-typing: that means type-checks are done not against the actual implementation of a model, but against a protocol, that is, the equivalent of an interface, or an abstract model, but without impact at runtime.
  4. Finally, and more importantly, this PR introduces the notion of schemas. But, contrary to regular schemas used to validate, serialise, and deserialise data from/to models, these schemas are purely type-safe data objects, meant for type-checks, without impact at runtime.

Extended note on data-schemas

What if, instead of just using schemas as typed-dicts for type-checks, we used them at runtime to compulsorily:

a. check types at runtime? b. validate data at runtime? c. serialise/deserialise data —as suggested in #1071 ? d. enforce contracts, or domain logic —both in terms of data ("has to be >0") and representation ("tojson()")?

For
Against

Performance: whereas declarative data transmutation can even increase performance for individual situations —WebAPI—, a naive or out-of-the-box implementation (or anything with a complexity of at least O(n)) will certainly have a very penalising impact on performance for large population simulations.

This due to several reasons:

Possible workarounds:

MattiSG commented 3 years ago

Thanks @maukoquiroga.

This is marked as depending on #1033 but I see redundant changes, for example in the config file.

I would be in favour of having a single PR as they seem to address the same issue (https://github.com/openfisca/openfisca-doc/issues/244) and to improve documentation altogether. If you'd prefer not to for some reason, could we at least update the target branch to #1033? 🙂 The current changesets seem likely to yield conflicts.

bonjourmauko commented 3 years ago

This is marked as depending on #1033 but I see redundant changes, for example in the config file.

Corrected it ~to depend on #1038~ to depend on both #1038 and #1033, and changed base for clarity as suggested ~, the redundant changes are effectively just in the setup and config files, the rest has to be properly rebased as they're independent from #1033 (I will do shortly)~.

If you'd prefer not to for some reason, could we at least update the target branch to #1033? 🙂 The current changesets seem likely to yield conflicts.

I will rebase properly, that should get us rid of the conflicts 😃

I would be in favour of having a single PR as they seem to address the same issue (openfisca/openfisca-doc#244) and to improve documentation altogether.

I thought so initially as well, and I tried, then I decided to split this work by "submodule" and add doc as well as fixing the doctests. Here's what happened:

I think we probably won't be able to fix all the doctests soon, that's why I changed from an horizontal to a vertical (submodules) approach.

Happy to have an extended discussion on that with the community if you think that could be useful!

bonjourmauko commented 3 years ago

@benjello I'd like to have your opinion on this one.

There is no breaking-change, but a proposal of a more explicit interface between entities and variables.

However, there are still three things that I don't understand:

  1. Why do we need entities to know about variables? For you it is part of the domain model or just a shortcut for convenience? (BTW I just realised that there are also populations and group populations, but I don't yet know what they are for) ;

Anyway, I tried to make the one-to-many relationship more explicit by introducing Entity.variables, as there is already Variable.entity.

  1. Same for group entities and roles: if I understood correctly, there can be as many roles a there are entities inside a group entity, but that relationship is implicit as there is not actual relationship between entities and group entities.

  2. Finally, there are the famous sub-roles, which are just plain roles, added recursively to a group entity, as group_entity.SUB_ROLE, instead of for example `group_entity.roles.get("FIRST_PARENT"). Why?

benjello commented 3 years ago

@maukoquiroga:

  1. variables belongs to an entity (they are a characteristic of an entity). Entities help you describe the model. Populations are the actual "holders" of the data.

  2. I do not understand what do you mean by entities inside a group entity. There is an atomic entity (persons) and a group entity can have many person and as meany roles as there are different persons in those entities.

  3. IIRC sub_roles are convenient for initialisation but also here for legacy purpose. Sorry I do not recall well where they are still used (taking a look at France may give you hints).

bonjourmauko commented 3 years ago

@maukoquiroga:

  1. variables belongs to an entity (they are a characteristic of an entity). Entities help you describe the model. Populations are the actual "holders" of the data.

Great! So:

Is that it?

  1. I do not understand what do you mean by entities inside a group entity. There is an atomic entity (persons) and a group entity can have many person and as meany roles as there are different persons in those entities.

Ah OK so roles actually relate to concrete data holders (Person does not have a role, Alicia has).

  1. IIRC sub_roles are convenient for initialisation but also here for legacy purpose. Sorry I do not recall well where they are still used (taking a look at France may give you hints).

Sure.

bonjourmauko commented 3 years ago

This https://github.com/openfisca/openfisca-core/pull/857

benjello commented 3 years ago
  1. variables belongs to an entity (they are a characteristic of an entity). Entities help you describe the model. Populations are the actual "holders" of the data.

Great! So:

Yep

  • Entity 1 - * Population (so for example Alicia is a population, a concrete (data holder) member of a Person).
  • Entity 1 - * Variable (a person can have a salary, pay a tax and not another, be born) [the abstract rule]
  • Population - Variable (Alicia, Jorge, ... have x salary, pay the y tax, were born at specific instant).

Is that it?

Yes popualtions are the "simulation" side of the "model" (tax-benefit-system).

  1. I do not understand what do you mean by entities inside a group entity. There is an atomic entity (persons) and a group entity can have many person and as meany roles as there are different persons in those entities.

Ah OK so roles actually relate to concrete data holders (Person does not have a role, Alicia has).

But Household do have roles, household_head, child etc ? Roles are part of of the structure of a GroupEntity. And thus every Person population has a role.

bonjourmauko commented 3 years ago

@benjello So conceptually, the Population is the actual Holder of the data (concrete Entity). Holder then is just a caching system, or should be, isn't it?

benjello commented 3 years ago

@benjello So conceptually, the Population is the actual Holder of the data (concrete Entity). Holder then is just a caching system, or should be, isn't it?

@maukoquiroga Holderis more the concrete Variable than the concrete Entity. And thus is definitively a caching system.

bonjourmauko commented 3 years ago

@MattiSG @sandcha @benjello ?

benjello commented 2 years ago

I am in favor of more typing, more doctest but I am ,ot sure I can grasp all the implications of the changes ...

bonjourmauko commented 2 years ago

The added documentation is great! However, I don't understand the consequences of introducing an abstract base class system

Thanks @MattiSG !

Concerning the abstract class system, it is not one. As I've described it:

This PR consolidates the notion of protocols, or duck-typing, or structural sub-typing: that means type-checks are done not against the actual implementation of a model, but against a protocol, that is, the equivalent of an interface, or an abstract model, but without impact at runtime.

Structural sub-typing is described in PEP 544.

Quoting it:

At runtime, protocol classes will be simple ABCs. There is no intent to provide sophisticated runtime instance and class checks against protocol classes. (...) Protocols are completely optional:

  • No runtime semantics will be imposed for variables or parameters annotated with a protocol class.
  • Any checks will be performed only by third-party type checkers and other tools.

In other words, there are no consequences beyond type-checks —no runtime impact then.

To make the point, an Entity is to an EntityProtocol the same a List is to a Sequence: instead of type-checking for implementation, we type-check for behaviour (that is why it is nicknames "duck-typing").

This has, however, one major benefit: we remove circular module dependencies from the codebase, which has an actual positive impact in terms of modularity and testability.

and this “variable proxy” system, so I cannot approve that PR 😕

That is just an abstraction, or in DDD, a port.

As I see it has two impacts:

Beyond the "design" aspect of it, the actual recipe, the descriptor, is just standard Python —a huge deal of the function system, properties, and so on, are just syntactic sugar; behind-the-scenes they're just descriptors.

Finally, as you can see from the changeset, the introduced syntax is more coherent with the projector system used everywhere in OpenFisca, which should, sooner rather than later, IMHO, be reimplemented as a descriptor as well.

bonjourmauko commented 1 month ago

Superseded by #1255