test(entities): improve docs, doctests, and typing

bonjourmauko commented 3 years ago

Part of #1061 Superseded by #1220 Superseded by #1252 Superseded by #1255

Documentation

Completes the documentation of the entities module

Doctests

Completes doctest coverage of the entities module

Typing

Completes type annotations of the entities module
Consolidates types into typing, under three categories:
- types: for data-types, which are subsequently split into two sub-categories:
- XType: an actual data-type —for example an array
- XLike: an object where the actual data-type is irrelevant, but that can be coerced to X.
- protocols: for behaviours —think interfaces but for the sole purpose of static type-checking
- schemas: type-safe data-objects —mostly for dictionaries

Deprecations

Entity.set_tax_benefit_system: now provided by a standard property setter.
Entity.check_role_validity: moved to helpers as it had nothing to do with entities

New features

Renders the relationship entity 1 - * variable declarative, thanks to the _VariableProxy descriptor
Adds dataclass and slots for reduced boilerplate and improved performance

Technical changes

In GroupEntity, replaces dynamic role attributes assignment for an explicit caching mechanism
- Allows for slots and thus the performance gains
- Fixes arbitrary mutability by making attributes declarative

Notes

This PR makes an architectural choice of strong domain model / logic separation, following what has already begun with for example the split of Entity and Population, Simulation and SimulationBuilder, and so on. Although the changset is backward-compatible, future contributions enforcing this choice may introduce breaking changes.
This PR introduces the domain-driven design notion of adapter, proxy, port, or repo: beyond the pure relationship between models, the behavioural logic between them is extracted to specialised models —in this case, querying variables is extracted from Entity to a _VariableProxy. This is just a subset of 1.
This PR consolidates the notion of protocols, or duck-typing, or structural sub-typing: that means type-checks are done not against the actual implementation of a model, but against a protocol, that is, the equivalent of an interface, or an abstract model, but without impact at runtime.
Finally, and more importantly, this PR introduces the notion of schemas. But, contrary to regular schemas used to validate, serialise, and deserialise data from/to models, these schemas are purely type-safe data objects, meant for type-checks, without impact at runtime.

Extended note on data-schemas

What if, instead of just using schemas as typed-dicts for type-checks, we used them at runtime to compulsorily:

a. check types at runtime? b. validate data at runtime? c. serialise/deserialise data —as suggested in #1071 ? d. enforce contracts, or domain logic —both in terms of data ("has to be >0") and representation ("tojson()")?

For

Explicit declaration and documentation of the two most important data models of OpenFisca, today dynamic and implicit: Formula and Test.
Explicit declaration and documentation of interface contracts and data transmutation —this is already, partially, done with OpenAPI
Complete separation of actual model and representational logic —extracting json logic from TaxBenefitSystem, Holder, and so on!
Composability: once all data/logic are split, and data/representation through explicit schemas, then OpenFisca would become fully-modular. That is, users can use drop-in replacements tailored to their needs, as long as those replacements respect the interface contracts: data in messagepack, models in rust, a calculation engine nodejs, etc.

Against

Performance: whereas declarative data transmutation can even increase performance for individual situations —WebAPI—, a naive or out-of-the-box implementation (or anything with a complexity of at least O(n)) will certainly have a very penalising impact on performance for large population simulations.

This due to several reasons:

Economists already validate data before running a simulation — @benjello can you confirm?
numpy already provides data type-casting, with configurable levels of safety, and C-optimised
Other that I cannot think of right now 🤣

Possible workarounds:

make it opt-in with a flag —for example an optional WebAPI flag --safe or --strict
implement validations, serialisation, etc., vectorially from the outset —literally extending the numpy C-API and vendoring C-extensions, pre-compiled or not

MattiSG commented 3 years ago

Thanks @maukoquiroga.

This is marked as depending on #1033 but I see redundant changes, for example in the config file.

I would be in favour of having a single PR as they seem to address the same issue (https://github.com/openfisca/openfisca-doc/issues/244) and to improve documentation altogether. If you'd prefer not to for some reason, could we at least update the target branch to #1033? 🙂 The current changesets seem likely to yield conflicts.

bonjourmauko commented 3 years ago

This is marked as depending on #1033 but I see redundant changes, for example in the config file.

Corrected it ~to depend on #1038~ to depend on both #1038 and #1033, and changed base for clarity as suggested ~, the redundant changes are effectively just in the setup and config files, the rest has to be properly rebased as they're independent from #1033 (I will do shortly)~.

If you'd prefer not to for some reason, could we at least update the target branch to #1033? 🙂 The current changesets seem likely to yield conflicts.

I will rebase properly, that should get us rid of the conflicts 😃

I would be in favour of having a single PR as they seem to address the same issue (openfisca/openfisca-doc#244) and to improve documentation altogether.

I thought so initially as well, and I tried, then I decided to split this work by "submodule" and add doc as well as fixing the doctests. Here's what happened:

I tried to fix all doctests at once, which I couldn't for several reasons:
- Some parts of the code are just un-unit-testable without actual refactoring (creating a Holder for example involves almost half of the codebase, including the Holder itself…)
- Then I tried decoupling some classes to use dependency injection —even if that means breaking changes— as so to make them more granular an unit-testable, but the actual test suite do not cover all scenarios, catch-22…
- In a last effort to try to understand what I was doing I turned myself over to the documentation, which is excellent already, but still incomplete and outdated, because not checked regularly and automatically, catch-22…
Then I adopted the current attack strategy so to:
- Fix/complete the doctests of the more independent/uncoupled modules, to avoid (yet) refactoring
- Fix/complete the documentation to discover refactoring opportunities (the todos)
- To set a standard that can be applied partially but progressively over time
- To increase the confidence on the tests enough as to do the needed refactoring and breaking changes

I think we probably won't be able to fix all the doctests soon, that's why I changed from an horizontal to a vertical (submodules) approach.

Happy to have an extended discussion on that with the community if you think that could be useful!

bonjourmauko commented 3 years ago

@benjello I'd like to have your opinion on this one.

There is no breaking-change, but a proposal of a more explicit interface between entities and variables.

However, there are still three things that I don't understand:

Why do we need entities to know about variables? For you it is part of the domain model or just a shortcut for convenience? (BTW I just realised that there are also populations and group populations, but I don't yet know what they are for) ;

Anyway, I tried to make the one-to-many relationship more explicit by introducing Entity.variables, as there is already Variable.entity.

Same for group entities and roles: if I understood correctly, there can be as many roles a there are entities inside a group entity, but that relationship is implicit as there is not actual relationship between entities and group entities.
Finally, there are the famous sub-roles, which are just plain roles, added recursively to a group entity, as group_entity.SUB_ROLE, instead of for example `group_entity.roles.get("FIRST_PARENT"). Why?

benjello commented 3 years ago

@maukoquiroga:

variables belongs to an entity (they are a characteristic of an entity). Entities help you describe the model. Populations are the actual "holders" of the data.
I do not understand what do you mean by entities inside a group entity. There is an atomic entity (persons) and a group entity can have many person and as meany roles as there are different persons in those entities.
IIRC sub_roles are convenient for initialisation but also here for legacy purpose. Sorry I do not recall well where they are still used (taking a look at France may give you hints).

bonjourmauko commented 3 years ago

@maukoquiroga:

variables belongs to an entity (they are a characteristic of an entity). Entities help you describe the model. Populations are the actual "holders" of the data.

Great! So:

Entity 1 - * Population (so for example Alicia is a population, a concrete (data holder) member of a Person).
Entity 1 - * Variable (a person can have a salary, pay a tax and not another, be born) [the abstract rule]
Population - Variable (Alicia, Jorge, ... have x salary, pay the y tax, were born at specific instant).

Is that it?

I do not understand what do you mean by entities inside a group entity. There is an atomic entity (persons) and a group entity can have many person and as meany roles as there are different persons in those entities.

Ah OK so roles actually relate to concrete data holders (Person does not have a role, Alicia has).

IIRC sub_roles are convenient for initialisation but also here for legacy purpose. Sorry I do not recall well where they are still used (taking a look at France may give you hints).

Sure.

bonjourmauko commented 3 years ago

This https://github.com/openfisca/openfisca-core/pull/857

benjello commented 3 years ago

variables belongs to an entity (they are a characteristic of an entity). Entities help you describe the model. Populations are the actual "holders" of the data.

Great! So:

Yep

Entity 1 - * Population (so for example Alicia is a population, a concrete (data holder) member of a Person).

Entity 1 - * Variable (a person can have a salary, pay a tax and not another, be born) [the abstract rule]

Population - Variable (Alicia, Jorge, ... have x salary, pay the y tax, were born at specific instant).

Is that it?

Yes popualtions are the "simulation" side of the "model" (tax-benefit-system).

I do not understand what do you mean by entities inside a group entity. There is an atomic entity (persons) and a group entity can have many person and as meany roles as there are different persons in those entities.

Ah OK so roles actually relate to concrete data holders (Person does not have a role, Alicia has).

But Household do have roles, household_head, child etc ? Roles are part of of the structure of a GroupEntity. And thus every Person population has a role.

bonjourmauko commented 3 years ago

@benjello So conceptually, the Population is the actual Holder of the data (concrete Entity). Holder then is just a caching system, or should be, isn't it?

benjello commented 3 years ago

@benjello So conceptually, the Population is the actual Holder of the data (concrete Entity). Holder then is just a caching system, or should be, isn't it?

@maukoquiroga Holderis more the concrete Variable than the concrete Entity. And thus is definitively a caching system.

bonjourmauko commented 3 years ago

@MattiSG @sandcha @benjello ?

benjello commented 2 years ago

I am in favor of more typing, more doctest but I am ,ot sure I can grasp all the implications of the changes ...

bonjourmauko commented 2 years ago

The added documentation is great! However, I don't understand the consequences of introducing an abstract base class system

Thanks @MattiSG !

Concerning the abstract class system, it is not one. As I've described it:

This PR consolidates the notion of protocols, or duck-typing, or structural sub-typing: that means type-checks are done not against the actual implementation of a model, but against a protocol, that is, the equivalent of an interface, or an abstract model, but without impact at runtime.

Structural sub-typing is described in PEP 544.

Quoting it:

At runtime, protocol classes will be simple ABCs. There is no intent to provide sophisticated runtime instance and class checks against protocol classes. (...) Protocols are completely optional:

No runtime semantics will be imposed for variables or parameters annotated with a protocol class.

Any checks will be performed only by third-party type checkers and other tools.

In other words, there are no consequences beyond type-checks —no runtime impact then.

To make the point, an Entity is to an EntityProtocol the same a List is to a Sequence: instead of type-checking for implementation, we type-check for behaviour (that is why it is nicknames "duck-typing").

This has, however, one major benefit: we remove circular module dependencies from the codebase, which has an actual positive impact in terms of modularity and testability.

and this “variable proxy” system, so I cannot approve that PR 😕

That is just an abstraction, or in DDD, a port.

As I see it has two impacts:

It adds explicitness —we encapsulate the relationship between entities and variables and make it explicit
It adds modularity —again it helps decoupling models and making a clear distinction between data/logic

Beyond the "design" aspect of it, the actual recipe, the descriptor, is just standard Python —a huge deal of the function system, properties, and so on, are just syntactic sugar; behind-the-scenes they're just descriptors.

Finally, as you can see from the changeset, the introduced syntax is more coherent with the projector system used everywhere in OpenFisca, which should, sooner rather than later, IMHO, be reimplemented as a descriptor as well.

bonjourmauko commented 1 month ago

Superseded by #1255

openfisca / openfisca-core