nikhilwoodruff opened 2 years ago
This would help us on the UK and US systems, where we'd like to use the metadata dictionary on variables in order to link them directly to PolicyEngine. Currently, I've just monkey-patched Core since it's a trivial change (here).
I've just got a working implementation and test case for this on my fork - if it's agreeable, let me know and I'll file a PR. Thanks!
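For context, a monkey-patch of that shape might look roughly like this. This is a hypothetical reconstruction, not the actual linked diff; it assumes Core rejects unknown class attributes inside `Variable.__init__`, which is where the `ValueError` reported further below originates:

```python
# Hypothetical sketch of such a monkey-patch (invented helper names;
# the real change behind the elided link may differ).
from openfisca_core import variables

_original_init = variables.Variable.__init__


def _patched_init(self, baseline_variable=None):
    # Hide the non-whitelisted attribute from Core's validation,
    # then re-attach it to the instantiated variable.
    metadata = self.__class__.__dict__.get("metadata", {})
    if "metadata" in self.__class__.__dict__:
        delattr(self.__class__, "metadata")
    _original_init(self, baseline_variable)
    self.metadata = metadata


variables.Variable.__init__ = _patched_init
```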
Hi @nikhilwoodruff , and thank you for this contribution proposal 😃 .
In terms of the feature itself, I'm at the same time happy to see @RamParameswaran proposed a pull request implementing it, and sad to see that I/we didn't give a timely review.
@nikhilwoodruff Does #983 cover the same use-case / feature-set?
Also, @RamParameswaran, I see you mention a remaining problem: the tight coupling with the country template for testing. Does that mean we have a "catch-22" problem (the feature can't be tested, at least not easily, before it is shipped)?
In terms of coherence, and I guess this applies both to this issue and to #983, @benjello @sandcha @MattiSG do you see any precaution to be attentive to?
Unlike parameters, whose metadata are discrete, compatibility with both proposals for variables seems to require metadata to be rather arbitrary (the use cases are different).

So I wonder: is `metadata` the correct attribute name, given it would be different from parameters'? Please do not hesitate to chime in!
Thanks @maukoquiroga - I hadn't noticed the already existing PR. #983 would cover the same use case here for us. On your questions, from our point of view:

- `metadata` is the best fit, but anything else would work; the important feature for us is having flexibility within country packages.
- `metadata` would be cleaner.
- `Parameter` provides a great amount of flexibility for our uses, so extending that to `Variable` would be great.

From my point of view, the point that needs to be fully defined before introducing the ability to add arbitrary metadata is how we normalise, accept, and expose it in shared codebases.
I find the possibility of exploring use cases of metadata very attractive, but without a clear path towards normalisation, I am afraid it too easily leads to a situation where each contributor group adds their own data to the shared model and where some tools would rely on additional, nonstandard metadata, thus making them creep into other tax and benefit systems without being normalised.
Of course, each tax and benefit system stays free to do what they want, but it is in my view crucial that Core and its tooling ecosystem minimise the risk of community fragmentation through forks. Arbitrary metadata, in my opinion, opens a high risk of de facto forks, where some tax and benefit systems, especially in their infancy when they most often rely on one specific actor, would come to depend strongly on some metadata.
#983, for example, stores structured legislative references in `metadata` rather than in the `reference` property, thus rendering Core tooling relying on `reference` useless and leading to Australian-specific vs rest-of-the-world versions… until other models want the same for themselves and we end up with fragmented communities. I would much rather see a joint effort on improving the `reference` property by using the ELI standard, for example.

While the two options (improving `reference` and adding arbitrary `metadata`) are not exclusive, and I would definitely see value in letting a tax and benefit system explore on its own before coming up with a solid proposal, I am worried that the convenience of arbitrary `metadata`, exposed in the Web API, would make it too easy to stay isolated instead of opening a discussion in Core, as the cost of maintaining a monkey-patching library does.
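To make the ELI suggestion concrete, the shift could look like this (illustrative URIs following the ELI/legislation.gov.uk patterns; the UK example reuses a reference string that appears later in this thread, and neither line is a normalised entry):

```python
# Today: free-form reference strings, one convention per TBS.
reference = "Income Tax (Trading and Other Income) Act 2005 s. 1(1)(a)"

# With a shared standard such as ELI: a machine-readable URI that tooling
# in any TBS can parse, e.g. the pattern used by legislation.gov.uk.
reference = "http://www.legislation.gov.uk/id/ukpga/2005/5/section/1"
```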
We had a glimpse of that already in France (see https://github.com/openfisca/openfisca-france/issues/1618 and https://github.com/openfisca/openfisca-france/issues/1672), and it takes a huge effort to normalise things after the fact. I'm not even done facilitating the process for France, and there will then be some RFC process to bring up to Core… We're probably talking about a total of 4-5 months to normalise 16 metadata entries.
I would prefer that we have a clear path towards normalisation of metadata through RFCs rather than adding support for arbitrary metadata. As a second choice, I believe we need to be very clear about drawing the line between:

- exploratory metadata, and
- vendor-specific metadata,

…and have a clear path towards normalisation for the exploratory ones and clear recommendations on vendor-specific metadata.
As a reference, the `metadata` entry in Parameters led to the addition of the following items on the largest TBS (France):

- `metadata.reference.href` and `metadata.reference.title`, as a replacement for leaf-level `reference` entries
- `documentation`
- `date_parution_jo`
- `threshold_unit`, `rate_unit` and `amount_unit`
- `description_en`
- `ux_name`
- `last_review`

We will see when RFC https://github.com/openfisca/openfisca-france/issues/1672 closes how many are normalised vs how many will have to be cleaned up, to assess how beneficial the exploration offered by the `metadata` node was to the TBS 🙂
Thanks for the insight into what's happening in France, @MattiSG.
So, if I understand correctly, metadata are being used somewhat surreptitiously to accomplish two things:
So in the #983 example, if we apply a "domain-driven" lens, what we're actually adding are attributes, not just data (adding an attribute amounts to adding a data model, or data schema, explicitly or implicitly). We could, for example, model it as:
```python
from dataclasses import dataclass
from typing import Any, Sequence

from openfisca_core.periods import Instant


@dataclass
class ReferenceValidity:
    commencement: Instant
    expiry: Instant

    def __post_init__(self) -> None:
        self.validate_contract()

    def validate_contract(self) -> None:
        if self.commencement > self.expiry:
            # fail loudly when the contract is violated
            raise ValueError("commencement must not be later than expiry")


@dataclass
class Reference:
    part: int
    schedule: str
    labels: Sequence[Any]  # actual metadata
    dates: ReferenceValidity
```
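Hypothetical usage of that model, to make the contract concrete (assuming, as in Core, that an `Instant` is built from a `(year, month, day)` tuple):

```python
validity = ReferenceValidity(
    commencement=Instant((2005, 4, 6)),
    expiry=Instant((2015, 4, 5)),
)

reference = Reference(
    part=5,
    schedule="34b",
    labels=["foo", "bar"],
    dates=validity,
)

# Swapping the two dates would violate the contract:
# ReferenceValidity(Instant((2015, 4, 5)), Instant((2005, 4, 6)))  # ValueError
```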
Taken under that lens (the domain one), at first glance, the problem may seem unsolvable:
In both cases, or in any additional one, I cannot but see the risk of overt or surreptitious "ontological forking" remain constant, hence the problem we're trying to deal with.
On the other hand, if we take a more "quack-driven", or exposure-driven, approach, we may for example relax 1. and at the same time make 2. stricter.
Concretely, decoupling domain models from data schemas: a `reference`, for example, can be modelled in any arbitrary way, containing any arbitrary number and kind of data, as long as its representation can be coerced to an opinionated, or agreed-upon, or arbitrary, but unique, data schema.

Even more concretely: what's going to be served and checked for by the Public API for `reference` is `ELI`, and however and with whatever you model a `reference`, it has to `quack` like data that can be coerced to `ELI`.
Extending my previous code example:
```python
import abc

import eli  # an invented schema validator

# (dataclass, Sequence, Any and ReferenceValidity are as defined above)


class SupportsELI(abc.ABC):  # an abstract class, just to make the point
    @abc.abstractmethod
    def to_json(self) -> str:
        ...  # has to be implemented by any metadata modelling a reference


@dataclass
class Reference(SupportsELI):
    part: int
    schedule: str
    labels: Sequence[Any]  # actual metadata
    dates: ReferenceValidity

    def to_json(self) -> str:
        return eli.represent(self)  # or something like that
```
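Under that contract, shared tooling such as the Web API would only need the agreed-upon method, regardless of how each TBS models its references. A sketch (the function name is invented):

```python
def serve_reference(reference: SupportsELI) -> str:
    # The API neither knows nor cares about the concrete model;
    # it relies only on the SupportsELI contract.
    return reference.to_json()
```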
In this second scenario, the discussion would shift to which shared "data schemas" we expose, while letting systems add models to their APIs more or less arbitrarily.
This seems like a middle ground between just improving `reference` and adding arbitrary `metadata`. And, of course, it does not address one of the problems, namely the experimentation/standardisation problem.
Just two scenarios off the top of my head; there are definitely more, maybe simpler ones.
Thanks @maukoquiroga for this higher-order rephrasing 🙂 I quite agree with your framing of the current situation and the abuses it can lead to.
The “convert to a shared representation” version (your scenario 2) is appealing in the freedom it gives to each TBS while ensuring shared tooling stays interoperable. If I understand correctly, that means switching from declarative attributes to agreeing upon an API: each Variable should then respond to getter methods instead of directly exposing attributes. These getter methods could by default (in Core) be simple attribute exposure, and every TBS could override them to compute the values from finer-grained attributes (a sketch follows below).
This could indeed be a solution: even though it will lead to some fragmentation (learning the specific attributes from each TBS), we could make sure that the differences are always declared in a consolidated way, just like the Entities are currently left to each TBS to define. This would also enable a higher-order standard that could be used for interoperability with other rules-as-code systems than OpenFisca.
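A minimal sketch of that getter-based contract, with invented names (`get_reference` is not an existing Core method):

```python
from typing import Any


class Variable:
    """Core default: the getter simply exposes the declared attribute."""

    reference: Any = None

    def get_reference(self) -> Any:
        return self.reference


class AustralianVariable(Variable):
    """A TBS overrides the getter to compute the shared representation
    from its own finer-grained attributes."""

    regulation_part = 5
    regulation_schedule = "34b"

    def get_reference(self) -> str:
        # Coerce country-specific fields into the agreed-upon shape.
        return f"Part {self.regulation_part}, Schedule {self.regulation_schedule}"
```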
I suggest we use this issue to consolidate use cases and learn from them, in order to assess how realistic the different options are.
For example, the metadata that #983 intended to expose covers Australian-specific legislative references, enabling structured representation such as:

```json
{
  "regulation_reference": {
    "part": 5,
    "clause": "A",
    "schedule": "34b",
    "notes": ["foo", "bar"]
  }
}
```

@RamParameswaran, could you share the use cases you had for consuming that structured data? 🙂
@nikhilwoodruff could you please share the metadata output you would like to expose in the Web API “to link [variables] directly to PolicyEngine” and how you would consume it? 🙂
Thanks for the discussion here, I can see the case for not wanting fragmentation. We're mainly using this to inform variable metadata in PolicyEngine, and the fields we need/are about to implement for that are:

- `min`: values are rejected if below this
- `max`: values are rejected if above this
- `soft_min`, `soft_max`: when using a slider to input this field, the lower and upper bounds
- `hidden`: this is not shown in PolicyEngine

There are actually a few other attributes, but I've just noticed they're redundant and we should use existing fields (`type` can be `unit`). Here's an example: `self_employment_income` is a variable defined in openfisca-uk:
```python
class self_employment_income(Variable):
    value_type = float
    entity = Person
    label = "Self-employment income"
    documentation = "Income from self-employment profits"
    definition_period = YEAR
    reference = "Income Tax (Trading and Other Income) Act 2005 s. 1(1)(a)"
    metadata = dict(
        policyengine=dict(
            roles=dict(
                adult=dict(
                    max=80000,
                ),
                child=dict(
                    hidden=True,
                ),
            )
        )
    )
```
On the PolicyEngine household input page, you can see that this is shown for the default adult, and if you add a child, it's not shown.[^1]
[^1]: Admittedly, this isn't very generalisable - openfisca-uk has two roles, adult and child, and they are the same for each group entity (family and household). We'd want, for example, the default age for an adult to be 18, and the default age for a child to be e.g. 10.
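For illustration, a client could consume that structure along these lines (a hypothetical sketch; `is_visible` is not PolicyEngine's actual code):

```python
def is_visible(variable, role: str) -> bool:
    # Read the vendor-namespaced metadata, falling back to "shown".
    role_settings = (
        getattr(variable, "metadata", {})
        .get("policyengine", {})
        .get("roles", {})
        .get(role, {})
    )
    return not role_settings.get("hidden", False)


# With the variable above: is_visible(self_employment_income, "adult") is True,
# while is_visible(self_employment_income, "child") is False.
```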
> If I understand correctly, that means switching from declarative attributes to agreeing upon an API: each Variable should then respond to getter methods instead of directly exposing attributes.

Yes. Getter methods are one solution. Another could be a standard variable schema to represent a `Variable`. Etc. But it is the same idea, yes.
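The "standard variable schema" alternative could be sketched like this (invented names, using a `TypedDict` as the agreed shape):

```python
from typing import TypedDict


class VariableSchema(TypedDict, total=False):
    # The shared representation every TBS must be able to produce,
    # however it models variables internally.
    label: str
    documentation: str
    definition_period: str
    reference: str  # e.g. an ELI URI
```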
Thank you very much @nikhilwoodruff for this detailed answer! 😃
As a case study, in my opinion:
- `min` and `max` are typically values that should be considered for normalisation, as they would represent legislative values. I am not entirely sure why we are staying away from them at this stage (@benjello, do you know?).
- `soft_min`, `soft_max` and `hidden` are typically values that should not be stored in OpenFisca, whether in `metadata` or not. They make sense to store there now because the provider of the client app (PolicyEngine) is the main (unique) contributor to the TBS, but this data does not represent policy: it is not part of the model of the legislation, it is UX choices for one specific application. If each vendor is allowed to store such data, collisions and conflicts will happen: what will distinguish PolicyEngine's `hidden` from the `hidden` instruction for another app?

I fully understand the convenience of having one single data and metadata source for reusers, but OpenFisca is about modelling legislation in a generic enough way that all sorts of reuses can be created. This is what makes it powerful enough to aggregate communities around a single model 🙂
We can add `min` and `max` values in some places. But some of them evolve with legislation, so it may be cumbersome to store them. Also, the sign may depend on convention (which might be unstable). That's why I never felt this subject was a priority.
@MattiSG : I agree with your comment on metadata, but we should find a way to ease downstream enrichment of metadata since it is a recurring need of many reusers.
By the way, here is an extract of https://github.com/openfisca/openfisca-core/pull/1034#issue-718117616 that I think could be pertinent here:
> (...) This PR introduces the notion of `schemas`. But, contrary to regular `schemas` used to validate, serialise, and deserialise data from/to models, these schemas are purely type-safe data objects, meant for type-checks, without impact at runtime.
>
> **Extended note on data-schemas**
>
> What if, instead of just using `schemas` as typed-dicts for type-checks, we used them at runtime to compulsorily:
>
> a. check types at runtime?
> b. validate data at runtime?
> c. serialise/deserialise data, as suggested in #1071?
> d. enforce contracts, or domain logic, both in terms of data ("has to be > 0") and representation ("`to_json()`")?
>
> **For**
>
> - Explicit declaration and documentation of the two most important data models of OpenFisca, today dynamic and implicit: `Formula` and `Test`.
> - Explicit declaration and documentation of interface contracts and data transmutation (this is already, partially, done with `OpenAPI`).
> - Complete separation of the actual model and representational logic: extracting JSON logic from TaxBenefitSystem, Holder, and so on!
> - Composability: once all data and logic are split, and data and representation go through explicit schemas, OpenFisca would become fully modular. That is, users could use drop-in replacements tailored to their needs, as long as those replacements respect the interface contracts: data in MessagePack, models in Rust, a calculation engine in Node.js, etc.
>
> **Against**
>
> Performance: whereas declarative data transmutation can even increase performance for individual situations (Web API), a naive or out-of-the-box implementation (or anything with a complexity of at least `O(n)`) will certainly have a very penalising impact on performance for large population simulations.
>
> This is due to several reasons:
>
> - Economists already validate data before running a simulation (@benjello, can you confirm?).
> - `numpy` already provides data type-casting, with configurable levels of safety, C-optimised.
> - Others that I cannot think of right now 🤣
>
> Possible workarounds:
>
> - make it opt-in with a flag, for example an optional Web API flag `--safe` or `--strict`
> - implement validations, serialisation, etc., vectorially from the outset: literally extending the numpy C-API and vendoring C-extensions, pre-compiled or not
@maukoquiroga: yes, we try to validate data values before injection. I also confirm that the performance of OpenFisca relies on vector computation, so if we add checks that are not vectorizable, it will put a huge penalty on performance.
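To illustrate the vectorisation point, a toy sketch (assuming only numpy; the variable names are invented):

```python
import numpy as np

incomes = np.random.default_rng(0).uniform(0, 100_000, size=1_000_000)

# Vectorised check: a single C-level pass over the whole population,
# instead of a Python-level loop over each individual.
if (incomes < 0).any():
    raise ValueError("negative incomes found")

# numpy's configurable casting-safety levels, as mentioned above:
incomes.astype(np.int64, casting="unsafe")   # allowed, truncates
# incomes.astype(np.int64, casting="safe")   # would raise a TypeError
```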
Hi there!
I really enjoy OpenFisca, but I recently encountered an issue.
Here is what I did:
Added a `metadata: dict` attribute to a `Variable` class.

Here is what I expected to happen:

The attribute to be silently accepted.

Here is what actually happened:

A `ValueError` was thrown, specifying that `metadata` is not in the allowed attribute list.

Context
I identify more as a: