nikhilwoodruff opened 2 years ago
This would help us on the UK and US systems, where we'd like to use the metadata dictionary on variables in order to link them directly to PolicyEngine. Currently, I've just monkey-patched Core since it's a trivial change (here).
I've just got a working implementation and test case for this on my fork - if it's agreeable, let me know and I'll file a PR. Thanks!
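For context, a monkey-patch of that shape might look roughly like this. This is a hypothetical reconstruction, not the actual linked diff; it assumes Core rejects unknown class attributes inside `Variable.__init__`, which is where the `ValueError` reported further below originates:

```python
# Hypothetical sketch of such a monkey-patch (invented helper names;
# the real change behind the elided link may differ).
from openfisca_core import variables

_original_init = variables.Variable.__init__


def _patched_init(self, baseline_variable=None):
    # Hide the non-whitelisted attribute from Core's validation,
    # then re-attach it to the instantiated variable.
    metadata = self.__class__.__dict__.get("metadata", {})
    if "metadata" in self.__class__.__dict__:
        delattr(self.__class__, "metadata")
    _original_init(self, baseline_variable)
    self.metadata = metadata


variables.Variable.__init__ = _patched_init
```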
Hi @nikhilwoodruff , and thank you for this contribution proposal 😃 .
In terms of the feature itself, I'm at the same time happy to see @RamParameswaran proposed a pull request implementing it, and sad to see that I/we didn't give a timely review.
@nikhilwoodruff Does #983 cover the same use-case / feature-set?
Also, @RamParameswaran, I see you mention a remaining problem: the tight coupling with the country template for testing. Does that mean we have a "catch-22" problem (the feature can't be tested, at least not easily, before it is shipped)?
In terms of coherence, and I guess this applies both to this issue and to #983, @benjello @sandcha @MattiSG do you see any precaution to be attentive to?
Unlike parameters, whose metadata are discrete, compatibility with both proposals for variables seems to require metadata to be rather arbitrary (the use cases are different).

So I wonder: is `metadata` the correct attribute name, given it would be different from parameters'? Please do not hesitate to chime in!
Thanks @maukoquiroga - I hadn't noticed the already existing PR. #983 would cover the same use case here for us. On your questions, from our point of view:

- `metadata` is the best fit, but anything else would work; the important feature for us is having flexibility within country packages.
- `metadata` would be cleaner.
- `Parameter` provides a great amount of flexibility for our uses, so extending that to `Variable` would be great.

From my point of view, the point that needs to be fully defined before introducing the ability to add arbitrary metadata is how we normalise, accept, and expose it in shared codebases.
I find the possibility of exploring use cases of metadata very attractive, but without a clear path towards normalisation, I am afraid it too easily leads to a situation where each contributor group adds their own data to the shared model and where some tools would rely on additional, nonstandard metadata, thus making them creep into other tax and benefit systems without being normalised.
Of course, each tax and benefit system stays free to do what they want, but it is in my view crucial that Core and its tooling ecosystem minimise the risk of community fragmentation through forks. Arbitrary metadata, in my opinion, opens a high risk of de facto forks, where some tax and benefit systems, especially in their infancy when they most often rely on one specific actor, would come to depend strongly on some metadata.
#983, for example, stores structured legislative references in `metadata` rather than in the `reference` property, thus rendering Core tooling relying on `reference` useless and leading to Australian-specific vs rest-of-the-world versions… until other models want the same for themselves and we end up with fragmented communities. I would much rather see a joint effort on improving the `reference` property by using the ELI standard, for example.

While the two options (improving `reference` and adding arbitrary `metadata`) are not exclusive, and I would definitely see value in letting a tax and benefit system explore on its own before coming up with a solid proposal, I am worried that the convenience of arbitrary `metadata`, exposed in the Web API, would make it too easy to stay isolated instead of opening a discussion in Core, as the cost of maintaining a monkey-patching library does.
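To make the ELI suggestion concrete, the shift could look like this (illustrative URIs following the ELI/legislation.gov.uk patterns; the UK example reuses a reference string that appears later in this thread, and neither line is a normalised entry):

```python
# Today: free-form reference strings, one convention per TBS.
reference = "Income Tax (Trading and Other Income) Act 2005 s. 1(1)(a)"

# With a shared standard such as ELI: a machine-readable URI that tooling
# in any TBS can parse, e.g. the pattern used by legislation.gov.uk.
reference = "http://www.legislation.gov.uk/id/ukpga/2005/5/section/1"
```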
We had a glimpse of that already in France (see https://github.com/openfisca/openfisca-france/issues/1618 and https://github.com/openfisca/openfisca-france/issues/1672), and it takes a huge effort to normalise things after the fact. I'm not even done facilitating the process for France, and there will then be some RFC process to bring up to Core… We're probably talking about a total of 4-5 months to normalise 16 metadata entries.
I would prefer that we have a clear path towards normalisation of metadata through RFCs rather than adding support for arbitrary metadata. As a second choice, I believe we need to be very clear about drawing the line between:

- exploratory metadata, and
- vendor-specific metadata,

…and have a clear path towards normalisation for the exploratory ones and clear recommendations on vendor-specific metadata.
As a reference, the `metadata` entry in Parameters led to the addition of the following items on the largest TBS (France):

- `metadata.reference.href` and `metadata.reference.title`, as a replacement for leaf-level `reference` entries
- `documentation`
- `date_parution_jo`
- `threshold_unit`, `rate_unit` and `amount_unit`
- `description_en`
- `ux_name`
- `last_review`

We will see when RFC https://github.com/openfisca/openfisca-france/issues/1672 closes how many are normalised vs how many will have to be cleaned up, to assess how beneficial the exploration offered by the `metadata` node was to the TBS 🙂
Thanks for the insight into what's happening in France, @MattiSG.
So, if I understand correctly, metadata are being used somewhat surreptitiously to accomplish two things:
So in the #983 example, if we apply a "domain-driven" lens, what we're actually adding are attributes, not just data (adding an attribute amounts to adding a data model, or data schema, explicitly or implicitly). We could, for example, model it as:
```python
from dataclasses import dataclass
from typing import Any, Sequence

from openfisca_core.periods import Instant


@dataclass
class ReferenceValidity:
    commencement: Instant
    expiry: Instant

    def __post_init__(self) -> None:
        self.validate_contract()

    def validate_contract(self) -> None:
        if self.commencement > self.expiry:
            # fail loudly when the contract is violated
            raise ValueError("commencement must not be later than expiry")


@dataclass
class Reference:
    part: int
    schedule: str
    labels: Sequence[Any]  # actual metadata
    dates: ReferenceValidity
```
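Hypothetical usage of that model, to make the contract concrete (assuming, as in Core, that an `Instant` is built from a `(year, month, day)` tuple):

```python
validity = ReferenceValidity(
    commencement=Instant((2005, 4, 6)),
    expiry=Instant((2015, 4, 5)),
)

reference = Reference(
    part=5,
    schedule="34b",
    labels=["foo", "bar"],
    dates=validity,
)

# Swapping the two dates would violate the contract:
# ReferenceValidity(Instant((2015, 4, 5)), Instant((2005, 4, 6)))  # ValueError
```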
Taken under that lens (the domain one), at first glance, the problem may seem unsolvable:
In both cases, or in any additional one, I cannot but see the risk of overt or surreptitious "ontological forking" remain constant, hence the problem we're trying to deal with.
On the other hand, if we take a more "quack-driven", or exposure-driven, approach, we may for example relax 1. and at the same time make 2. stricter.
Concretely, decoupling domain models from data schemas: a `reference`, for example, can be modelled in any arbitrary way, containing any arbitrary number and kind of data, as long as its representation can be coerced to an opinionated, or agreed-upon, or arbitrary, but unique, data schema.

Even more concretely: what's going to be served and checked for by the Public API for `reference` is `ELI`, and however and with whatever you model a `reference`, it has to `quack` like data that can be coerced to `ELI`.
Extending my previous code example:
```python
import abc

import eli  # an invented schema validator

# (dataclass, Sequence, Any and ReferenceValidity are as defined above)


class SupportsELI(abc.ABC):  # an abstract class, just to make the point
    @abc.abstractmethod
    def to_json(self) -> str:
        ...  # has to be implemented by any metadata modelling a reference


@dataclass
class Reference(SupportsELI):
    part: int
    schedule: str
    labels: Sequence[Any]  # actual metadata
    dates: ReferenceValidity

    def to_json(self) -> str:
        return eli.represent(self)  # or something like that
```
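Under that contract, shared tooling such as the Web API would only need the agreed-upon method, regardless of how each TBS models its references. A sketch (the function name is invented):

```python
def serve_reference(reference: SupportsELI) -> str:
    # The API neither knows nor cares about the concrete model;
    # it relies only on the SupportsELI contract.
    return reference.to_json()
```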
In this second scenario, the discussion would shift to which shared "data schemas" we expose, while letting systems add models to their APIs more or less arbitrarily.
This seems like a middle ground between just improving `reference` and adding arbitrary `metadata`. And, of course, it does not address one of the problems, namely the experimentation/standardisation problem.
Just two scenarios off the top of my head; there are definitely more, maybe simpler ones.
Thanks @maukoquiroga for this higher-order rephrasing 🙂 I quite agree with your framing of the current situation and the abuses it can lead to.
The “convert to a shared representation” version (your scenario 2) is appealing in the freedom it gives to each TBS while ensuring shared tooling stays interoperable. If I understand correctly, that means switching from declarative attributes to agreeing upon an API: each Variable should then respond to getter methods instead of directly exposing attributes. These getter methods could by default (in Core) be simple attribute exposure, and every TBS could override them to compute the values from finer-grained attributes (a sketch follows below).
This could indeed be a solution: even though it will lead to some fragmentation (learning the specific attributes from each TBS), we could make sure that the differences are always declared in a consolidated way, just like the Entities are currently left to each TBS to define. This would also enable a higher-order standard that could be used for interoperability with other rules-as-code systems than OpenFisca.
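A minimal sketch of that getter-based contract, with invented names (`get_reference` is not an existing Core method):

```python
from typing import Any


class Variable:
    """Core default: the getter simply exposes the declared attribute."""

    reference: Any = None

    def get_reference(self) -> Any:
        return self.reference


class AustralianVariable(Variable):
    """A TBS overrides the getter to compute the shared representation
    from its own finer-grained attributes."""

    regulation_part = 5
    regulation_schedule = "34b"

    def get_reference(self) -> str:
        # Coerce country-specific fields into the agreed-upon shape.
        return f"Part {self.regulation_part}, Schedule {self.regulation_schedule}"
```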
I suggest we use this issue to consolidate use cases and learn from them, in order to assess how realistic the different options are.
For example, the metadata that #983 intended to expose covers Australian-specific legislative references, enabling structured representation such as:

```json
{
  "regulation_reference": {
    "part": 5,
    "clause": "A",
    "schedule": "34b",
    "notes": ["foo", "bar"]
  }
}
```

@RamParameswaran, could you share the use cases you had for consuming that structured data? 🙂
@nikhilwoodruff could you please share the metadata output you would like to expose in the Web API “to link [variables] directly to PolicyEngine” and how you would consume it? 🙂
Thanks for the discussion here, I can see the case for not wanting fragmentation. We're mainly using this to inform variable metadata in PolicyEngine, and the fields we need/are about to implement for that are:

- `min`: values are rejected if below this
- `max`: values are rejected if above this
- `soft_min`, `soft_max`: when using a slider to input this field, the lower and upper bounds
- `hidden`: this is not shown in PolicyEngine

There are actually a few other attributes, but I've just noticed they're redundant and we should use existing fields (`type` can be `unit`). Here's an example: `self_employment_income` is a variable defined in openfisca-uk:
```python
class self_employment_income(Variable):
    value_type = float
    entity = Person
    label = "Self-employment income"
    documentation = "Income from self-employment profits"
    definition_period = YEAR
    reference = "Income Tax (Trading and Other Income) Act 2005 s. 1(1)(a)"
    metadata = dict(
        policyengine=dict(
            roles=dict(
                adult=dict(
                    max=80000,
                ),
                child=dict(
                    hidden=True,
                ),
            )
        )
    )
```
On the PolicyEngine household input page, you can see that this is shown for the default adult, and if you add a child, it's not shown.[^1]
[^1]: Admittedly, this isn't very generalisable - openfisca-uk has two roles, adult and child, and they are the same for each group entity (family and household). We'd want, for example, the default age for an adult to be 18, and the default age for a child to be e.g. 10.
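For illustration, a client could consume that structure along these lines (a hypothetical sketch; `is_visible` is not PolicyEngine's actual code):

```python
def is_visible(variable, role: str) -> bool:
    # Read the vendor-namespaced metadata, falling back to "shown".
    role_settings = (
        getattr(variable, "metadata", {})
        .get("policyengine", {})
        .get("roles", {})
        .get(role, {})
    )
    return not role_settings.get("hidden", False)


# With the variable above: is_visible(self_employment_income, "adult") is True,
# while is_visible(self_employment_income, "child") is False.
```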
> If I understand correctly, that means switching from declarative attributes to agreeing upon an API: each Variable should then respond to getter methods instead of directly exposing attributes.

Yes. Getter methods are one solution. Another could be a standard variable schema to represent a `Variable`. Etc. But it is the same idea, yes.
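The "standard variable schema" alternative could be sketched like this (invented names, using a `TypedDict` as the agreed shape):

```python
from typing import TypedDict


class VariableSchema(TypedDict, total=False):
    # The shared representation every TBS must be able to produce,
    # however it models variables internally.
    label: str
    documentation: str
    definition_period: str
    reference: str  # e.g. an ELI URI
```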
Thank you very much @nikhilwoodruff for this detailed answer! 😃
As a case study, in my opinion:
- `min` and `max` are typically values that should be considered for normalisation, as they would represent legislative values. I am not entirely sure why we are staying away from them at this stage (@benjello, do you know?).
- `soft_min`, `soft_max` and `hidden` are typically values that should not be stored in OpenFisca, whether in `metadata` or not. They make sense to store there now because the provider of the client app (PolicyEngine) is the main (unique) contributor to the TBS, but this data does not represent policy: it is not part of the model of the legislation, it is UX choices for one specific application. If each vendor is allowed to store such data, collisions and conflicts will happen: what will distinguish PolicyEngine's `hidden` from the `hidden` instruction for another app?

I fully understand the convenience of having one single data and metadata source for reusers, but OpenFisca is about modelling legislation in a generic enough way that all sorts of reuses can be created. This is what makes it powerful enough to aggregate communities around a single model 🙂
We can add `min` and `max` values in some places. But some of them evolve with legislation, so it may be cumbersome to store them. Also, the sign may depend on convention (which might be unstable). That's why I never felt this subject was a priority.
@MattiSG : I agree with your comment on metadata, but we should find a way to ease downstream enrichment of metadata since it is a recurring need of many reusers.
By the way, here is an extract of https://github.com/openfisca/openfisca-core/pull/1034#issue-718117616 that I think could be pertinent here:
> (...) This PR introduces the notion of `schemas`. But, contrary to regular `schemas` used to validate, serialise, and deserialise data from/to models, these schemas are purely type-safe data objects, meant for type-checks, without impact at runtime.
>
> **Extended note on data-schemas**
>
> What if, instead of just using `schemas` as typed-dicts for type-checks, we used them at runtime to compulsorily:
>
> a. check types at runtime?
> b. validate data at runtime?
> c. serialise/deserialise data, as suggested in #1071?
> d. enforce contracts, or domain logic, both in terms of data ("has to be > 0") and representation ("`to_json()`")?
>
> **For**
>
> - Explicit declaration and documentation of the two most important data models of OpenFisca, today dynamic and implicit: `Formula` and `Test`.
> - Explicit declaration and documentation of interface contracts and data transmutation (this is already, partially, done with `OpenAPI`).
> - Complete separation of the actual model and representational logic: extracting JSON logic from TaxBenefitSystem, Holder, and so on!
> - Composability: once all data and logic are split, and data and representation go through explicit schemas, OpenFisca would become fully modular. That is, users could use drop-in replacements tailored to their needs, as long as those replacements respect the interface contracts: data in MessagePack, models in Rust, a calculation engine in Node.js, etc.
>
> **Against**
>
> Performance: whereas declarative data transmutation can even increase performance for individual situations (Web API), a naive or out-of-the-box implementation (or anything with a complexity of at least `O(n)`) will certainly have a very penalising impact on performance for large population simulations.
>
> This is due to several reasons:
>
> - Economists already validate data before running a simulation (@benjello, can you confirm?).
> - `numpy` already provides data type-casting, with configurable levels of safety, C-optimised.
> - Others that I cannot think of right now 🤣
>
> Possible workarounds:
>
> - make it opt-in with a flag, for example an optional Web API flag `--safe` or `--strict`
> - implement validations, serialisation, etc., vectorially from the outset: literally extending the numpy C-API and vendoring C-extensions, pre-compiled or not
@maukoquiroga: yes, we try to validate data values before injection. I also confirm that the performance of OpenFisca relies on vector computation, so if we add checks that are not vectorizable, it will put a huge penalty on performance.
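To illustrate the vectorisation point, a toy sketch (assuming only numpy; the variable names are invented):

```python
import numpy as np

incomes = np.random.default_rng(0).uniform(0, 100_000, size=1_000_000)

# Vectorised check: a single C-level pass over the whole population,
# instead of a Python-level loop over each individual.
if (incomes < 0).any():
    raise ValueError("negative incomes found")

# numpy's configurable casting-safety levels, as mentioned above:
incomes.astype(np.int64, casting="unsafe")   # allowed, truncates
# incomes.astype(np.int64, casting="safe")   # would raise a TypeError
```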
Hi there!
I really enjoy OpenFisca, but I recently encountered an issue.
Here is what I did:
Added a `metadata: dict` attribute to a `Variable` class.

Here is what I expected to happen:

The attribute to be silently accepted.

Here is what actually happened:

A `ValueError` was thrown, specifying that `metadata` is not in the allowed attribute list.

Context
I identify more as a: