openfisca / openfisca-core

OpenFisca core engine. See other repositories for countries-specific code & data.
https://openfisca.org
GNU Affero General Public License v3.0

RFC: Formula signatures #890

Open Morendil opened 5 years ago

Morendil commented 5 years ago

RFC closing: October 1, 2019

Thoughts on formula signatures

One of the major forces shaping OpenFisca today is the signature for formulas:

def formula<date>(population, period, parameters):
    …

This signature implicitly defines a great deal of the contract between Core and country models; or to say it differently, it both empowers and constrains formula authors in significant ways. In turn, there are very strong constraints on Core arising from the way formula authors use and rely on this contract.

This RFC intends to catalog:

Not all problems with OpenFisca are related to how formulas are written, but many of them are. Because these problems are inter-related, this RFC takes the perspective that we're better off taking a "big picture" view, rather than trusting only to incremental design to solve these problems one by one. By iterating towards a clearly defined goal, we might well end up in a better place. We will still be iterating and not doing a lot of up-front design!

The draft proposal for an alternative approach is very likely not perfect and leaves room for improvement, but as it stands it preserves all the achievements of OpenFisca, while promising to solve many of our outstanding problems.

What the formula signature affords

The main benefits provided by the formula signature are as follows:

Accessing the variables that a formula depends on

OpenFisca is all about computing values that depend on other values:

class revenu_net_imposable(Variable):
    def formula(foyer_fiscal, period, parameters):
        revenu_net_global = foyer_fiscal('revenu_net_global', period)
        abattements = foyer_fiscal('abattements', period)
        return revenu_net_global - abattements

This is OpenFisca's way of expressing the equation "revenu_net_imposable = revenu_net_global - abattements".

It relies on the first argument to the formula, which we generically call the "entity" or "population" argument. (Historically the term "entity" was the only one used; "population" is a more recent distinction which highlights the two levels at which OpenFisca functions: abstract - there are such entities as families which comprise individuals - and concrete - we have a population of specific individuals grouped into specific families.)

The syntax for requesting a variable's value is function-call-like. This is related to the next benefit.
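
To make the function-call-like syntax concrete, here is a minimal sketch of how a population object could forward such a call to a simulation. ToySimulation and ToyPopulation are hypothetical stand-ins, not OpenFisca Core classes, and plain lists replace the NumPy arrays used in practice.

```python
class ToySimulation:
    """Hypothetical stand-in: stores inputs and formulas by variable name."""

    def __init__(self, inputs, formulas):
        self.inputs = inputs        # {(name, period): list of values}
        self.formulas = formulas    # {name: function(population, period, parameters)}
        self.population = ToyPopulation(self)

    def calculate(self, name, period):
        # Input values win; otherwise run the formula registered for the name.
        if (name, period) in self.inputs:
            return self.inputs[(name, period)]
        return self.formulas[name](self.population, period, None)


class ToyPopulation:
    def __init__(self, simulation):
        self.simulation = simulation

    def __call__(self, name, period):
        # The call syntax is only sugar: the work is delegated to the simulation.
        return self.simulation.calculate(name, period)


def revenu_net_imposable(foyer_fiscal, period, parameters):
    revenu_net_global = foyer_fiscal('revenu_net_global', period)
    abattements = foyer_fiscal('abattements', period)
    return [r - a for r, a in zip(revenu_net_global, abattements)]


simulation = ToySimulation(
    inputs={
        ('revenu_net_global', 2019): [30000, 18000],
        ('abattements', 2019): [3000, 1800],
    },
    formulas={'revenu_net_imposable': revenu_net_imposable},
)
result = simulation.calculate('revenu_net_imposable', 2019)
```

The amounts are made up for illustration; the point is only that calling the population with a variable name and a period triggers a computation elsewhere.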

Period-shifting variables

OpenFisca is strongly concerned with time. This is evident in the second parameter "period". These periods are instances of a class Period concerned with units of time (days, months, years) and spans of these units, e.g. the period from January to June 2019.

In particular, the value of a variable at a given time may depend on the value of another at a different time. This is often the case in the socio-fiscal domain where individual claimants may be entitled to social aids based on their revenues; to verify the declared revenues, the State generally relies on fiscal documents, which are delayed by up to two years from when the revenue is accrued. (More generally, the State often relies on proof by documents it has itself produced, and the time delays represent its own bureaucratic inertia.)

A representative case:

class livret_epargne_populaire_eligibilite(Variable):
    def formula(individu, period, parameters):
        rfr = individu.foyer_fiscal('rfr', period.n_2)
        plafond = individu('livret_epargne_populaire_plafond', period)
        return rfr <= plafond

Here we are concerned with "fiscal revenues declared the year before last", and relying on methods of Period to shift the period; this modified period can then be supplied as an argument to the function call on "individu".
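
As a rough illustration of period shifting, the sketch below models a yearly period with an n_2 property. ToyYear is a hypothetical class written for this example; the real Period class is richer (it also handles days and months), and the revenue amounts are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyYear:
    """Hypothetical yearly period; OpenFisca's Period class is richer."""
    year: int

    def offset(self, years):
        return ToyYear(self.year + years)

    @property
    def n_2(self):
        # "The year before last", as in period.n_2 above.
        return self.offset(-2)


# Values of 'rfr' stored per period (made-up amounts).
rfr_by_year = {ToyYear(2017): 19000, ToyYear(2019): 25000}

period = ToyYear(2019)
rfr = rfr_by_year[period.n_2]
```

Evaluating eligibility for 2019 thus reads the value recorded for 2017, which is what the formula above expresses with period.n_2.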

This is not the only way (and perhaps not the main way) time matters in OpenFisca.

Time-keying parameters to represent parametric evolution

One of the hallmarks of social legislation is that it changes periodically to reflect changed economic conditions. Often this is purely "parametric". In France, the minimum wage is subject to an annual increase (and occasional "nudges" beyond that). Nothing about the law changes, only the amount of the hourly rate called the "minimum wage" (which is present in many formulas of OpenFisca France as it determines many things other than how much people are paid).

Here is a typical use:

class aah_activite(Variable):
    def formula(individu, period, parameters):
        smic_horaire = parameters(period).cotsoc.gen.smic_h_b
        seuil_aah_activite = parameters(period).prestations.minima_sociaux.ppa.seuil_aah_activite * smic_horaire
        revenus_activites = individu('revenus_activites', period)
        return (revenus_activites >= seuil_aah_activite) * individu('aah', period)

This formula combines two parameters of the law. These parameters may well vary independently (and in fact usually do). The formula evaluates the value of the parameters for a given period, by first dereferencing the "parameters" arguments with a function-call-like syntax.

The arguments "period" and "parameters" are quite often used together in this way. (This syntax also allows for time-shifting parameters, but it's not clear that this is ever used.)
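
One way to picture time-keyed parameters is a list of dated values, where the value in force at an instant is the latest one whose start date is not after that instant. The sketch below assumes exactly that; ToyParameter and its at() method are hypothetical names, and the amounts are illustrative rather than actual legislative values.

```python
from bisect import bisect_right


class ToyParameter:
    """Hypothetical dated parameter: a value is in force from its start date on."""

    def __init__(self, values_by_start):
        # ISO dates sort correctly as plain strings.
        self.starts = sorted(values_by_start)
        self.values = [values_by_start[start] for start in self.starts]

    def at(self, instant):
        # Index of the last start date that is <= instant.
        index = bisect_right(self.starts, instant) - 1
        if index < 0:
            raise ValueError(f"no value defined at {instant}")
        return self.values[index]


# Illustrative amounts only, not actual legislative values.
smic_horaire = ToyParameter({'2018-01-01': 9.5, '2019-01-01': 10.0})

in_2018 = smic_horaire.at('2018-06-15')
in_2020 = smic_horaire.at('2020-03-01')
```

Two parameters modeled this way can change on independent dates, and a formula that multiplies them simply reads both at the same instant.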

Of course, not all evolutions of the law are as simple as adjusting a parameter.

Time-keying formulas to represent structural evolutions

The law also occasionally changes in more drastic ways. Take a flight of fancy and suppose that in 2020 the minimum wage is abolished altogether. What to do?

OpenFisca aims to represent not only the law as it is, but the law as it was and the law as it might be. One of our goals with OpenFisca is to compare the effects of past, present and proposed law on the economic situation of actual people, so that we can understand the effects of legislation and participate accordingly in political debate about these changes.

In this hypothetical case, we would expect to see something like the following:

class aah_activite(Variable):
    def formula_2020_01_01(individu, period, parameters):
        revenus_activites = individu('revenus_activites', period)
        return (revenus_activites >= 0) * individu('aah', period)
    def formula(individu, period, parameters):
        …previous formula…

The old formula, relying on the steady change of parameters, has been replaced by a new formula altogether. Although it has some structural similarities with the previous formula, it is easier to rewrite it to account for the new face of social legislation as of 2020.
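
The selection between dated formulas could work roughly as follows. This is a sketch under stated assumptions, not Core's implementation: select_formula is a hypothetical helper, a bare "formula" is treated as valid since forever, and the formula bodies are placeholders.

```python
import datetime


def select_formula(variable_class, instant):
    """Return the most recent formula whose start date is not after `instant`.

    Hypothetical helper: a bare `formula` is treated as valid since forever.
    """
    candidates = []
    for name, function in vars(variable_class).items():
        if name == 'formula':
            candidates.append((datetime.date.min, function))
        elif name.startswith('formula_'):
            year, month, day = (int(part) for part in name.split('_')[1:])
            candidates.append((datetime.date(year, month, day), function))
    valid = [(start, function) for start, function in candidates if start <= instant]
    return max(valid, key=lambda pair: pair[0])[1]


class aah_activite:
    def formula_2020_01_01(individu, period, parameters):
        return 'formula from 2020'

    def formula(individu, period, parameters):
        return 'original formula'


before = select_formula(aah_activite, datetime.date(2019, 6, 1))
after = select_formula(aah_activite, datetime.date(2020, 6, 1))
```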

Aggregating individual values of a group entity

Other than time, OpenFisca is also concerned significantly with social groupings we call "entities", the units to which the law grants certain rights; for instance some social benefits are given at the level of the family (individuals related by blood), the fiscal unit (individuals declaring fiscal revenues), or the household (individuals living in the same housing).

All variables relate to a specific entity, either the individual (which must exist in all country models) or a group.

In one common case we know the values of a variable for each member of a grouping, and there is a meaningful way to aggregate these values over the entire grouping:

def formula(famille, period, parameters):
    …
    rsa_base_ressources_i = famille.members('rsa_base_ressources_individu', period)
    rsa_base_ressources_i_total = famille.sum(rsa_base_ressources_i)
    …

In this case the aggregation operation is a sum. Other, less common ways of aggregating include taking the minimum or the maximum. In some cases we might also be interested in evaluating a predicate over the group (for instance "is a child with independent revenues") and counting members meeting that predicate. There is a notion of "role of the individual in the group they belong to", for instance "child", "parent", "person filing the taxes", which is often used as a predicate, but even more often in our next use case.
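
Conceptually, the sum takes one value per individual and produces one value per group. A minimal sketch, assuming each individual carries the index of the family it belongs to (sum_by_group and famille_index are hypothetical names, and the amounts are invented):

```python
def sum_by_group(member_values, group_index, group_count):
    """Sum one value per member into one value per group.

    `group_index[i]` is the position of the group that member `i` belongs to.
    """
    totals = [0] * group_count
    for value, group in zip(member_values, group_index):
        totals[group] += value
    return totals


# Five individuals in two families: members 0-2 in family 0, members 3-4 in family 1.
famille_index = [0, 0, 0, 1, 1]
rsa_base_ressources_i = [1000, 500, 0, 800, 200]
rsa_base_ressources_i_total = sum_by_group(rsa_base_ressources_i, famille_index, 2)
```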

Pivoting from one group entity to another

It sometimes happens, though it's uncommon, that legislation requires "bridging a gap" between one kind of grouping and another.

For instance, the household tax is paid at the household level (people sharing a housing unit) but it can be taken into account in fiscal law, and a formula in the latter domain might need to "transport" the tax information from one entity to the other.

Suppose that two young people live at the same address, one of them with a baby, but as roommates rather than as a family unit. This, then, is a single household but two fiscal units.

def formula(foyer_fiscal, period, parameters):
    …
    taxe_habitation_i = foyer_fiscal.members.menage('taxe_habitation', period)
    taxe_habitation = foyer_fiscal.sum(taxe_habitation_i, role = Menage.PERSONNE_DE_REFERENCE)

These two lines of code function conceptually as a single operation, distinct from aggregation. In the first line, the expression foyer_fiscal.members.menage allows us to retrieve a value which contains, for each individual, the level of household tax for the household that the individual belongs to.

In our hypothetical example, there are three people, so this value will consist of three copies of the tax for that household. Since there are two tax units, if we only summed these values at the household level, we would end up with one unit seen as paying 2X and the other paying X, where X is the tax amount.

This does not reflect reality, so instead one of the household roles serves as a "pivot" to go from tax unit to household. Only one person in each household can have this role, so this ensures that of the two tax units, one will be seen as paying 0, and the other will be paying X.

The net result is that summing household tax over tax units should give us exactly the same result as summing over households; this is a consistency constraint, since it represents the same economic notion.
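
The roommates example can be worked through with numbers. The sketch below assumes X = 600, with person 0 filing alone and holding the reference-person role, and persons 1 and 2 (the parent and the baby) forming the second tax unit. All names are hypothetical and plain lists stand in for arrays.

```python
def sum_by_group(member_values, group_index, group_count, mask=None):
    """Sum member values per group, keeping only members where `mask` is true."""
    totals = [0] * group_count
    for i, (value, group) in enumerate(zip(member_values, group_index)):
        if mask is None or mask[i]:
            totals[group] += value
    return totals


taxe_habitation_menage = [600]                    # one household paying X = 600
menage_index = [0, 0, 0]                          # all three people share it
foyer_fiscal_index = [0, 1, 1]                    # two tax units
est_personne_de_reference = [True, False, False]  # exactly one per household

# Step 1: project the household value onto each individual (three copies of X).
taxe_habitation_i = [taxe_habitation_menage[m] for m in menage_index]

# Step 2: sum per tax unit, with and without the pivot role.
naive = sum_by_group(taxe_habitation_i, foyer_fiscal_index, 2)
pivoted = sum_by_group(taxe_habitation_i, foyer_fiscal_index, 2,
                       mask=est_personne_de_reference)
```

Without the role filter the tax is counted three times across the two tax units; with it, the total over tax units equals the total over households.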

Problems with the formula signature

So the current way of expressing formulas solves lots of problems, and that's a good thing! But it also gives rise to a bunch of frustrations.

So what are the problems?

Let's examine these:

Privileging the exceptional case over the common case

In the overwhelming majority of formulas the following are the case:

This means that in the vast majority of cases the "period" argument is superfluous, in the sense that the formula does not need to examine it, doesn't do anything to it, but just passes it on to other calls. In the general case, we are saying something more like this, with the period being implicit ("whatever period we're looking at right now"):

class revenu_net_imposable(Variable):
    def formula(foyer_fiscal, parameters):
        revenu_net_global = foyer_fiscal('revenu_net_global')
        abattements = revenu_net_global * parameters.taux_abattement
        return revenu_net_global - abattements

The population argument is also superfluous if we do not need to aggregate or pivot. In fact, historically OpenFisca formulas used to look like this:

class revenu_net_imposable(Variable):
    def formula(simulation, parameters):
        revenu_net_global = simulation.calculate('revenu_net_global')
        abattements = revenu_net_global * parameters.taux_abattement
        return revenu_net_global - abattements

In fact, "parameters" are not something free-floating, they are supplied by the simulation so we could also have:

class revenu_net_imposable(Variable):
    def formula(simulation):
        revenu_net_global = simulation.calculate('revenu_net_global')
        abattements = revenu_net_global * simulation.parameter('taux_abattement')
        return revenu_net_global - abattements

This would be a more uniform, more Python-like syntax with fewer surprises such as the function-call-like protocols of "foyer_fiscal" and "parameters", and would be preferable if we had an alternative for the few cases where we need to period-shift, aggregate or pivot.

Lack of expressive power for abstraction, in particular parametrizing formulas

The above form would be preferable but still not ideal. Consider a common case where a formula depends on many variables:

class complex_var(Variable):
    def formula(simulation):
        var1 = simulation.calculate('var1')
        var2 = simulation.calculate('var2')
        var3 = simulation.calculate('var3')
        …etc…

This is a lot of boiler-plate code, which is reason enough to be dissatisfied. But much worse than that, it makes OpenFisca a programming language where all variables are effectively globals, and in which functions (formulas) are not allowed to call each other with parameters.

This becomes painfully evident where the law calls for a complex computation, which already exists elsewhere, with a hypothetical input. Suppose a law proposes a reduction in social security contributions for employers, and this reduction is "20% of the contributions that would be paid at 50% of the minimum wage".

We would like to do this:

class social_security_contributions(Variable):
    def formula(person, period, parameters):
        salary = person.calculate('salary')
        …

class contribution_reduction(Variable):
    def formula(person, period, parameters):
        minwage = parameters(period).minwage
        at_half_minwage = social_security_contributions(salary = minwage / 2)
        return 0.2 * at_half_minwage

We can't parametrize our formula that way, since "salary", an input to "social_security_contributions", can only be computed by explicitly asking for it. It lives in a memory space that is global to the entire set of formulas. So, when this happens in practice, we have to resort to contortions which often come down to copy-pasting formulas to adapt them to a specific case.

The following arrangement instead would work much better:

class social_security_contributions(Variable):
    def formula(simulation, salary):
        …

class contribution_reduction(Variable):
    def formula(simulation):
        minwage = simulation.parameter('minwage')
        at_half_minwage = social_security_contributions(simulation, salary = minwage / 2)
        return 0.2 * at_half_minwage

Difficulty of performing static dependency analysis

In addition to treating all variables as global and requiring lots of boilerplate, the current approach has another drawback: it makes static analysis (the question "in general, what variables depend on what variables") much harder to perform. Generating a dependency graph takes clever hacks, parsing of Python code, or other contortions.
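
To give an idea of what "parsing of Python code" involves, here is a sketch of such a hack using the standard ast module. It relies on a heuristic (any call whose first argument is a string literal is taken to be a variable request), which is an assumption of this example and not how any existing OpenFisca tool is known to work.

```python
import ast

SOURCE = '''
class revenu_net_imposable(Variable):
    def formula(foyer_fiscal, period, parameters):
        revenu_net_global = foyer_fiscal('revenu_net_global', period)
        abattements = foyer_fiscal('abattements', period)
        return revenu_net_global - abattements
'''


def dependencies(source):
    """Map each class name to the variable names its formulas request.

    Heuristic: any call whose first argument is a string literal counts.
    """
    graph = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.ClassDef):
            continue
        names = []
        for call in ast.walk(node):
            if (isinstance(call, ast.Call) and call.args
                    and isinstance(call.args[0], ast.Constant)
                    and isinstance(call.args[0].value, str)):
                names.append(call.args[0].value)
        graph[node.name] = sorted(names)
    return graph


graph = dependencies(SOURCE)
```

The heuristic breaks as soon as a variable name is computed at run time or passed through a helper function, which is part of why this kind of analysis is fragile.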

Violating OO design principles in the population argument

The motivation for introducing the population argument is explained here and in the linked discussion: https://github.com/openfisca/openfisca-core/pull/415

There is no doubt that this change was overall for the better, but it did have drawbacks.

It is difficult to state with precision the type of the "population" object, the first argument in the formula signature:

def formula<date>(population, period, parameters):
    …

It can't be Population, because sometimes we will call the sum() method on it. It can't be GroupPopulation, because we sometimes pass the individuals. So we could say we in fact have two signatures, one for "individual" variables and one for "group" variables:

def formula<date>(individus : Population, period, parameters):
    …

def formula<date>(population : GroupPopulation, period, parameters):
    …

Are these two types completely disjoint? Obviously not, they share a lot of services:

These commonalities made it tempting to have GroupPopulation inherit from Population. This is problematic if only because Population::has_role only makes sense for individuals (it is individuals who have roles within a group), but this method belongs to the contract of GroupPopulation by inheritance. You can call has_role on a group and that will return something with the "wrong" number of items.

Moreover, "triggering a computation" - which is what the function-call-like syntax is for - is not properly a responsibility of the population. It's going to be delegated to the simulation anyway. So this part of the shared protocol only exists for syntactic convenience.

Navigation between entities is confusing and leaves too much up to formula authors

The current implementation makes it hard to reason about the relationship between entities. For instance, to re-use a snippet from the "pivoting" section:

def formula(foyer_fiscal, period, parameters):
    …
    taxe_habitation_i = foyer_fiscal.members.menage('taxe_habitation', period)
    taxe_habitation = foyer_fiscal.sum(taxe_habitation_i, role = Menage.PERSONNE_DE_REFERENCE)

It is hard to reason one's way to this result. In the first line, it takes some guessing to know what "foyer_fiscal.members.menage" will do. It's hard to even know what kind of thing it is.

If you're tempted to say "it's a group entity representing households", because that's what "menage" means, you would be wrong: the 'taxe_habitation' computation will return a value with as many items as there are individuals. This is why we have to be careful to sum with the role as pivot. What would happen if we used a Role from a different entity, for instance Famille? It's hard to say.

So, in general, formula authors often rely on copy-pasting from existing code examples and adapting them, a form of cargo-cult coding. Or they have to suffer long debugging sessions and cryptic error messages.

Ignorance of the semantic roles of variable dependencies (conditions, selectors, aggregations, etc.)

What the above argument implies is that the facilities offered by OpenFisca to formula authors are still too low-level, too close to the underlying Numpy implementation. (Even though, historically, this level of abstraction has been rising, which is a good thing, there is a risk that the current design becomes too much a "drag" on raising it further and effectively stops progress in this regard.)

Among the consequences: some extremely reliable and repeatable "patterns" of implementing law into Python code are reinvented many times over by different formula authors, each implementing them in an idiosyncratic way.

One emblematic case is the lowly condition variable, often seen in return clauses:

class aah_activite(Variable):
    def formula(individu, period, parameters):
        smic_horaire = parameters(period).cotsoc.gen.smic_h_b
        seuil_aah_activite = parameters(period).prestations.minima_sociaux.ppa.seuil_aah_activite * smic_horaire
        revenus_activites = individu('revenus_activites', period)
        return (revenus_activites >= seuil_aah_activite) * individu('aah', period)

This arguably hides something that could play a large role both in simplifying code and making performance gains, namely that the first variable is a condition on the second:

class aah_activite_eligibilite(Variable):
    def formula(individu, period, parameters):
        smic_horaire = parameters(period).cotsoc.gen.smic_h_b
        seuil_aah_activite = parameters(period).prestations.minima_sociaux.ppa.seuil_aah_activite * smic_horaire
        revenus_activites = individu('revenus_activites', period)
        return (revenus_activites >= seuil_aah_activite)

class aah_activite(Variable):
    formula = conditional_on('aah_activite_eligibilite', 'aah')

In many cases, instead of having to write formulas, we could give formula authors the tools to just compose them from existing elements.
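
A combinator like conditional_on could be little more than a function that builds a formula. The sketch below is one possible reading, assuming the result is 0 wherever the condition is false; a dictionary lookup stands in for the population, lists stand in for arrays, and the amounts are invented.

```python
def conditional_on(condition_name, value_name):
    """Build a formula returning `value_name` where `condition_name` holds, else 0.

    Hypothetical combinator; the name comes from the example above.
    """
    def formula(population, period, parameters):
        condition = population(condition_name, period)
        value = population(value_name, period)
        return [v if c else 0 for c, v in zip(condition, value)]
    return formula


# A dictionary lookup stands in for the population's call syntax.
values = {
    'aah_activite_eligibilite': [True, False, True],
    'aah': [900, 900, 450],
}
population = lambda name, period: values[name]

aah_activite = conditional_on('aah_activite_eligibilite', 'aah')
result = aah_activite(population, '2019-01', None)
```

Because the combinator knows which variable is the condition, an engine could in principle skip computing the value for individuals who fail it.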

Similarly:

class aide_logement(Variable):
    def formula(famille, period):
        apl = famille('apl', period)
        als = famille('als', period)
        alf = famille('alf', period)

        return max_(max_(apl, als), alf)

We could do this instead:

class aide_logement(Variable):
    formula = maximum_of('apl', 'als', 'alf')

The benefits are two-fold: the code would be simpler to read, and we could implement optimizations "under the hood".
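
A maximum_of helper could be sketched along the same lines: it takes variable names and returns a formula computing the element-wise maximum. This is a hypothetical implementation with lists in place of arrays and a dictionary lookup in place of the population; the amounts are invented.

```python
def maximum_of(*variable_names):
    """Build a formula returning, per member, the largest of the named variables."""
    def formula(population, period, parameters=None):
        columns = [population(name, period) for name in variable_names]
        return [max(row) for row in zip(*columns)]
    return formula


values = {'apl': [100, 0, 50], 'als': [80, 120, 50], 'alf': [0, 0, 70]}
population = lambda name, period: values[name]

aide_logement = maximum_of('apl', 'als', 'alf')
result = aide_logement(population, '2019-01')
```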

Lack of symmetry in evolution of law over time

This is a relatively minor complaint: to make a formula start on a date, the syntax is formula_<date> (for instance formula_2020_01_01), but to make it end on a date, one must set an "end" attribute on the variable.

A proposed alternative approach

Is there One True Wonder Syntax that solves all of the above desiderata and all of the above problems? We don't know yet! But one way to learn for sure is to make a proposal, or several, and try to improve on them.

Like any prototype, the idea is that it's cheaper to kill a bad proposal than it is to recover from a bad implementation. To "kill" a proposal, there are questions we can ask: "is it hard to do something that we do often", "is it very hard to do something we do from time to time", "will it be hard to migrate from the existing model code to that", and so on.

In that spirit:

Variables: creating and accessing from formulas

@model_variable
def create():
    return make_variable(
        value_type = float,
        entity = FoyerFiscal,
        label = u"Revenu net imposable",
        reference = "http://impotsurlerevenu.org/definitions/115-revenu-net-imposable.php",
        definition_period = YEAR,
    )

@variable('revenu_net_imposable')
def formula(revenu_net_global, abattements):
    return revenu_net_global - abattements
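
One way such a signature could be wired up is to treat argument names as dependency names and resolve them with inspect.signature. The sketch below is an assumption about how this might work, not a design the proposal specifies: FORMULAS, variable and calculate are hypothetical, and the 10% abatement is illustrative.

```python
import inspect

FORMULAS = {}


def variable(name):
    """Register the decorated function as the formula for `name` (hypothetical)."""
    def register(function):
        FORMULAS[name] = function
        return function
    return register


def calculate(name, inputs):
    """Resolve `name` from inputs, or by computing its formula's arguments first."""
    if name in inputs:
        return inputs[name]
    function = FORMULAS[name]
    arguments = {
        dependency: calculate(dependency, inputs)
        for dependency in inspect.signature(function).parameters
    }
    return function(**arguments)


@variable('abattements')
def abattements_formula(revenu_net_global):
    return revenu_net_global // 10  # illustrative 10% abatement


@variable('revenu_net_imposable')
def revenu_net_imposable_formula(revenu_net_global, abattements):
    return revenu_net_global - abattements


result = calculate('revenu_net_imposable', {'revenu_net_global': 30000})
dependencies = list(inspect.signature(revenu_net_imposable_formula).parameters)
```

A side effect of this style is that the dependency list is available without running or parsing the formula body, which speaks to the static analysis problem discussed earlier.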

Time-keying

@period(from = "2019-01-01", to = "2022-01-01")
@variable('revenu_net_imposable')
def formula(revenu_net_global, abattements):
    return revenu_net_global - abattements
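
Note that "from" is a reserved word in Python, so it cannot be used as a keyword argument name as written. The sketch below uses "start" and "end" instead, and assumes a half-open interval (valid from start, up to but excluding end); both choices are assumptions of this example, as are the helper names.

```python
import datetime

DATED_FORMULAS = {}


def period(start, end):
    """Attach a validity interval [start, end) to a formula (hypothetical)."""
    def decorate(function):
        function.validity = (datetime.date.fromisoformat(start),
                             datetime.date.fromisoformat(end))
        return function
    return decorate


def variable(name):
    def register(function):
        DATED_FORMULAS.setdefault(name, []).append(function)
        return function
    return register


def formula_at(name, instant):
    """Return the formula registered for `name` that is valid at `instant`, if any."""
    for function in DATED_FORMULAS.get(name, []):
        start, end = getattr(function, 'validity',
                             (datetime.date.min, datetime.date.max))
        if start <= instant < end:
            return function
    return None


@period(start="2019-01-01", end="2022-01-01")
@variable('revenu_net_imposable')
def formula(revenu_net_global, abattements):
    return revenu_net_global - abattements


in_2020 = formula_at('revenu_net_imposable', datetime.date(2020, 5, 1))
in_2022 = formula_at('revenu_net_imposable', datetime.date(2022, 1, 1))
```

Expressing both bounds in one decorator would address the asymmetry between start and end dates noted among the problems.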

Time-shifting variables

@variable('livret_epargne_populaire_eligibilite')
@period_shift('rfr', n_2)
def formula(rfr, plafond):
    return rfr <= plafond

Parameters

@variable('aah_activite_eligibilite')
@parameter('smic_horaire', 'cotsoc.gen.smic_h_b')
@parameter('ppa', 'prestations.minima_sociaux.ppa')
def formula(smic_horaire, ppa, revenus_activites):
    seuil = ppa.seuil_aah_activite * smic_horaire
    return revenus_activites >= seuil

Scale polymorphism

@variable('impot_revenu')
@marginal_scale('bareme', 'impot_revenu.bareme')
def formula(bareme, salaire_imposable):
    return bareme.calc(salaire_imposable)

Conditionals

@variable('aah_activite')
@condition('aah_activite_eligibilite')
def formula():
    …formula of 'aah'

Selection

@variable('aide_logement')
    (select_one('aide_logement_categorie', ['apl','als','alf']))

Aggregating

@variable('rsa')
@aggregate('rsa_base_ressources', sum)
def formula(rsa_base_ressources):
    …

Pivoting

@variable('bouclier_fiscal')
@pivot('taxe_habitation')
def formula(taxe_habitation, …):
    …

Parametrizing (formulas calling formulas)

@variable('social_security_contributions')
def formula(salary):
    …

@variable('contribution_reduction')
@calls('social_security_contributions', 'salary')
@parameter('minwage', 'cotsoc.gen.smic_h_b')
def formula(social_security_contributions, minwage):
    at_half_minwage = social_security_contributions(salary = minwage / 2)
    return 0.2 * at_half_minwage
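
To check that "formulas calling formulas" is feasible, here is a toy resolver in which an argument marked with a calls decorator is injected as a callable that recomputes the target with some inputs overridden. Everything here is hypothetical: the single-argument form of calls, the resolver, the flat 20% contribution rate, and the treatment of minwage as a plain input rather than a parameter.

```python
import functools
import inspect

FORMULAS = {}


def variable(name):
    def register(function):
        FORMULAS[name] = function
        return function
    return register


def calls(name):
    """Mark argument `name` as injected as a callable, not a value (hypothetical)."""
    def decorate(function):
        function.callables = getattr(function, 'callables', ()) + (name,)
        return function
    return decorate


def calculate(name, values):
    if name in values:
        return values[name]
    function = FORMULAS[name]

    def call(target, **overrides):
        # Recompute `target` with some of its inputs replaced.
        return calculate(target, {**values, **overrides})

    arguments = {}
    for dependency in inspect.signature(function).parameters:
        if dependency in getattr(function, 'callables', ()):
            arguments[dependency] = functools.partial(call, dependency)
        else:
            arguments[dependency] = calculate(dependency, values)
    return function(**arguments)


@variable('social_security_contributions')
def contributions_formula(salary):
    return salary * 20 // 100  # illustrative flat rate, not actual law


@variable('contribution_reduction')
@calls('social_security_contributions')
def reduction_formula(social_security_contributions, minwage):
    at_half_minwage = social_security_contributions(salary=minwage // 2)
    return at_half_minwage * 20 // 100  # 20% of that hypothetical amount


inputs = {'salary': 3000, 'minwage': 1500}
reduction = calculate('contribution_reduction', inputs)
contributions = calculate('social_security_contributions', inputs)
```

The override only affects the nested call: the ordinary contributions are still computed from the actual salary. A real engine would also have to make sure cached results are not reused for such hypothetical calls.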

Fin

That's it! Please chime in with critiques, suggestions for improvement or general comments.

benjello commented 5 years ago

It is a nice proposal, but a few years ago the current notation was adopted so that non-coders could easily read the formulas. Helper functions were also allowed, to lighten overly repetitive code.

I think that non-coder users such as economists will get lost if they have to use decorators and a non-linear style of programming the functions.

benjello commented 5 years ago

Nevertheless, a few simplifications could be introduced. For example, in the simple case where all the variables are taken for the specified entity and at the current period, we could switch to a more direct formula.

bonjourmauko commented 5 years ago

@Morendil Thanks for this contribution, I'm all in!

@benjello I think we should have a discussion with the community to weigh the pros and cons, and see if and how we could deploy an incremental implementation roadmap.

I agree that the level of abstraction might be overwhelming for non-coders, on the other hand this would render :

I think this is a game changer 👍