openfisca / openfisca-core

OpenFisca core engine. See other repositories for countries-specific code & data.
https://openfisca.org
GNU Affero General Public License v3.0
170 stars 75 forks source link

Switch parameters to YAML #523

Closed MattiSG closed 7 years ago

MattiSG commented 7 years ago

The current parameter syntax is very heavy, due to its format being XML. Its vocabulary is poorly readable and made mostly of French words and abbreviations. Some attributes are inconsistent.

Example:

<NODE code="bouclier_fiscal" description="Bouclier fiscal" origin="openfisca">
  <CODE code="taux" description="Taux d'imposition maximal" format="percent" origin="openfisca">
    <END deb="2011-01-01" />
    <VALUE deb="2007-01-01" valeur="0.5" /> # Suppression du bouclier fiscal à compter des revenus réalisés en 2011. #
    <VALUE deb="2006-01-01" valeur="0.6" />
  </CODE>
</NODE>

This proposal aims at making it much easier to contribute to the parameters, by:

  1. Improving their readability.
  2. Decreasing the amount of purely syntactic tokens.
  3. Unifying the vocabulary with the rest of OpenFisca (engine & API).
  4. Improving diff readability.

1 - references

There is currently no mechanism for storing value-level references. This is obviously making some critical information miss, and leads to dirty workarounds (see https://github.com/openfisca/openfisca-france/issues/766) and lost contributors (see https://github.com/openfisca/openfisca-france/pull/728#issuecomment-304023834).

Proposition A

bouclier_fiscal.taux:
 description: Bouclier fiscal
 type: percentage
 values: 
  2006-01-01: 0.6
  2007-01-01: 0.5
  2011-01-01: null
 references:
  2006-01-01: http://example.org
  2007-01-01:
  2011-01-01: http://example.org

Proposition B

bouclier_fiscal.taux:
 description: Bouclier fiscal
 type: percentage
 values: 
  2006-01-01:
   value: 0.6
   reference: http://example.org
  2007-01-01:
   value: 0.5
  2011-01-01:
   value: null
   reference: http://example.org

2 - names

(assuming format 1B for references)

Proposition A

Inline hierarchy in the ID:

bouclier_fiscal.taux:
 description: Bouclier fiscal
 type: percentage
 values: 
  2006-01-01:
   value: 0.6
   reference: http://example.org
  2007-01-01:
   value: 0.5
  2011-01-01:
   value: null
   reference: http://example.org

Proposition B

Keep the current hierarchy:

bouclier_fiscal:
 taux:
  description: Bouclier fiscal
  type: percentage
  values: 
   2006-01-01:
    value: 0.6
    reference: http://example.org
   2007-01-01:
    value: 0.5
   2011-01-01:
    value: null
    reference: http://example.org

Means there would be a few reserved names in order to identify leaves (such as description, type, values).

Proposition C

Rely on filesystem hierarchy:

# bouclier_fiscal.yaml
taux:
 description: Bouclier fiscal
 type: percentage
 values: 
  2006-01-01:
   value: 0.6
   reference: http://example.org
  2007-01-01:
   value: 0.5
  2011-01-01:
   value: null
   reference: http://example.org

Open questions

  1. What are the conflicts attributes used for? Can they be dropped? If not, how to represent them?
  2. Can the origin attributes be absorbed in the reference attributes? I.e. can an opaque identifier be systematically superseded by an URI? (I frame it this way to show that it is definitely technically feasible, and that the question is purely semantic)
  3. What is the exact correspondence table between the current type + format attributes couple and the new type attribute? Is it bijective?
  4. Can this new format be thought of without taking into account table parameters (https://github.com/openfisca/openfisca-core/issues/473)? If not, how can we handle them?
MattiSG commented 7 years ago
  1. They are used when the Baremes-IPP merge scripts run. They should be replaced by outputs of the run, and their results left up to the user to handle.
  2. Yes. Go as close as possible to the original source. If no law reference is available, the vendor is acceptable as a reference.
  3. @michelbl says it is bijective.
  4. Yes, #473 is not yet handled today, and the “multiple names” workaround can still be used with the current format.

Other question raised by @michelbl: what about the XML comments? See openfisca/openfisca-france#766.

MattiSG commented 7 years ago

How to handle <PLACEHOLDER>?

MattiSG commented 7 years ago

Workshop results.

Discussed formats

Consensus

Open questions

benjello commented 7 years ago

My vote: 1B-2B.

michelbl commented 7 years ago

How to handle <PLACEHOLDER>?

A proposition :

# bouclier_fiscal.yaml
taux:
 description: Bouclier fiscal
 type: percentage
 values: 
  2011-01-01:
   value: 0.5
   reference: http://example.org
  2014-01-01: {}

My vote: 1B-2B.

@benjello 2B means 1 big yaml and no filesystem hierarchy, it differs with the consensus "Represent hierarchy in the filesystem." :thinking:

benjello commented 7 years ago

I am in favor of a middle ground: use the filesystem hierarchy until you arrive to the "prestation level". It does have a lot of value to be able to see all the parameters concerning one prestation in one place but isolated from the whole set of parameters.

MattiSG commented 7 years ago

I propose that nesting is hierarchy is loaded indifferently from the filesystem and the YAML inner nesting, with no arbitrary limitation. We can recommend (and give only examples) to use a file for each parameter, or at least to have it contain only a set of nodes, but leave it to the user to know what is the most readable. I would tend to consider this nesting choice as an editorial element, on which the writer is trusted to do its best to be readable.

MattiSG commented 7 years ago

For PLACEHOLDER, what about using a placeholder or expected subkey in place of value?

@benjello, does it happen often that one has a strong idea of what the next value will be, but cannot be 100% sure until the law is published? For example, is that likely?

taux:
 description: Bouclier fiscal
 type: percentage
 values: 
  2017-01-01:
   value: 0.5
   reference: http://example.org
  2018-01-01:
   expected: 0.6
benjello commented 7 years ago

Yes such situation can happen (and might had existed in parameters file). And it is definitively better to distinguish an expected value from a real value.

MattiSG commented 7 years ago

Yes such situation can happen

Thanks! My question was not very clear: how frequently can one give an expected value vs simply expecting the next update date?

benjello commented 7 years ago

No that doesn't happen very often. Sometimes some evolution are announced but the real value is the one indicated in a subsequent décret or law and it may differ.

For pension legislation, the situation may appear more often.

MattiSG commented 7 years ago

Great, thank you for these details.

I thus suggest the following format:

taux:
 description: Bouclier fiscal
 type: percentage
 values: 
  2017-01-01:
   value: 0.5
   reference: http://example.org
  2018-01-01:
   expected: 0.6  # if the value can be predicted
  2019-01-01:
   expected  # leave empty if the value cannot be predicted

In this implementation, the following cases must be handled:

  1. Written as above. This will be parsed as { "2019-01-01" : "expected" }.
  2. Written on a single line: 2019-01-01: expected. This is standard YAML and should be transparent after parsing, and will be parsed as { "2019-01-01" : "expected" }.
  3. Written with a colon and an empty value: 2019-01-01: expected:. This will be parsed as { "2019-01-01" : { "expected": null } }.

@michelbl Would sometimes getting an object and sometimes getting a string be annoying?

MattiSG commented 7 years ago

We still need to create the equivalence table between the current type + format and the new type attribute. Does anyone have the list of the current couples?

michelbl commented 7 years ago

We still need to create the equivalence table between the current type + format and the new type attribute. Does anyone have the list of the current couples?

The only documentation is the code : https://github.com/openfisca/openfisca-core/blob/master/openfisca_core/legislationsxml.py#L11 https://github.com/openfisca/openfisca-core/blob/master/openfisca_core/legislationsxml.py#L76

type -> unit

format -> format

There are other transformations, for example :

I believe this is the opportunity to agree on a common terminology between the YAML representation and the in-memory representation. This terminology would be in English and consistent with the use of capital letters. The parsing of YAML files would involve validation against a json schema and the minimal set of transformations.

michelbl commented 7 years ago

@michelbl Would sometimes getting an object and sometimes getting a string be annoying?

No, this is not annoying.

I think that during parsing, "expected" tags are kept (with or without a value). This intermediate in-memory representation can either be reduced to a given instant (to be used in formulas) or be searched for "expected" tags. Indeed, using a regex to find these tags will not be as easy with YAML as it is with XML.

MattiSG commented 7 years ago

using a regex to find these tags will not be as easy with YAML as it is with XML.

Why so? Identifying all expected leaves or properties sounds rather easy.

type -> unit format -> format

Here is the list of all values and their repartition:

   4 format: bool
 401 format: float
 152 format: integer
 952 format: percent
  25 type: age
   1 type: days
   1 type: hours
 496 type: monetary
   1 type: months

_(obtained via xsltproc extract-formats.xsl openfisca_france/parameters/*.xml | sort | uniq -c, the XSLT stylesheet being here; I checked results count consistency with grep format | wc -l)_

I thus see that the types days, hours and months are used each only once. I think this leads us to decide whether type information, which is a recurrent documentary need, should be described as normalised metadata or be put in the description.

Here is the current usage:

      <CODE code="TP_nbh" description="Nombre d'heure du temps plein" format="integer" origin="openfisca" type="hours">
        <VALUE deb="2002-01-01" valeur="1820" />
      </CODE>
      <CODE code="TP_nbj" description="Nombre de jour du temps plein" format="integer" origin="openfisca" type="days">
        <VALUE deb="2002-01-01" valeur="360" />
      </CODE>

As one can see, both description and metadata is used, and I would see a benefit in including the type in the variable name itself.

As a maintainer, I don't see the value of the type attribute in this context, as I already know the value is in hours from the description (and, possibly, from the variable name).

As a reuser, I could see the value of being able to suffix the value with a unit to present it to the user. However, maintaining a mapping of types to suffices sounds like a never-ending task to me.

Question

What is the current use of the format? Is it simply exposed to the outside as part of documentation, or does it trigger some special handling, such as dividing a percent by 100?

fpagnoux commented 7 years ago

does it trigger some special handling, such as dividing a percent by 100?

I'm pretty sure it doesn't. The percent parameters values are already divided by 100 in the XML files:

<CODE code="coeff_enfant_supplementaire" description="Coefficient à rajouter aux plafonds pour chaque enfant à charge" format="percent" origin="openfisca">
  <VALUE deb="2014-01-01" valeur="0.3"/>
</CODE>

I think the format and type metadata are only documentation, but @michelbl probably has a better knowledge of this part of the code.

benjello commented 7 years ago

Historically format is information of about how to display the valeur. It doesn't trigger any computation.

michelbl commented 7 years ago

I think the format and type metadata are only documentation,

I think type is only documentation. The value monetary may be used somewhere to pretty print a parameter : xxx €

format is used by the legislation parser to know if a parameter should be parsed as a float, an int or a boolean. However YAML knows about booleans, integers and floats, so format will not be used for that with YAML.

It seems to me that the only use of type and format is to specify the formatting to use when printing the parameter. Then only the format metadata should be kept. It's would be optional and its possible values could be monetary and percent. However this issue is common with variables (which can be percents, amounts, ages... and have to be printed at some point)...

michelbl commented 7 years ago

After a discussion with @fpagnoux we suggest two solutions :

1) Drop format and type. 2) Merge format and type to a single field. Make it optional and allow values :

Please vote @MattiSG @benjello @sandcha @Anna-Livia and any user/contributor of openfisca !

benjello commented 7 years ago

It seems to me that you should also add duration in hours, duration in months .

MattiSG commented 7 years ago

If we're only talking about metadata, we will never encompass all possible units, as the comment above shows.

To elaborate on @fpagnoux & @michelbl's proposition:

Proposition

fpagnoux commented 7 years ago

The format="percent" attribute pairs are replaced by unit="%".

To me there is a big risk of confusion. If I see:

taux:
  description: Bouclier fiscal
  unit: %
  values: 
    2017-01-01:
      value: 0.5

I really understand that the value is 0.5%, while it is 50% !

fpagnoux commented 7 years ago

For rate parameters, alternatives suggestions:

MattiSG commented 7 years ago

add unit: null, or unit: ~ to explicit the fact that they have no unit

As stated above, I think unit should be optional. No unit means no unit :wink:

What about multiplying all percentages by 100 and changing their value on load? Do we have some preprocessing already in place?

MattiSG commented 7 years ago

Okay, let's not get stuck with that part.

I propose with go forward with replacing all format="percent" by unit="/1". This looks a bit silly, but allows us to postpone that specific decision while retaining the information. We'll see later on if we want to drop the information or add some preprocessing to store percentages.

The description should always include something like “Taux de” (“Rate of ”) to be clearer about the content.

benjello commented 7 years ago

There is some information that should not be lost. I thought of ratio_point or ratio_unit but never found a good name ...

michelbl commented 7 years ago

In openfisca-france, some scales switch from french franc to the euro in 2001 (https://github.com/openfisca/openfisca-france/blob/master/openfisca_france/parameters/impot_revenu.xml#L185) while some other switch in 2002 (https://github.com/openfisca/openfisca-france/blob/master/openfisca_france/parameters/prelevements_sociaux.xml#L1767)

Since this issue is complex, it is not not part of this issue and supporting several currencies has never been stated as a goal of openfisca, I prefer unit = currency instead of unit = "€"/unit = "FRF"

michelbl commented 7 years ago

The LinearAverageRateTaxScale class cannot be created from valid XML parameters : https://github.com/openfisca/openfisca-core/blob/master/openfisca_core/legislations.py#L268. It must be created using a custom legislation subtree using modify_legislation_json. It is used in only one case : https://github.com/openfisca/openfisca-france/blob/master/openfisca_france/reforms/landais_piketty_saez.py#L85

Since this feature is not documented and used only in one place, I am willing to either break the landais_piketty_saez reform in openfisca-france, or patch it using standard openfisca tools.