Restructure test_data

ChristianZimpelmann commented 3 years ago

Current and desired situation

Input data for tests in tests/test_data are currently defined in .csv and .ods files.

Desired: a single data format that does not allow formulas but can be labelled.

Proposed implementation

yaml

Considered alternatives

DataFrames in pickle

hmgaudecker commented 2 years ago

Also see discussion at https://gettsim.zulipchat.com/#narrow/stream/309993-Data-to.20be.20fed.20in/topic/test.20data/near/271147372

hmgaudecker commented 2 years ago

In particular, think about a structured way of including metadata on the source of the test data:

Specifically, I would like to be able to document what is the exact source for a test case, which variables and values are given in that source and which variables are manually calculated or assumed to bring these external examples into shape for GETTSIM.

ChristianZimpelmann commented 2 years ago

@LauraGergeleit and I have been experimenting with how we could represent tests in a YAML file.

One test case (for two individuals in the same household) could look as follows:

2017:
-   individuals:
    -   inputs:
            alleinerziehend: false
            alter: 72
            arbeitsl_geld_m: 0
            bewohnt_eigentum_hh: false
            bruttokaltmiete_m_hh: 460.0
            bruttolohn_m: 0
            eink_selbst_m: 0
            eink_st_tu: 0
            elterngeld_m: 0
            grundrentenzeiten: 300
            heizkosten_m_hh: 60.0
            kapital_eink_m: 0
            kind: false
            kindergeld_m_hh: 0
            prv_rente_m: 0
            rentner: true
            schwerbe_ausweis_g: false
            soli_st_tu: 0
            sonstig_eink_m: 0
            sozialv_beitr_m: 0
            staatl_rente_m: 860.6
            unterhaltsvors_m: 0
            vermiet_eink_m: 0
            vermögen_hh: 0
            wohnfläche_hh: 60
        outputs:
            grunds_im_alter_m_hh: 322.0
            regelbedarf_m_grunds_im_alter_vermögens_check_hh: 1256.0
    -   inputs:
            alleinerziehend: false
            alter: 67
            arbeitsl_geld_m: 0
            bewohnt_eigentum_hh: false
            bruttokaltmiete_m_hh: 460.0
            bruttolohn_m: 0
            eink_selbst_m: 0
            eink_st_tu: 0
            elterngeld_m: 0
            grundrentenzeiten: 48
            heizkosten_m_hh: 60.0
            kapital_eink_m: 0
            kind: false
            kindergeld_m_hh: 0
            prv_rente_m: 0
            rentner: true
            schwerbe_ausweis_g: false
            soli_st_tu: 0
            sonstig_eink_m: 0
            sozialv_beitr_m: 0
            staatl_rente_m: 73.4
            unterhaltsvors_m: 0
            vermiet_eink_m: 0
            vermögen_hh: 0
            wohnfläche_hh: 60
        outputs:
            grunds_im_alter_m_hh: 322.0
            regelbedarf_m_grunds_im_alter_vermögens_check_hh: 1256.0
    info:
        columns_from_source:
        - alleinerziehend
        - eink_selbst_m
        - bruttolohn_m
        note: Some space for a note
        source: https://www.bpb.de/politik/innenpolitik/rentenpolitik/289395/leistungshoehe-und-fallbeispiele

The yaml would be structured as follows:

outermost level: a dict of years
next level: a list of test cases per year (each test case is associated with one household)
each test case consists of two dictionaries:
1. input and output data for a list of individuals in that household
2. additional info including which columns are taken from the source, a link to the source, and a note (we should think more deeply about which information is relevant here)
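The nesting described above can be walked with a few lines of Python. This is only a sketch: `test_data` is a hand-written stand-in for what a YAML loader would return for the example file (inputs shortened to two columns per person), and `flatten_cases` is a hypothetical helper, not existing GETTSIM code.

```python
# Hypothetical stand-in for the parsed YAML file (inputs shortened for brevity).
test_data = {
    2017: [
        {
            "individuals": [
                {
                    "inputs": {"alter": 72, "staatl_rente_m": 860.6},
                    "outputs": {"grunds_im_alter_m_hh": 322.0},
                },
                {
                    "inputs": {"alter": 67, "staatl_rente_m": 73.4},
                    "outputs": {"grunds_im_alter_m_hh": 322.0},
                },
            ],
            "info": {
                "note": "Some space for a note",
                "source": "https://www.bpb.de/politik/innenpolitik/rentenpolitik/289395/leistungshoehe-und-fallbeispiele",
            },
        }
    ]
}


def flatten_cases(data):
    """Yield (year, household index, person index, inputs, outputs) per individual."""
    for year, households in data.items():
        for hh_index, household in enumerate(households):
            for p_index, person in enumerate(household["individuals"]):
                yield year, hh_index, p_index, person["inputs"], person["outputs"]


rows = list(flatten_cases(test_data))
for year, hh_index, p_index, inputs, outputs in rows:
    print(year, hh_index, p_index, inputs["alter"], outputs["grunds_im_alter_m_hh"])
```

Each yielded tuple corresponds to one row of the former csv files, which is roughly what a test function would need to build its input table.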

What do you think? If you think it goes in the right direction, I would start a GEP and go more into detail.

Questions I am asking myself at the moment:

The yaml file can become quite long, as the definition of each individual takes 30+ lines. csv-files were more efficient in that sense. But I would still prefer the yaml version.
We usually have several test cases for each source, so information about that source would be repeated several times. It might be beneficial to group tests by source at the outermost level?
Alternative way to represent columns_from_source: split the inputs into inputs_from_source and inputs_implied_by_us (and maybe inputs_from_source_adjusted_by_us)
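The two representations in the last question carry the same information, so one can be derived from the other. A minimal sketch, assuming a hypothetical helper `split_inputs` and a shortened set of input columns:

```python
def split_inputs(inputs, columns_from_source):
    """Split one individual's inputs into source-given and self-assumed columns."""
    unknown = set(columns_from_source) - set(inputs)
    if unknown:
        raise KeyError(f"columns_from_source not found in inputs: {sorted(unknown)}")
    from_source = {k: v for k, v in inputs.items() if k in columns_from_source}
    implied_by_us = {k: v for k, v in inputs.items() if k not in columns_from_source}
    return {"inputs_from_source": from_source, "inputs_implied_by_us": implied_by_us}


# Shortened example; the real test case has 25 input columns.
inputs = {
    "alleinerziehend": False,
    "bruttolohn_m": 0,
    "eink_selbst_m": 0,
    "kind": False,
    "vermögen_hh": 0,
}
columns_from_source = ["alleinerziehend", "eink_selbst_m", "bruttolohn_m"]
split = split_inputs(inputs, columns_from_source)
print(split)
```

Storing the split form directly would avoid keeping a column list in sync with the inputs by hand.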

@mjbloemer @hmgaudecker

hmgaudecker commented 2 years ago

Thanks, great start!

Based on that and some thinking, I believe I'd prefer something like the following:

2017:
  bpb_altenköster:
    inputs_given:
      p_id: [0, 1]
      tu_id: [0, 0]
      hh_id: 0
      alter: [72, 67]
      bruttokaltmiete_m_hh: 460.0
      heizkosten_m_hh: 60.0
      ges_rente_m: [860.6, 4]
      rentner: [true, true]
      schwerbe_ausweis_g: [false, false]
      alleinerziehend: [false, false]
      bruttolohn_m: [0, 0]
      sonstig_eink_m: [0, 0]
      eink_selbst_m: [0, 0]
    inputs_assumed:
      arbeitsl_geld_m: [0, 0]
      bewohnt_eigentum_hh: false
      eink_st_tu: [0, 0]
      elterngeld_m: [0, 0]
      grundrentenzeiten: [300, 48]
      kapital_eink_m: [0, 0]
      kind: [false, false]
      kindergeld_m_hh: 0
      prv_rente_m: [0, 0]
      soli_st_tu: [0, 0]
      sozialv_beitr_m: [0, 0]
      unterhaltsvors_m: [0, 0]
      vermiet_eink_m: [0, 0]
      vermögen_hh: 0
      wohnfläche_hh: 60
    outputs:
      grunds_im_alter_m_hh: 322.0
      regelbedarf_m_grunds_im_alter_vermögens_check_hh: 1256.0
    info:
      note: Beispiel Ehepaar Altenköster. Warmmiete split up in arbitrary fashion.
      source: https://www.bpb.de/politik/innenpolitik/rentenpolitik/289395/leistungshoehe-und-fallbeispiele

That is, one entry per household, keeping what we get from the example separate from what we assume. Valid keys would be either inputs, or inputs_given (+ inputs_assumed), but never both (exclusive or).
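The exclusive-or rule on the keys could be enforced when a test case is read. A sketch under the assumption that a test case has already been parsed into a dict; `collect_inputs` is a hypothetical name:

```python
def collect_inputs(test_case):
    """Return the merged inputs of a test case, enforcing the exclusive-or rule."""
    has_plain = "inputs" in test_case
    has_split = "inputs_given" in test_case or "inputs_assumed" in test_case
    if has_plain == has_split:
        raise ValueError("Use either 'inputs' or 'inputs_given' (+ 'inputs_assumed').")
    if has_plain:
        return dict(test_case["inputs"])
    if "inputs_given" not in test_case:
        raise ValueError("'inputs_assumed' requires 'inputs_given'.")
    given = test_case["inputs_given"]
    assumed = test_case.get("inputs_assumed", {})
    overlap = set(given) & set(assumed)
    if overlap:
        raise ValueError(f"Columns both given and assumed: {sorted(overlap)}")
    return {**given, **assumed}


case = {
    "inputs_given": {"alter": [72, 67]},
    "inputs_assumed": {"kind": [False, False]},
}
merged = collect_inputs(case)
print(merged)
```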

Separating inputs_given and inputs_assumed will not always be fully obvious. E.g., in the above example, it seems clear that these people's only income is the pension (aside -- staatl_rente is confusing me each time again. Please get rid of it everywhere asap. It is just plain wrong. Apologies and thank you!), even though there are no numbers explicitly setting other sources to zero. So there should be some guiding principles there, the above is just divided up in a quick and dirty fashion without thinking much about that particular issue.

Individual-level variables are lists with one entry per household member; household-level variables get a single entry. Will be a mess for 15-person households, of course, but then we could allow for dicts instead of lists and give members names.
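Reading this mixed list/scalar layout means broadcasting the scalars to every household member. A minimal sketch with a hypothetical helper `expand_household`; it covers the list form only, not the dict-with-names variant:

```python
def expand_household(columns):
    """Turn {column: list-or-scalar} into one dict per household member."""
    lengths = {len(value) for value in columns.values() if isinstance(value, list)}
    if len(lengths) > 1:
        raise ValueError(f"Lists of different lengths: {sorted(lengths)}")
    # Assumption: a household described only by scalars has a single member.
    n_members = lengths.pop() if lengths else 1
    return [
        {
            name: (value[i] if isinstance(value, list) else value)
            for name, value in columns.items()
        }
        for i in range(n_members)
    ]


columns = {
    "p_id": [0, 1],
    "hh_id": 0,
    "alter": [72, 67],
    "bruttokaltmiete_m_hh": 460.0,
}
members = expand_household(columns)
for member in members:
    print(member)
```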

I'd like a clear key for a testcase instead of lists. We can parametrize tests s.t. they are shown during execution, so it will be easy to see which ones are failing.
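With named test cases, the keys can double as test ids. The sketch below only builds the values and ids; `build_params` is a hypothetical helper, and the pytest usage is shown as a comment because pytest is not imported here:

```python
def build_params(test_data):
    """Return (argvalues, ids) from a {year: {key: test_case}} mapping."""
    argvalues, ids = [], []
    for year in sorted(test_data):
        for key in sorted(test_data[year]):
            argvalues.append((year, test_data[year][key]))
            ids.append(f"{year}-{key}")
    return argvalues, ids


# Shortened stand-in for the parsed YAML file.
test_data = {
    2017: {"bpb_altenköster": {"outputs": {"grunds_im_alter_m_hh": 322.0}}},
}
argvalues, ids = build_params(test_data)
print(ids)

# In a test module this would be used roughly as follows:
#
#     @pytest.mark.parametrize(("year", "test_case"), argvalues, ids=ids)
#     def test_grunds_im_alter(year, test_case):
#         ...
```

A failing case would then be reported under its id instead of a list index.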

Tbh, I am not even sure about the year as the outermost key. We could (should?) have one file per test from a given source, maybe have subdirectories per year? Not sure about the ideal structure there.

Info: The important bit is that it must be trivial to retrieve the example from it. With the first example, finding the case on the website took too much time.

In any case, we should think mostly about how we want to add new test cases in the future, less about the structure of current test cases. That will be a one-time effort to convert them.

Thanks!!!

ChristianZimpelmann commented 2 years ago

I like your changes. And I agree that grouping by year is actually not necessary. It stems from the way tests are run at the moment, but it makes sense to change it.

We could then include a new key jahr as follows:


bpb_altenköster:
  jahr: 2017
  inputs_given:
    p_id: [0, 1]
    tu_id: [0, 0]
    ...

hmgaudecker commented 2 years ago

(aside: triple ` [language] gives you syntax highlighting)

I think we do want to group by year (tests don't make sense without reference to that, right?) but probably even at the directory level? What would the files be called ideally? Given my limited understanding of the tests, I would favor

test_data/[arbeitsl_geld_2-eink_st-...]/[year]/[source].yaml

So the above would be:

test_data/grunds_im_alter/2017/bpb.yaml

and the key of the test would be altenköster?
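With this layout, module, year and source can be read off the file path instead of being stored inside the file. A sketch using only `pathlib`; `describe_test_file` is a hypothetical helper and the `test_data` root is taken from the paths above:

```python
from pathlib import PurePosixPath


def describe_test_file(path, root="test_data"):
    """Extract module, year and source from test_data/[module]/[year]/[source].yaml."""
    relative = PurePosixPath(path).relative_to(root)
    if len(relative.parts) != 3 or relative.suffix != ".yaml":
        raise ValueError(f"Unexpected test data path: {path}")
    module, year, _ = relative.parts
    return {"module": module, "year": int(year), "source": relative.stem}


meta = describe_test_file("test_data/grunds_im_alter/2017/bpb.yaml")
test_id = f"{meta['module']}-{meta['year']}-{meta['source']}-altenköster"
print(meta)
print(test_id)
```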

ChristianZimpelmann commented 2 years ago

That would also be an option.

I think the question is whether tests in a new year are more often:

based on new sources/test_cases: Then a new directory makes a lot of sense
based on an "old" test for which we modify the output ourselves: Then it would be more convenient to have it in the same file

I wouldn't expect the Rentenversicherung to produce new examples for the calculation of Grundrente if after a few years the parameters change slightly. It is different for transfers for which we find a calculator online.

I am leaning towards the "years in directory" solution, but not fully sure yet.

hmgaudecker commented 2 years ago

based on an "old" test for which we modify the output ourselves: Then it would be more convenient to have it in the same file

I would say no: Much easier to copy/paste and then diff two near-identical files than looking at two portions of the same file. Duplication of most things will be there no matter what.
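Diffing two near-identical files can be illustrated with `difflib` from the standard library. The file contents are shortened, and the 2018 path and value are made-up placeholders purely to produce a diff, not real test data:

```python
import difflib

file_2017 = """altenköster:
  outputs:
    grunds_im_alter_m_hh: 322.0
"""

# Placeholder copy for a later year; 999.0 is NOT a real result.
file_2018 = """altenköster:
  outputs:
    grunds_im_alter_m_hh: 999.0
"""

diff = list(
    difflib.unified_diff(
        file_2017.splitlines(),
        file_2018.splitlines(),
        fromfile="test_data/grunds_im_alter/2017/bpb.yaml",
        tofile="test_data/grunds_im_alter/2018/bpb.yaml",
        lineterm="",
    )
)
print("\n".join(diff))

# Count removed lines, ignoring the '---' file header.
changed = [line for line in diff if line.startswith("-") and not line.startswith("---")]
```

Only the line that differs between the years shows up as changed, which is the convenience argued for above.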

ChristianZimpelmann commented 2 years ago

@LauraGergeleit, would be great if you can start rewriting test_eink_st and the respective test data as proposed above. Let us know if anything is unclear.

Then we can all have a look and see whether we would like to improve the template in any way.

ChristianZimpelmann commented 2 years ago

While working on this, we can also address #336 .

hmgaudecker commented 2 years ago

Just looking at it again, I think we can give this a shot as described!

Small adjustments to what I would imagine [sub_dir]/2017/bpb.yaml to look like:

altenköster:
  inputs:
    provided:
      p_id: [0, 1]
      tu_id: [0, 0]
      hh_id: 0
      alter: [72, 67]
      bruttokaltmiete_m_hh: 460.0
      heizkosten_m_hh: 60.0
      ges_rente_m: [860.6, 4]
      rentner: [true, true]
      schwerbe_ausweis_g: [false, false]
      alleinerziehend: [false, false]
      bruttolohn_m: [0, 0]
      sonstig_eink_m: [0, 0]
      eink_selbst_m: [0, 0]
    assumed:
      arbeitsl_geld_m: [0, 0]
      bewohnt_eigentum_hh: false
      eink_st_tu: [0, 0]
      elterngeld_m: [0, 0]
      grundrentenzeiten: [300, 48]
      kapital_eink_m: [0, 0]
      kind: [false, false]
      kindergeld_m_hh: 0
      prv_rente_m: [0, 0]
      soli_st_tu: [0, 0]
      sozialv_beitr_m: [0, 0]
      unterhaltsvors_m: [0, 0]
      vermiet_eink_m: [0, 0]
      vermögen_hh: 0
      wohnfläche_hh: 60
  outputs:
    grunds_im_alter_m_hh: 322.0
    regelbedarf_m_grunds_im_alter_vermögens_check_hh: 1256.0
  info:
    note: Beispiel Ehepaar Altenköster. Warmmiete split up in arbitrary fashion.
    source: https://www.bpb.de/politik/innenpolitik/rentenpolitik/289395/leistungshoehe-und-fallbeispiele

That is, avoid the need to parse inputs_x, rather have a nested dict there.
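With the nested dict, merging the two groups needs no string parsing of key names. A sketch with a hypothetical helper `merge_inputs` that also records where each column came from:

```python
def merge_inputs(inputs):
    """Merge inputs['provided'] and inputs['assumed'], remembering each column's origin."""
    unexpected = set(inputs) - {"provided", "assumed"}
    if unexpected:
        raise ValueError(f"Unexpected keys under 'inputs': {sorted(unexpected)}")
    merged, origin = {}, {}
    for kind in ("provided", "assumed"):
        for column, value in inputs.get(kind, {}).items():
            if column in merged:
                raise ValueError(f"{column!r} is both provided and assumed")
            merged[column] = value
            origin[column] = kind
    return merged, origin


# Shortened version of the altenköster example.
inputs = {
    "provided": {"p_id": [0, 1], "alter": [72, 67]},
    "assumed": {"grundrentenzeiten": [300, 48], "vermögen_hh": 0},
}
merged, origin = merge_inputs(inputs)
print(merged)
print(origin)
```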

ChristianZimpelmann commented 2 years ago

I haven't understood the new proposal yet.

Am I right that everything from p_id: [0, 1] on should be indented one level less?
Why not assumed instead of inputs_assumed?
The outer-level key of the test is probably important to add more test cases in the same file (altenköster)

hmgaudecker commented 2 years ago

Sorry, I had messed that one up, corrected it now. All valid points, should be irrelevant now.

LauraGergeleit commented 1 year ago

I had another look at how to convert the test data from csv to yaml-files and included your remarks.

This is the list I created for the example of arbeitsl_geld for the cases in the year 2015:

[{'info': {'source': 'none',
   'note': 'old test data - anwartschaftszeit, arbeitssuchend, m_durchg_alg1_bezug and soz_vers_pflicht_5j were added manually'},
  'inputs': {'provided': {'hh_id': [5, 5],
    'tu_id': [5, 5],
    'p_id': [7, 8],
    'bruttolohn_vorj_m': [7000, 0],
    'wohnort_ost': [True, True],
    'kind': [False, True],
    'arbeitsstunden_w': [0, 0],
    'anz_kinder_tu': [1, 0],
    'alter': [30, 5],
    'geburtsjahr': [1985, 1985],
    'jahr': [2015, 2015],
    'eligible': [True, False],
    'alg_wage': [5200, 0],
    'alg_ssc': [1092.0, 0.0],
    'alg_tax': [1460.56, 0.0],
    'alg_soli': [80.33, 0.0],
    'alg_entgelt': [2567.11, 0.0]},
   'assumed': {'anwartschaftszeit': [True, False],
    'arbeitssuchend': [True, False],
    'm_durchg_alg1_bezug': [0, 0],
    'soz_vers_pflicht_5j': [12, 0]}},
  'outputs': {'outputs': {'arbeitsl_geld_m': [1719.96, 0.0]}}}]

Converting the list into a yaml-file gives the following output. The inputs in square brackets are transformed into lists with bullet points. Is there a way to keep the syntax with the square brackets?

-   info:
        source: none
        note: old test data - anwartschaftszeit, arbeitssuchend, m_durchg_alg1_bezug and soz_vers_pflicht_5j
            were added manually
    inputs:
        provided:
            hh_id:
            - 5
            - 5
            tu_id:
            - 5
            - 5
            p_id:
            - 7
            - 8
            bruttolohn_vorj_m:
            - 7000
            - 0
            wohnort_ost:
            - true
            - true
            kind:
            - false
            - true
            arbeitsstunden_w:
            - 0
            - 0
            anz_kinder_tu:
            - 1
            - 0
            alter:
            - 30
            - 5
            geburtsjahr:
            - 1985
            - 1985
            jahr:
            - 2015
            - 2015
            eligible:
            - true
            - false
            alg_wage:
            - 5200
            - 0
            alg_ssc:
            - 1092.0
            - 0.0
            alg_tax:
            - 1460.56
            - 0.0
            alg_soli:
            - 80.33
            - 0.0
            alg_entgelt:
            - 2567.11
            - 0.0
        assumed:
            anwartschaftszeit:
            - true
            - false
            arbeitssuchend:
            - true
            - false
            m_durchg_alg1_bezug:
            - 0
            - 0
            soz_vers_pflicht_5j:
            - 12
            - 0
    outputs:
        outputs:
            arbeitsl_geld_m:
            - 1719.96
            - 0.0
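The bullet-point form is YAML's block style and the bracket form is its flow style; both parse to the same lists. The sketch below is a hand-rolled, stdlib-only emitter that keeps dicts in block style and writes lists of scalars in brackets. It is not the project's code and does not quote strings that would need quoting. With PyYAML, passing `default_flow_style=None` to `yaml.dump` should give a similar layout, but that is worth verifying against the installed version.

```python
def to_yaml_scalar(value):
    """Format a scalar the way YAML writes it (booleans in lowercase)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def dump_with_flow_lists(data, indent=0):
    """Return YAML lines: nested dicts in block style, lists of scalars as [a, b]."""
    lines = []
    pad = " " * indent
    for key, value in data.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(dump_with_flow_lists(value, indent + 2))
        elif isinstance(value, list):
            items = ", ".join(to_yaml_scalar(item) for item in value)
            lines.append(f"{pad}{key}: [{items}]")
        else:
            lines.append(f"{pad}{key}: {to_yaml_scalar(value)}")
    return lines


# Shortened version of the arbeitsl_geld case above.
case = {
    "inputs": {
        "provided": {"hh_id": [5, 5], "kind": [False, True], "alg_ssc": [1092.0, 0.0]},
        "assumed": {"soz_vers_pflicht_5j": [12, 0]},
    },
    "outputs": {"arbeitsl_geld_m": [1719.96, 0.0]},
}
lines = dump_with_flow_lists(case)
print("\n".join(lines))
```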

hmgaudecker commented 1 year ago

Converting the list into a yaml-file gives the following output. The inputs in square brackets are transformed into lists with bullet points. Is there a way to keep the syntax with the square brackets?

We explicitly set the yaml-style to not have them, don't worry.

I like that much better. We should then also parametrize the test cases differently, so that each individual / tu / hh becomes a separate case. But that can be a second PR.
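Splitting a household-level test case into one case per individual could look roughly like this. The sketch assumes that every column is a list with one entry per person, as in the converted arbeitsl_geld data, and that outputs sit directly under `outputs`; `split_by_individual` is a hypothetical helper:

```python
def split_by_individual(test_case):
    """Yield (p_id, inputs, outputs) for every person in a household-level test case."""
    inputs = {
        **test_case["inputs"]["provided"],
        **test_case["inputs"].get("assumed", {}),
    }
    outputs = test_case["outputs"]
    for i, p_id in enumerate(inputs["p_id"]):
        yield (
            p_id,
            {column: values[i] for column, values in inputs.items()},
            {column: values[i] for column, values in outputs.items()},
        )


# Shortened version of the 2015 arbeitsl_geld case.
test_case = {
    "inputs": {
        "provided": {"p_id": [7, 8], "alter": [30, 5], "kind": [False, True]},
        "assumed": {"arbeitssuchend": [True, False]},
    },
    "outputs": {"arbeitsl_geld_m": [1719.96, 0.0]},
}
per_person = list(split_by_individual(test_case))
for p_id, person_inputs, person_outputs in per_person:
    print(p_id, person_inputs, person_outputs)
```

The same idea would apply per tu or hh by grouping on `tu_id` or `hh_id` instead of `p_id`.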

@lars-reimann, maybe you can have a look into this together with Laura, too?

lars-reimann commented 1 year ago

We should then also parametrize the test cases differently, so that each individual / tu / hh becomes a separate case. But that can be a second PR.

@lars-reimann, maybe you can have a look into this together with Laura, too?

I'll check it out.

lars-reimann commented 1 year ago

Closed by #553.

iza-institute-of-labor-economics / gettsim

Restructure test_data #282
