Closed ChristianZimpelmann closed 1 year ago
In particular, think about a structured way of including metadata on the source of the test data:
Specifically, I would like to be able to document what is the exact source for a test case, which variables and values are given in that source and which variables are manually calculated or assumed to bring these external examples into shape for GETTSIM.
@LauraGergeleit and I experimented with how we could represent tests in a yaml file.
One test case (for two individuals in the same household) could look as follows:
2017:
  - individuals:
      - inputs:
          alleinerziehend: false
          alter: 72
          arbeitsl_geld_m: 0
          bewohnt_eigentum_hh: false
          bruttokaltmiete_m_hh: 460.0
          bruttolohn_m: 0
          eink_selbst_m: 0
          eink_st_tu: 0
          elterngeld_m: 0
          grundrentenzeiten: 300
          heizkosten_m_hh: 60.0
          kapital_eink_m: 0
          kind: false
          kindergeld_m_hh: 0
          prv_rente_m: 0
          rentner: true
          schwerbe_ausweis_g: false
          soli_st_tu: 0
          sonstig_eink_m: 0
          sozialv_beitr_m: 0
          staatl_rente_m: 860.6
          unterhaltsvors_m: 0
          vermiet_eink_m: 0
          vermögen_hh: 0
          wohnfläche_hh: 60
        outputs:
          grunds_im_alter_m_hh: 322.0
          regelbedarf_m_grunds_im_alter_vermögens_check_hh: 1256.0
      - inputs:
          alleinerziehend: false
          alter: 67
          arbeitsl_geld_m: 0
          bewohnt_eigentum_hh: false
          bruttokaltmiete_m_hh: 460.0
          bruttolohn_m: 0
          eink_selbst_m: 0
          eink_st_tu: 0
          elterngeld_m: 0
          grundrentenzeiten: 48
          heizkosten_m_hh: 60.0
          kapital_eink_m: 0
          kind: false
          kindergeld_m_hh: 0
          prv_rente_m: 0
          rentner: true
          schwerbe_ausweis_g: false
          soli_st_tu: 0
          sonstig_eink_m: 0
          sozialv_beitr_m: 0
          staatl_rente_m: 73.4
          unterhaltsvors_m: 0
          vermiet_eink_m: 0
          vermögen_hh: 0
          wohnfläche_hh: 60
        outputs:
          grunds_im_alter_m_hh: 322.0
          regelbedarf_m_grunds_im_alter_vermögens_check_hh: 1256.0
    info:
      columns_from_source:
        - alleinerziehend
        - eink_selbst_m
        - bruttolohn_m
      note: Some space for a note
      source: https://www.bpb.de/politik/innenpolitik/rentenpolitik/289395/leistungshoehe-und-fallbeispiele
The yaml would be structured as follows:
info: including which columns are taken from source, a link to the source, and a note (we should think more deeply about which information is relevant here)
What do you think? If you think it goes in the right direction, I would start a GEP and go into more detail.
Questions I am asking myself at the moment:
The yaml file can become quite long, as the definition of each individual takes 30+ lines. csv files were more efficient in that sense. But I would still prefer the yaml version.
Alternative way to represent columns_from_source: split up the list inputs into inputs_from_source, inputs_implied_by_us (and maybe inputs_from_source_adjusted_by_us)
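The alternative in the last question can be derived mechanically from the first layout. A minimal sketch, assuming one individual's inputs have already been parsed into a plain Python dict; the function name split_inputs is hypothetical, and the keys inputs_from_source and inputs_implied_by_us are only the names floated above, not an agreed format:

```python
def split_inputs(inputs, columns_from_source):
    """Partition one individual's inputs by whether the source provides them."""
    unknown = set(columns_from_source) - set(inputs)
    if unknown:
        raise KeyError(f"columns_from_source lists unknown inputs: {sorted(unknown)}")
    # Columns named in the source keep their values; everything else was set by us.
    from_source = {k: v for k, v in inputs.items() if k in columns_from_source}
    implied = {k: v for k, v in inputs.items() if k not in columns_from_source}
    return {"inputs_from_source": from_source, "inputs_implied_by_us": implied}


# Shortened excerpt of the first individual in the example above.
inputs = {"alleinerziehend": False, "alter": 72, "bruttolohn_m": 0, "eink_selbst_m": 0}
columns_from_source = ["alleinerziehend", "eink_selbst_m", "bruttolohn_m"]

split = split_inputs(inputs, columns_from_source)
print(split["inputs_implied_by_us"])  # {'alter': 72}
```

Both representations carry the same information, so the choice is mostly about which one is easier to read and to write by hand.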
@mjbloemer @hmgaudecker
Thanks, great start!
Based on that and some thinking, I believe I'd prefer something like the following:
2017:
  bpb_altenköster:
    inputs_given:
      p_id: [0, 1]
      tu_id: [0, 0]
      hh_id: 0
      alter: [72, 67]
      bruttokaltmiete_m_hh: 460.0
      heizkosten_m_hh: 60.0
      ges_rente_m: [860.6, 4]
      rentner: [true, true]
      schwerbe_ausweis_g: [false, false]
      alleinerziehend: [false, false]
      bruttolohn_m: [0, 0]
      sonstig_eink_m: [0, 0]
      eink_selbst_m: [0, 0]
    inputs_assumed:
      arbeitsl_geld_m: [0, 0]
      bewohnt_eigentum_hh: false
      eink_st_tu: [0, 0]
      elterngeld_m: [0, 0]
      grundrentenzeiten: [300, 48]
      kapital_eink_m: [0, 0]
      kind: [false, false]
      kindergeld_m_hh: 0
      prv_rente_m: [0, 0]
      soli_st_tu: [0, 0]
      sozialv_beitr_m: [0, 0]
      unterhaltsvors_m: [0, 0]
      vermiet_eink_m: [0, 0]
      vermögen_hh: 0
      wohnfläche_hh: 60
    outputs:
      grunds_im_alter_m_hh: 322.0
      regelbedarf_m_grunds_im_alter_vermögens_check_hh: 1256.0
    info:
      note: Beispiel Ehepaar Altenköster. Warmmiete split up in arbitrary fashion.
      source: https://www.bpb.de/politik/innenpolitik/rentenpolitik/289395/leistungshoehe-und-fallbeispiele
That is, one entry per household, and separate what we get from the example from what we assume (valid keys there would be inputs or inputs_given (+ inputs_assumed), exclusive or).
Separating inputs_given and inputs_assumed will not always be fully obvious. E.g., in the above example, it seems clear that these people's only income is the pension (aside: staatl_rente confuses me every time. Please get rid of it everywhere asap. It is just plain wrong. Apologies and thank you!), even though there are no numbers explicitly setting other sources to zero. So there should be some guiding principles there; the above is just divided up in a quick and dirty fashion without thinking much about that particular issue.
Lists for members of a household, household-level variables get a single entry. Will be a mess for 15-person households, of course, but then we could allow for dicts instead of lists and give members names.
I'd like a clear key for a testcase instead of lists. We can parametrize tests s.t. they are shown during execution, so it will be easy to see which ones are failing.
Tbh, I am not even sure about the year as the outermost key. We could have one file per test from a given source, maybe have subdirectories per year? Not sure about the ideal structure there.
Info: The important bit is that it is trivial to retrieve the example from it. Visiting the website based on the first example, it took too much time.
In any case, we should think mostly about how we want to add new test cases in the future, less about the structure of current test cases. That will be a one-time effort to convert them.
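To make the two rules above concrete (either inputs or inputs_given plus optional inputs_assumed; lists for household members, single entries for household-level variables), here is a minimal sketch. It assumes the test case has already been parsed into a Python dict and that p_id is always given as a list; the function names collect_inputs and expand_household are hypothetical and not part of GETTSIM:

```python
def collect_inputs(case):
    """Return the merged inputs of a test case, enforcing the exclusive-or rule."""
    has_plain = "inputs" in case
    has_split = "inputs_given" in case
    if has_plain == has_split:
        raise ValueError("use either 'inputs' or 'inputs_given' (+ 'inputs_assumed')")
    if has_plain:
        if "inputs_assumed" in case:
            raise ValueError("'inputs_assumed' requires 'inputs_given', not 'inputs'")
        return dict(case["inputs"])
    return {**case["inputs_given"], **case.get("inputs_assumed", {})}


def expand_household(inputs):
    """Turn list-or-scalar inputs into one dict per household member."""
    n_members = len(inputs["p_id"])  # assumption: p_id is a list with one entry per member
    rows = [{} for _ in range(n_members)]
    for name, value in inputs.items():
        # A scalar is a household-level value and is repeated for every member.
        values = value if isinstance(value, list) else [value] * n_members
        if len(values) != n_members:
            raise ValueError(f"{name!r} has {len(values)} entries, expected {n_members}")
        for row, member_value in zip(rows, values):
            row[name] = member_value
    return rows


# Shortened version of the household above.
case = {
    "inputs_given": {"p_id": [0, 1], "hh_id": 0, "alter": [72, 67]},
    "inputs_assumed": {"grundrentenzeiten": [300, 48], "vermögen_hh": 0},
}
rows = expand_household(collect_inputs(case))
print(rows[1])
```

Allowing dicts with member names instead of lists, as suggested for large households, would only change the branch that builds values.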
Thanks!!!
I like your changes. And I agree that grouping by year is actually not necessary. It stems from the way tests are run at the moment, but it makes sense to change it.
We could then include a new key jahr as follows:
bpb_altenköster:
  jahr: 2017
  inputs_given:
    p_id: [0, 1]
    tu_id: [0, 0]
    ...
(aside: triple ` [language] gives you syntax highlighting)
I think we do want to group by year (tests don't make sense without reference to that, right?) but probably even at the directory level? What would the files be called ideally? Given my limited understanding of the tests, I would favor
test_data/[arbeitsl_geld_2-eink_st-...]/[year]/[source].yaml
So the above would be:
test_data/grunds_im_alter/2017/bpb.yaml
and the key of the test altenköster?
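With that layout, the year, the source, and the test key can all be recovered from the path, which also gives readable IDs when parametrizing tests. A small sketch using only the standard library; the helper name make_test_id and the ID format are assumptions for illustration:

```python
from pathlib import PurePosixPath


def make_test_id(path, case_key):
    """Build a readable ID from test_data/<module>/<year>/<source>.yaml plus the case key."""
    module, year, filename = PurePosixPath(path).parts[-3:]
    source = PurePosixPath(filename).stem
    # int() fails loudly if the directory name is not a year.
    return f"{module}-{int(year)}-{source}-{case_key}"


print(make_test_id("test_data/grunds_im_alter/2017/bpb.yaml", "altenköster"))
# grunds_im_alter-2017-bpb-altenköster
```

A string like this could be passed as the ID of a parametrized test case, so a failing case points straight to its file and key.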
That would also be an option.
I think the question is whether tests in a new year are more often:
I wouldn't expect the Rentenversicherung to produce new examples for the calculation of Grundrente if after a few years the parameters change slightly. It is different for transfers for which we find a calculator online.
I am leaning to the "years in directory" solution, but not fully sure yet.
based on an "old" test for which we modify the output ourselves: Then it would be more convenient to have it in the same file
I would say no: Much easier to copy/paste and then diff two near-identical files than looking at two portions of the same file. Duplication of most things will be there no matter what.
@LauraGergeleit, would be great if you can start rewriting test_eink_st and the respective test data as proposed above. Let us know if anything is unclear.
Then we can all have a look and see whether we would like to improve the template in any way.
While working on this, we can also address #336.
Just looking at it again, I think we can give this a shot as described!
Small adjustments to what I would imagine [sub_dir]/2017/bpb.yaml to look like:
altenköster:
  inputs:
    provided:
      p_id: [0, 1]
      tu_id: [0, 0]
      hh_id: 0
      alter: [72, 67]
      bruttokaltmiete_m_hh: 460.0
      heizkosten_m_hh: 60.0
      ges_rente_m: [860.6, 4]
      rentner: [true, true]
      schwerbe_ausweis_g: [false, false]
      alleinerziehend: [false, false]
      bruttolohn_m: [0, 0]
      sonstig_eink_m: [0, 0]
      eink_selbst_m: [0, 0]
    assumed:
      arbeitsl_geld_m: [0, 0]
      bewohnt_eigentum_hh: false
      eink_st_tu: [0, 0]
      elterngeld_m: [0, 0]
      grundrentenzeiten: [300, 48]
      kapital_eink_m: [0, 0]
      kind: [false, false]
      kindergeld_m_hh: 0
      prv_rente_m: [0, 0]
      soli_st_tu: [0, 0]
      sozialv_beitr_m: [0, 0]
      unterhaltsvors_m: [0, 0]
      vermiet_eink_m: [0, 0]
      vermögen_hh: 0
      wohnfläche_hh: 60
  outputs:
    grunds_im_alter_m_hh: 322.0
    regelbedarf_m_grunds_im_alter_vermögens_check_hh: 1256.0
  info:
    note: Beispiel Ehepaar Altenköster. Warmmiete split up in arbitrary fashion.
    source: https://www.bpb.de/politik/innenpolitik/rentenpolitik/289395/leistungshoehe-und-fallbeispiele
That is, avoid the need to parse inputs_x, rather have a nested dict there.
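With the nested form, a test runner can flatten the two groups without parsing key names and still keep track of where each column came from. A minimal sketch, assuming the inputs block has been parsed into a Python dict; merge_inputs is a hypothetical helper, not existing GETTSIM code:

```python
def merge_inputs(inputs):
    """Flatten {'provided': ..., 'assumed': ...} and remember each column's origin."""
    merged, origin = {}, {}
    for group in ("provided", "assumed"):
        for name, value in inputs.get(group, {}).items():
            if name in merged:
                raise ValueError(f"{name!r} appears in both 'provided' and 'assumed'")
            merged[name] = value
            origin[name] = group
    return merged, origin


# Shortened version of the altenköster inputs.
inputs = {
    "provided": {"p_id": [0, 1], "alter": [72, 67]},
    "assumed": {"kind": [False, False], "vermögen_hh": 0},
}
merged, origin = merge_inputs(inputs)
print(origin["vermögen_hh"])  # assumed
```

Rejecting a column that shows up in both groups keeps the provenance unambiguous, which is the point of separating the groups in the first place.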
I haven't understood the new proposal yet.
Everything from p_id: [0, 1] on should be indented one level less?
assumed instead of inputs_assumed?
altenköster)
Sorry, I had messed that one up, corrected it now. All valid points, should be irrelevant now.
I had another look at how to convert the test data from csv to yaml files and included your remarks.
This is the list I created for the example of arbeitsl_geld for the cases in the year 2015:
[{'info': {'source': 'none',
'note': 'old test data - anwartschaftszeit, arbeitssuchend, m_durchg_alg1_bezug and soz_vers_pflicht_5j were added manually'},
'inputs': {'provided': {'hh_id': [5, 5],
'tu_id': [5, 5],
'p_id': [7, 8],
'bruttolohn_vorj_m': [7000, 0],
'wohnort_ost': [True, True],
'kind': [False, True],
'arbeitsstunden_w': [0, 0],
'anz_kinder_tu': [1, 0],
'alter': [30, 5],
'geburtsjahr': [1985, 1985],
'jahr': [2015, 2015],
'eligible': [True, False],
'alg_wage': [5200, 0],
'alg_ssc': [1092.0, 0.0],
'alg_tax': [1460.56, 0.0],
'alg_soli': [80.33, 0.0],
'alg_entgelt': [2567.11, 0.0]},
'assumed': {'anwartschaftszeit': [True, False],
'arbeitssuchend': [True, False],
'm_durchg_alg1_bezug': [0, 0],
'soz_vers_pflicht_5j': [12, 0]}},
'outputs': {'outputs': {'arbeitsl_geld_m': [1719.96, 0.0]}}}]
Converting the list into a yaml file gives the following output. The inputs in square brackets are transformed into lists with bullet points. Is there a way to keep the syntax with the square brackets?
- info:
    source: none
    note: old test data - anwartschaftszeit, arbeitssuchend, m_durchg_alg1_bezug and soz_vers_pflicht_5j
      were added manually
  inputs:
    provided:
      hh_id:
      - 5
      - 5
      tu_id:
      - 5
      - 5
      p_id:
      - 7
      - 8
      bruttolohn_vorj_m:
      - 7000
      - 0
      wohnort_ost:
      - true
      - true
      kind:
      - false
      - true
      arbeitsstunden_w:
      - 0
      - 0
      anz_kinder_tu:
      - 1
      - 0
      alter:
      - 30
      - 5
      geburtsjahr:
      - 1985
      - 1985
      jahr:
      - 2015
      - 2015
      eligible:
      - true
      - false
      alg_wage:
      - 5200
      - 0
      alg_ssc:
      - 1092.0
      - 0.0
      alg_tax:
      - 1460.56
      - 0.0
      alg_soli:
      - 80.33
      - 0.0
      alg_entgelt:
      - 2567.11
      - 0.0
    assumed:
      anwartschaftszeit:
      - true
      - false
      arbeitssuchend:
      - true
      - false
      m_durchg_alg1_bezug:
      - 0
      - 0
      soz_vers_pflicht_5j:
      - 12
      - 0
  outputs:
    outputs:
      arbeitsl_geld_m:
      - 1719.96
      - 0.0
We explicitly set the yaml-style to not have them, don't worry.
I like that much better. We should then also parametrize the test cases differently, so that each individual / tu / hh becomes a separate case. But that can be a second PR.
@lars-reimann, maybe you can have a look into this together with Laura, too?
I'll check it out.
Closed by #553.
Current and desired situation
Input data for tests in tests/test_data are currently defined in .csv and .ods files.
Only one data format, which does not allow for formulas, but can be labelled
Proposed implementation
yaml
Considered alternatives
DataFrames in pickle