Datatype of params

MaxBlesch commented 4 years ago

We need to discuss how to move away from excel sheets.

Eric-Sommer commented 4 years ago

what exactly bothers you?

MaxBlesch commented 4 years ago

The thing with excel sheets is the conflict of bool variables if you open them in other editors. If we want to have community code that is editable by a large community, we need something like csv. We haven't agreed on one format or structure yet. Will keep this updated as soon as we have.

Eric-Sommer commented 4 years ago

Right, I remember. The benefit of spreadsheets is their ability to work with formulas. Our tests are non-trivial. One could stick to 0/1 variables altogether and possibly reformat them within the test routine.
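A minimal sketch of the "stick to 0/1 and reformat within the test routine" idea, assuming the test data arrives as rows of strings and the 0/1 columns are known in advance. The column names below are made up for illustration.

```python
# Hypothetical names of columns that are stored as 0/1 in the test files.
BOOL_COLUMNS = {"child", "east"}


def restore_bools(rows, bool_columns=BOOL_COLUMNS):
    """Return copies of `rows` with 0/1 entries in `bool_columns` turned into bools."""
    converted = []
    for row in rows:
        new_row = dict(row)
        for col in bool_columns:
            if col in new_row:
                value = str(new_row[col]).strip()
                if value not in ("0", "1"):
                    raise ValueError(f"{col!r} must be 0 or 1, got {value!r}")
                new_row[col] = value == "1"
        converted.append(new_row)
    return converted


rows = [{"hid": "1", "child": "0", "east": "1"}]
print(restore_bools(rows))
```

Raising on anything other than 0 or 1 would surface cells that an editor silently rewrote as TRUE/FALSE.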

janosg commented 4 years ago

Another benefit of not having spreadsheets is version control. I think it was important to have the formulae while debugging the test-cases, but now it's just important to have the correct numbers. If it's about documentation, then the formulae should go in docstrings or doctests.

Eric-Sommer commented 4 years ago

What bothers me more right now is that my LibreOffice crashes every time I want to save changes to the param file. What about switching to ods format all the way and using pandas-ods-reader? Versioning could be achieved by converting spreadsheets to csv via script.

mjbloemer commented 4 years ago

A few notes on this matter from my perspective as the maintainer of the Stata based microsimulation model (forked from the ZEW fork of izamod two years ago) at ifo.

1. At least in the context of the Stata version of the taxtransfer model, I fully agree that moving away from binary xls files is/was beneficial.
2. It might be good to move a step further and get rid of the year paradigm.

On 1: At ifo we got rid of the binary param.xls a while ago, replacing it with csv files for every year. While this change seemed subtle at first, it made a huge difference for our workflow, especially with version control and multiple developers. The binary xls file was too error-prone. We also briefly tested one big csv file (like it was in the early days of izamod btw) as well as text based xlsx and LibreOffice files and thought about xml or json based databases for the parameters.

For us, this change was needed to allow less experienced developers (research assistants) to submit meaningful merge requests that can easily be reviewed. Furthermore, it allows us to easily diff two or more years.

The csv files for each year have three columns param,value,note where note is filled if there is a change. Example for one line of the rs2018.csv (it is much more readable in a proper text editor with csv "syntax" highlighting): rs_hhvor,416,"RBSFV 2018, V. v. 08.11.2017 BGBl. I S. 3767 (Nr. 73). Geltung ab 01.01.2018."

We also keep a meta file with columns param,group,shortname,longname,pnote. Example for one line: a2an1,"ALG II","EK-Anteil anrechnungsfrei Intervall 1","Einkommensanteil, der anrechnungsfrei bleibt (Intervall: [a2grf, a2eg1]) § 11b (4). Nr. 1 SGB II. Wirknorm: § 11b III Nr. 1 SGB II. Seit 2005."
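To illustrate why the per-year layout makes it easy to diff two years, here is a small Python sketch. It assumes the param,value,note layout described above; the two inline strings stand in for rs2017.csv and rs2018.csv and contain only one row each, and the function names are made up.

```python
import csv
import io


def read_params(text):
    """Parse csv text with columns param,value,note into {param: value}."""
    return {row["param"]: row["value"] for row in csv.DictReader(io.StringIO(text))}


def diff_years(old, new):
    """Return {param: (old_value, new_value)} for every parameter that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed


# Inline stand-ins for rs2017.csv and rs2018.csv (one illustrative row each).
rs2017 = 'param,value,note\nrs_hhvor,409,"BGBl. I S. 3159."\n'
rs2018 = 'param,value,note\nrs_hhvor,416,"RBSFV 2018, V. v. 08.11.2017 BGBl. I S. 3767 (Nr. 73)."\n'

print(diff_years(read_params(rs2017), read_params(rs2018)))
```

A parameter that exists in only one of the two years shows up with None on the other side.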

We also have a paraminfo.do script that replicates the old param.xls on demand and also generates a full fledged latex based parameter documentation with nice graphs and annotations based on the information in rsyyyy.csv and meta.csv. Screenshot of one page: [image: Screenshot 2019-10-08 17-30-01]

If you are interested, I can send you (a large) patch file for the Stata version of the model.

On 2: Getting rid of the annual paradigm. Some disadvantages of the current year-based system:

Some parameters change during the year and sometimes you need the legislation at a specific time stamp during the year.
Some parameters almost never change.
A structural change to the tax and transfer system traditionally involves another if clause for a new year or period in the taxtransfer file. However, the difference between a parameter change and a structure change is blurry.

As a result we tested the move of (some) parameters to the taxtransfer file. While we did not complete this because it was a low priority I still have the impression that it does not hinder but improve the readability of the taxtransfer module. Especially if the tax and transfer laws can be modeled in separate modules or functions I think this can be beneficial.

Eric-Sommer commented 4 years ago

I am very sympathetic to everything you mentioned in 1), especially that the params document themselves. Are you by any chance keen on practicing some python/matplotlib? ;) How do you treat non-existent parameters (like alg2 stuff pre 2005)? Are they missing or non-existent?

Not sure about dropping the year paradigm. After all, you get the full set of parameters for a given year (1st Jan). Regarding your third point (structural changes versus parameter changes): we insert the respective functions into the param dictionary, which is pretty much equivalent to inserting them into the param files in the first place.

mjbloemer commented 4 years ago

How do you treat non-existent parameters (like alg2 stuff pre 2005)? Are they missing or non-existent?

Both are possible, and in both cases Stata will not create a local macro for that parameter/year. Currently I keep lines with missing values (for no specific reason). E.g. rs1984.csv starts:

param,value,note
a2an1
a2an2
...

But it is perfectly fine to remove these lines in csv files/years when a parameter is not defined.
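On the Python side, both variants (line kept without a value, or line removed) could be treated the same way. A sketch, assuming the param,value,note layout; the row rs_demo and its value are made up for illustration.

```python
import csv
import io

# Inline stand-in for a yearly file: a2an1 and a2an2 are not defined in this year,
# rs_demo is a made-up parameter with a made-up value.
text = "param,value,note\na2an1\na2an2\nrs_demo,100,illustrative row\n"


def defined_params(csv_text):
    """Return {param: float(value)} and skip rows that carry no value."""
    params = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row.get("value")
        # Short rows like "a2an1" leave the missing fields as None.
        if value is None or value.strip() == "":
            continue  # parameter not defined in this year
        params[row["param"]] = float(value)
    return params


print(defined_params(text))
```

With this, an undefined parameter is simply absent from the resulting dict, which mirrors Stata not creating a local macro for it.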

Maybe Stata/izamod is a bit off topic here... I will send you a patch file for izamod tomorrow.

Not sure about dropping the year paradigm. After all, you get the full set of parameters for a given year (1st Jan).

I guess the year paradigm is a good approximation. I do not push to change this. Just some thoughts:

Sometimes I specifically need Q1, Q3, end of year (actually in most cases 31dec) and so on. In these cases I have to look again in the legislation and overwrite the default values of parameters. No big deal, but maybe not best practice.
Also a bit more documentation is involved and more considerations have to be done if parameters/values can not begin/end intra year. Currently most parameters are based on the value that is predominant during a year. Still, the research assistants keep asking which value to choose when they research new legislation.
I also have that application in mind when a researcher wants to calculate a social benefit at a specific date during the year e.g. at the month of the interview of a survey which might vary during a year/wave across households.

Eric-Sommer commented 4 years ago

Thanks @mjbloemer. If no-one objects, could you transfer the params to csv as described above?

Eric-Sommer commented 4 years ago

Meanwhile, one could insert a call of excel_to_csv(test_path_folder) into the pre-commit hook. Is this possible/desirable (@tobiasraabe)?

import os

import pandas as pd
from pandas.testing import assert_frame_equal


def excel_to_csv(path):
    """Read all Excel files in `path` and save each one as a csv file
    with the same base name in the same folder.

    path (str): Folder where the Excel files are stored.
    """
    for file in os.listdir(path):
        if file.endswith((".xls", ".xlsx")):
            df = pd.read_excel(os.path.join(path, file))
            # splitext keeps dots inside the file name intact.
            out_file = os.path.join(path, os.path.splitext(file)[0] + ".csv")
            df.to_csv(out_file, index=False)
            # Read the csv back to check that the round trip is lossless.
            df2 = pd.read_csv(out_file)
            try:
                assert_frame_equal(df, df2)
            except AssertionError:
                print("Attention! Possible problem in converting to csv: {}".format(out_file))

tobiasraabe commented 4 years ago

@Eric-Sommer Why does the conversion have to be repeated for every commit? As I understand it, it is a one-shot thing.

I read everything, but I think I need a little more context, though it probably relates less to the conversion issue than to the broader representation of the parameters. So, params contains the parameters of the tax and transfer system, meaning the level of tax exempt amounts, child benefits, etc. for every year, right? And the issue is whether to split it into multiple files for every year instead of maintaining one big file, right? And the last thing to keep in mind is that year-based legislation is only an approximation to legislative changes occurring in some quarter of the year.

Eric-Sommer commented 4 years ago

@Eric-Sommer Why does the conversion have to be repeated for every commit? As I understand it, it is a one-shot thing.

I know how difficult it can be to set up a correct test. Therefore I'm in favor of maintaining the spreadsheets (ods rather than xlsx), but at the same time making changes visible via csv files. We could as well do it only once as soon as we feel we have an extensive test coverage. This is not yet the case, and future issues will require new/refined tests (e.g. #30, #22, #13, #10).

And the issue is whether to split it into multiple files for every year instead of maintaining one big file, right?

Correct, along with a note-file containing explanations on the parameters. Plus, they will be in .csv for versioning reasons.

mjbloemer commented 4 years ago

And the issue is whether to split it into multiple files for every year instead of maintaining one big file, right?

Besides getting rid of the binary xls file for multiple reasons (reliability, version control...), this is one aspect.

(Again, talking about our Stata based model:) Having multiple csv files - one for every year - also facilitated adding documentation strings for values in a specific year (law citations, explanations, sources, notes...) and made them easier to read in plain text and to process for our parameter documentation. Note, the old param.xls has/had cell annotations which were hard to read and impossible to process.

mjbloemer commented 4 years ago

Related to this and some points raised above, for example on how you handle parameters here: https://github.com/iza-institute-of-labor-economics/gettsim/blob/9bf3334c742f6c0dfc8e31d2b975aa5ddd5aa70d/gettsim/benefits/kiz.py#L136-L200

How do you (plan to) handle value documentation here?

In what instances do you store parameters in a separate parameter file and when do you store parameters next to the calculation like here?

Parameters can be

1. monetary values and parameters based on monetary values:
   a. some change often (benefits, taxes, "Wohnbedarf Anteil KiZ"...),
   b. some do not change often ("Soli", "Werbungskostenpauschbetrag", but also marginal employment thresholds...);
2. independent of any real value, such as fractions (e.g. "EK-Anteil anrechnungsfrei bei ALG II").

1a would be natural candidates for the param file(s) while 1b and 2 could be defined in or next to calculations/functions. It seems a bit strange to define a value that never changes again every year instead of defining them for a period (of years).

If only a few parameters of type 1a are left for the param files, it might be worth getting rid of param files altogether.

tobiasraabe commented 4 years ago

The split is not really necessary if you do not have a column for each year, but a single year column and one row per parameter and year. Of course, you get duplicated parameter names, but columns can be reserved for any type of information and you still have everything in one file.

Regarding your new comment, I think it is a bad idea to store some parameters next to functions and some in a file; this leads to a huge mess with no central place for all definitions. We should move them all to one central place.

The main idea behind params should be: What is the easiest way for users to specify parameters. It does not matter what we use internally as we will derive it from the user input. So, from the aforementioned examples, I assume every parameter is identified by a name and the duration (years). Then, this could be the index of the new params. Further information can be added as new columns.

Eric-Sommer commented 4 years ago

@mjbloemer: I agree this is inconsistent practice. We'd also need to incorporate the pension parameters. They are again somewhat different as they contain empirical values such as the number of pension contribution payers.

@tobiasraabe: do you have in mind sth like this?

import numpy as np
import pandas as pd

tb = pd.DataFrame({'param': ['alg2_regelsatz', 'alg2_regelsatz', 'alg2_regelsatz'],
                   'year_start': [1984, 2005, 2006],
                   'year_end': [2004, 2005, 2006],
                   'value': [np.nan, 338, 345]})

I find this a bit messy and opaque to be honest.

tobiasraabe commented 4 years ago

You can omit the first entry by using sensible defaults. Setting the first three columns as the index improves readability and makes it extremely easy to filter values. You can append further columns which reference the legislative text, etc.

I do not argue that this is the best way possible, but it solves all issues raised by @mjbloemer. What is exactly opaque? I have not contributed any legislation to gettsim, so it might be useful if you explain your process and how this does not fit your needs.

Eric-Sommer commented 4 years ago

If I'm debugging (or writing tests), I'd like to know e.g. the value of child tax credit (kifreib) for 2011. Since this was constant from 2010 to 2013, it does not become apparent immediately. Also keep in mind we will need adjusted params for reforms which will need to fit the base structure. Anything beyond a 2-column table creates higher costs for people setting up new reforms.
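One way to get the plain per-year view back for debugging would be to expand a range-based table on demand. A sketch under that assumption; demo_param and its value are made up and are not the actual kifreib figures.

```python
def yearly_view(rows, name):
    """Expand (name, year_start, year_end, value) rows into {year: value} for one parameter."""
    view = {}
    for row_name, year_start, year_end, value in rows:
        if row_name != name:
            continue
        for year in range(year_start, year_end + 1):
            view[year] = value
    return dict(sorted(view.items()))


rows = [
    ("alg2_regelsatz", 2005, 2005, 338),
    ("alg2_regelsatz", 2006, 2006, 345),
    # Made-up parameter that stays constant over several years.
    ("demo_param", 2010, 2013, 1.0),
]

print(yearly_view(rows, "demo_param"))
```

The expanded dict is effectively the 2-column (year, value) table, so a value for 2011 is visible directly even though it is stored only once for 2010 to 2013.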

mjbloemer commented 4 years ago

tb = pd.DataFrame({'param': ['alg2_regelsatz', 'alg2_regelsatz', 'alg2_regelsatz'],
               'year_start': [1984, 2005, 2006],
               'year_end': [2004, 2005, 2006],
               'value': [np.nan, 338, 345]}
               )

I find this a bit messy and opaque to be honest.

This is not easy to read in plain text as columns do not align, especially if you introduce cells for annotations.

tobiasraabe commented 4 years ago

Sure it is not. I assumed it is only a representation of params loaded in Python. You would define it in a csv like this

name,year_start,year_end,value,description
alg2_regelsatz,2005,2005,338,Regelsatz bei ALG II
alg2_regelsatz,2006,2006,345,Regelsatz bei ALG II

You edit this in your text editor or use excel. Jupyter has an embedded csv editor which aligns columns.

mjbloemer commented 4 years ago

Looks good to me.

description would be a documentation regarding the introduction/change of that parameter.

time invariant parameter documentation would be done in a meta file then?

tobiasraabe commented 4 years ago

You mentioned the Soli. It would just have a longer time frame. Or we could make every parameter which does not have start and end dates time invariant. Parameters without end dates apply to all future periods.

name,year_start,year_end,value,description
alg2_regelsatz,2005,2005,338,Regelsatz bei ALG II
alg2_regelsatz,2006,2006,345,Regelsatz bei ALG II
soli,1991,2019,0.055,Solidaritätszuschlag
schaumwein_steuer,1902,,0.5,Schaumweinsteuer
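A sketch of how a lookup could treat a missing end date as "applies to all future periods". It uses the same columns as the header above, with each value placed under the value column; get_param is a hypothetical helper name.

```python
import csv
import io

# An empty year_end means the parameter is still in force.
PARAMS_CSV = """name,year_start,year_end,value,description
alg2_regelsatz,2005,2005,338,Regelsatz bei ALG II
alg2_regelsatz,2006,2006,345,Regelsatz bei ALG II
soli,1991,2019,0.055,Solidaritätszuschlag
schaumwein_steuer,1902,,0.5,Schaumweinsteuer
"""


def get_param(csv_text, name, year):
    """Return the value of `name` in `year`, or None if it is not defined then."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["name"] != name:
            continue
        start = int(row["year_start"])
        end = int(row["year_end"]) if row["year_end"] else None
        if start <= year and (end is None or year <= end):
            return float(row["value"])
    return None


print(get_param(PARAMS_CSV, "soli", 2000))
print(get_param(PARAMS_CSV, "schaumwein_steuer", 2050))
```

Returning None for years outside every range also covers the "non-existent parameter" case raised earlier, without needing explicit NaN rows.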

Eric-Sommer commented 4 years ago

Looks good to me.

description would be a documentation regarding the introduction/change of that parameter.

time invariant parameter documentation would be done in a meta file then?

I agree, it's better to note what and where things were changed (BGBl and stuff). This is what the Excel comments do now. The explanation on what the parameter is about could be outsourced.

So can we agree on the following for the Parameters:

one param.csv as @tobiasraabe just sketched above: name, year_start, year_end, value and note on the change.
one param_description.csv containing the columns: name, description_DE, description_EN, source, and potentially a short identifier which realm it belongs to (child benefit, ALG2, income tax).
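A sketch of how the two files could be combined when a full view is needed, assuming both share the name column. The notes, source and realm entries below are placeholders, and join_params is a made-up helper.

```python
import csv
import io

PARAM_CSV = """name,year_start,year_end,value,note
alg2_regelsatz,2005,2005,338,placeholder note
alg2_regelsatz,2006,2006,345,placeholder note
"""

# Descriptions, source and realm are placeholders for illustration.
DESCRIPTION_CSV = """name,description_DE,description_EN,source,realm
alg2_regelsatz,Regelsatz bei ALG II,Standard rate,placeholder source,alg2
"""


def join_params(param_text, description_text):
    """Attach the time-invariant description to every row of the value table."""
    descriptions = {
        row["name"]: row for row in csv.DictReader(io.StringIO(description_text))
    }
    joined = []
    for row in csv.DictReader(io.StringIO(param_text)):
        info = descriptions.get(row["name"], {})
        joined.append({**info, **row})
    return joined


for entry in join_params(PARAM_CSV, DESCRIPTION_CSV):
    print(entry["name"], entry["year_start"], entry["value"], entry["realm"])
```

The description is stored once per parameter and only repeated in the joined view, which is the point of keeping it in a separate file.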

mjbloemer commented 4 years ago

Sounds good to me. The short identifier could resemble the file names in taxes/ and benefits/.

tobiasraabe commented 4 years ago

Sorry for being nitpicky :). Why do you think it is necessary to have the description separated from the parameters?

mjbloemer commented 4 years ago

Because there are two description types. Those for the parameter itself (name, law) which are constant over time; no need to state them again. And those for the specific value, i.e. the description of the change/introduction (law that changes the law, source, calculation details, notes) which change every time a parameter changes.

hmgaudecker commented 4 years ago

Sorry to come late to the discussion. Somehow it slipped from my attention. Thanks for the thoughtful discussions!

But strong opinions, of course, despite little time right now :-). Sorry about that!

Can we please keep the discussions of test data and parameters separate? I think they have fairly little to do with each other, other than the fact that we want to avoid binary formats in the long run.

For the test data, I think we want to move to a much more fine-grained thing that is far away from the current situation. Ideally, we want to keep most stuff in the py-files themselves.
The parameters will be crucial. We will have much more detailed discussions on the pros and cons of various options in the future. The choice will also depend on how we will structure the overall codebase. How we split the monstrous xls-file, whether along the time-dimension or by tax/transfer, will be subject to change.
So in the code, we should be sure just to have one central place to do I/O.
For now, I would be fine with anything that means not losing any information.
For that reason it would also be great if we could include precise dates as opposed to years only. What if somebody wants to use it for an RD design for Elterngeld+, which to the best of my recollection was introduced 1 July 2015? I would like to support such use cases in the long run and it seems simpler to keep things around from the beginning rather than going back later.

I am not sure I see the point of converting one binary format for another in #40. Last time I checked, Excel did not even know what an ods-file was...

Eric-Sommer commented 4 years ago

I am not sure I see the point of converting one binary format for another in #40. Last time I checked, Excel did not even know what an ods-file was...

Well, by now it does. #40 is just a minimally invasive way to get rid of the compatibility problem mentioned above (third post) by @MaxBlesch.

hmgaudecker commented 4 years ago

Fair enough. No objections then, if this does not slow us down in moving away from these things altogether. No formulas should be needed in tests, eventually.

mjbloemer commented 4 years ago

Can we please keep the discussions of test data and parameters separate?

This issue is mostly on the params - now, sorry ;). On the param discussion:

For now, I would be fine with anything that means not losing any information.

If that refers to the process of moving from xls to csv, it should be enough to compare result files after migration? In our case, we had a simple script that did a one time conversion to csv(s) and we ran some tests to check that there is no change in results in any year (basically a simple diff of key result files - which we usually always run in the review process of a merge request btw).
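A sketch of such a "simple diff of key result files", assuming results are stored as csv with purely numeric columns. The two inline result files and the column names are made up.

```python
import csv
import io


def load_results(csv_text):
    """Parse a result file into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def find_differences(before_text, after_text, tol=1e-9):
    """Return (row, column, before, after) for every numeric cell that changed."""
    before = load_results(before_text)
    after = load_results(after_text)
    if len(before) != len(after):
        raise ValueError("result files have a different number of rows")
    diffs = []
    for i, (old, new) in enumerate(zip(before, after)):
        if old.keys() != new.keys():
            raise ValueError("result files have different columns")
        for col in old:
            if abs(float(old[col]) - float(new[col])) > tol:
                diffs.append((i, col, old[col], new[col]))
    return diffs


# Made-up result files: one household, before and after the migration.
results_xls = "hid,year,net_income\n1,2018,1500.0\n"
results_csv = "hid,year,net_income\n1,2018,1500.0\n"

print(find_differences(results_xls, results_csv))
```

An empty list means the migration did not change any result; anything else points to the exact row and column to investigate.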

If the param.xls is regarded as a nice to have overview, just create a simple script that replicates the xls (or better, a nice pdf documentation) on demand like I mentioned before.

I am not sure I see the point of converting one binary format for another in #40.

Agree (for param). Besides versioning issues and the reliability of binary formats, they effectively add a full GUI office suite as a dependency to gettsim.

For that reason it would also be great if we could include precise dates as opposed to years only. What if somebody wants to use it for an RD design for Elterngeld+, which to the best of my recollection was introduced 1 July 2015? I would like to support such use cases in the long run and it seems simpler to keep things around from the beginning rather than going back later.

I guess with the year paradigm this could only go to the notes/documentation.

hmgaudecker commented 4 years ago

Can we please keep the discussions of test data and parameters separate?

This issue is mostly on the params - now, sorry ;). On the param discussion:

Thanks for the clarification -- I did not manage to read every post in detail yet.

For now, I would be fine with anything that means not losing any information.

If that refers to the process of moving from xls to csv, it should be enough to compare result files after migration? In our case, we had a simple script that did a one time conversion to csv(s) and we ran some tests to check that there is no change in results in any year (basically a simple diff of key result files - which we usually always run in the review process of a merge request btw).

Ensuring consistency is of course of even higher importance. I assumed this would be the case anyhow :-)

I mostly meant stuff like the years vs dates below, references to laws, etc. Any information that we currently have should not disappear in the move (in the sense of coarsening or losing it altogether) and if somebody needs to look up a law, he or she should add the precise date if we only have a year etc.

If the param.xls is regarded as a nice to have overview, just create a simple script that replicates the xls (or better, a nice pdf documentation) on demand like I mentioned before.

I am not sure I see the point of converting one binary format for another in #40.

Agree (for param). Besides versioning issues and the reliability of binary formats, they effectively add a full GUI office suite as a dependency to gettsim.

+1

Though see above -- I do not have any objections if this solves pain in the short run at low (now zero?) cost and we will need to have a serious discussion - if possible in-person - at some point anyhow.

For that reason it would also be great if we could include precise dates as opposed to years only. What if somebody wants to use it for an RD design for Elterngeld+, which to the best of my recollection was introduced 1 July 2015? I would like to support such use cases in the long run and it seems simpler to keep things around from the beginning rather than going back later.

I guess with the year paradigm this could only go to the notes/documentation.

For now, we should keep the information around in whatever params file we use. We can always discard it when running the current code. I.e., for the moment we would first check whether a date can be parsed as YYYY-MM-DD and otherwise use YYYY. I just want to avoid checking such stuff multiple times because we cannot handle the more precise version for the moment.
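In code, the "try YYYY-MM-DD first, otherwise fall back to YYYY" check could look roughly like this. parse_start and the is_exact flag are hypothetical names, not existing gettsim functions.

```python
from datetime import date, datetime


def parse_start(text):
    """Return (date, is_exact). A bare year maps to 1 January and is flagged as inexact."""
    text = str(text).strip()
    try:
        return datetime.strptime(text, "%Y-%m-%d").date(), True
    except ValueError:
        pass
    # Not a full date: interpret the entry as a year.
    return date(int(text), 1, 1), False


print(parse_start("2015-07-01"))
print(parse_start("2015"))
```

Keeping the flag around means nobody has to look up the same law twice just to find out whether 1 January was the real date or only a default.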

hmgaudecker commented 4 years ago

We also have a paraminfo.do script that replicates the old param.xls on demand and also generates a full fledged latex based parameter documentation with nice graphs and annotations based on the information in rsyyyy.csv and meta.csv. Screenshot of one page:

If you are interested, I can send you (a large) patch file for the Stata version of the model.

This sort of documentation would be extremely cool!!!

Can you maybe just open a PR with the Stata code so it does not end up being forgotten and we will take it from there?

mjbloemer commented 4 years ago

Here you go: https://gist.github.com/mjbloemer/834409a80758e5354a8298b67ba52968

It's just a little script to get the xls and pdf from the individual csv files. Don't want to pollute your repository here with this Stata code ;)

mjbloemer commented 4 years ago

Quick glimpse into openfisca:

hmgaudecker commented 4 years ago

Thanks! This seems to support my prior that the combination of a structured data format like yaml or json + generating readable views in html or the like would seem like a promising (& scalable) way forward eventually?

mjbloemer commented 4 years ago

A quick sketch of a possible param.yaml:

-  rs_hhvor
  group: alg2
  period: month
  unit: 1 Euro
  name:
    de: Regelsatz
    en: Standard rate
  description:
    de: Wirknorm ist §20 V SGB II... blabla.
    en: §20 V SGB II... bla bla.
  values:
    2005-07-01:
      value: 338
      note: B. v. 01.09.2005 BGBl. I S. 2718. (SGB2§20Bek 2005). Der tatsächliche Wert unterscheidet sich zwischen Ost und West. Hier wurde vereinfachend 338 Euro als ungewichteter Mittelwert genommen. Korrekte Werte für 2005 sind in den alten Bundesländern einschließlich Berlin (Ost) 345 Euro, in den neuen Bundesländern 331 Euro.
    2006-07-01:
      value: 345
      note: B. v. 20.07.2006 BGBl. I S. 1702. (SGB2§ 20Bek 2006)
    2007-07-01:
      value: 347
      note: B. v. 18.06.2007 BGBl. I S. 1139. (SGB2§ 20Bek 2007).
    2008-07-01:
      value: 351
      note: B. v. 26.06.2008 BGBl. I S. 1102. (SGB2§ 20Bek 2008).
    2009-07-01:
      value: 359
      note: B. v. 17.06.2009 BGBl. I S. 1342. (SGB2§ 20Bek 2009).
    2010-07-01:
      value: 359
      note: B. v. 07.06.2010 BGBl. I S. 820. (SGB2§ 20Bek 2010).
    2011-01-01:
      value: 364
      note: § 8 RBEG Artikel 1 G. v. 24.03.2011 BGBl. I S. 453 (Nr. 12).
    2012-01-01:
      value: 374
      note: B. v. 20.10.2011 BGBl. I S. 2093 (Nr. 53). FNA 860-2-16-1 Sozialgesetzbuch.
    2013-01-01:
      value: 382
      note: B. v. 18.10.2012 BGBl. I S. 2175 (Nr. 49). FNA 860-2-16-2 Sozialgesetzbuch.
    2014-01-01:
      value: 391
      note: B. v. 16.10.2013 BGBl. I S. 3857 (Nr. 63) FNA 860-2-16-3 Sozialgesetzbuch.
    2015-01-01:
      value: 399
      note: B. v. 15.10.2014 BGBl. I S. 1620 (Nr. 47) FNA 860-2-16-4 Sozialgesetzbuch.
    2016-01-01:
      value: 404
      note: B. v. 22.10.2015 BGBl. I S. 1792 (Nr. 41) FNA 860-2-16-5 Sozialgesetzbuch.
    2017-01-01:
      value: 409
      note: BGBl. I S. 3159.
    2018-01-01:
      value: 416
      note: RBSFV 2018, V. v. 08.11.2017 BGBl. I S. 3767 (Nr. 73).
    2019-01-01:
      value: 424
      note: RBSFV 2019, V. v. 19.10.2018 BGBl. I S. 1766 (Nr. 36).

-  a2grf
  group: alg2
  period: month
  unit: 1 Euro
  name:
    de: Anrechnungsfreier Grundbetrag
    en: Gross income not subject to transfer withdrawal
  description:
    de: § 11b (3) SGB II. Wirknorm §11b II SGB II.
    en: § 11b (3) SGB II. §11b II SGB II.
  values:
    2005-01-01:
      value: 100
      note: blablabla

- a2kiz
  group: kiz
  period: month
  unit: 1 Euro
  name: 
    de: Kinderzuschlag
    en: Additional Child Benefit
  description:
    de: Höhe des Kinderzuschlags. § 6a (2) BKGG.
    en: blabla...
  values:
    2005-01-01:
      value: 140
      note: blabla
    2016-07-01: 
      value: 160
      note: Geändert durch Artikel 7 G. v. 16.07.2015. BGBl. I S. 1202.
    2017-01-01: 
      value: 170
      note: Geändert durch Artikel 12 G. v. 20.12.2016. BGBl. I S. 3000.
    2019-07-01:
      value: 185
      note: Geändert durch Artikel 1 StaFamG v. 29.04.2019 BGBl. I S. 530 (Nr. 16).

A few notes/questions:

I think the readability is much better than csv. Also it should be easy to process this in a web based documentation.
From the py side: is this a structure that is easy to work with?
Do we really need an end date? Please give some feedback if you find a parameter where this is really needed.
I switched to a daily level. However, a first migration from the param.xls will start with the yearly level and can easily be extended later.
I added period and unit. Might come in handy if we want to have our parameter definitions closer to the written law (e.g. DM values or notation "in Percent").
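On the question of whether this structure is easy to work with and whether an end date is needed: a sketch suggesting that start dates alone can be enough for a lookup. It assumes the file has already been loaded into nested dicts whose date keys are datetime.date objects; the values are taken from the a2kiz entry, notes are omitted, and value_at is a made-up helper.

```python
from bisect import bisect_right
from datetime import date

# Nested dicts as they might look after loading the sketch above.
a2kiz = {
    "group": "kiz",
    "values": {
        date(2005, 1, 1): {"value": 140},
        date(2016, 7, 1): {"value": 160},
        date(2017, 1, 1): {"value": 170},
        date(2019, 7, 1): {"value": 185},
    },
}


def value_at(param, when):
    """Return the value in force on `when`, or None before the first start date."""
    starts = sorted(param["values"])
    # Position of the first start date that lies strictly after `when`.
    pos = bisect_right(starts, when)
    if pos == 0:
        return None
    return param["values"][starts[pos - 1]]["value"]


print(value_at(a2kiz, date(2016, 6, 30)))
print(value_at(a2kiz, date(2016, 7, 1)))
```

Each value implicitly ends when the next one starts, so an explicit end date would only be needed for a parameter that is abolished without replacement.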

hmgaudecker commented 4 years ago

This looks great, thanks!

I think the readability is much better than csv. Also it should be easy to process this in a web based documentation.

Indeed.

From the py side: is this a structure that is easy to work with?

I think so. Although a couple of minutes of playing around with pyyaml suggests we need Example 2.6 (Mapping of Mappings) from here for the structure you are trying to get:

Mark McGwire: {hr: 65, avg: 0.278}
Sammy Sosa: {
    hr: 63,
    avg: 0.288
}

Looks a little more like JSON again. Eventually, we probably want to break into smaller files and we can go back to something closer to example 2.2 on that side.

Do we really need an end date? Please give some feedback if you find a parameter where this is really needed.

Maybe a special flag "abolished", e.g. for the Social Support pre Hartz reforms, maternity pay pre-Elterngeld etc.? I would think that this sort of stuff becomes tricky when entire functions change, not only parameters of similar functions.

I switched to a daily level. However, a first migration from the param.xls will start with the yearly level and can easily be extended later.

Is it harder to export on a daily level? If not, just use the Jan 1st for each year and, if possible, add a flag that the precise date is provisional.

I added period and unit. Might come in handy if we want to have our parameter definitions closer to the written law (e.g. DM values or notation "in Percent").

Very good. But let us please stick to Euro / DM and multiply by thousands or whatever if needed.

mjbloemer commented 4 years ago

From the py side: is this a structure that is easy to work with?

I think so. Although a couple of minutes of playing around with pyyaml suggests we need Example 2.6 (Mapping of Mappings) from here for the structure you are trying to get:
Mark McGwire: {hr: 65, avg: 0.278}
Sammy Sosa: {
    hr: 63,
    avg: 0.288
}

Sorry, I have no experience with json or yaml. Where exactly do we need the brackets and the comma? I have the impression any structure can be achieved with indentation. If I understand correctly, the brackets are only one alternative (flow) style. Still, I have no idea what is best for the python side.

Maybe a special flag "abolished", e.g. for the Social Support pre Hartz reforms, maternity pay pre-Elterngeld etc.? I would think that this sort of stuff becomes tricky when entire functions change, not only parameters of similar functions.

With the policies you mentioned I see no problem. For the year or period these policies are not in effect or abolished they should not be calculated anyway.

We still could add a missing value or not defined flag starting at a specific date. In very rare cases it is also possible to set a parameter to zero (before a law is in force or after) if we absolutely want to keep the calculation procedure for that year and calculate the policy. Anyway, I have the impression that in both cases this goes more in the direction of quick and dirty hacks and should be avoided as much as possible.

Is it harder to export on a daily level?

not at all, but it will just add a -01-01 to the year.

If not, just use the Jan 1st for each year and, if possible, add a flag that the precise date is provisional.

yes, this would be one of the post conversion tasks.

Very good. But let us please stick to Euro / DM and multiply by thousands or whatever if needed.

I think every monetary value in param.xls is currently in Euro; DM values are converted to Euro (which makes it just a little bit harder when comparing to the written law).

hmgaudecker commented 4 years ago

Sorry, I have no experience with json or yaml. Where exactly do we need the brackets and the comma? I have the impression any structure can be achieved with indentation. If I understand correctly, the brackets are only one alternative (flow) style. Still, I have no idea what is best for the python side.

Great, thanks -- you did much better research than me. Sorry for the noise, you were very close:

rs_hhvor:
  group: alg2
  period: month
  unit: Euro
  name:
    de: Regelsatz
    en: Standard rate

The only difference is in the top level: key: instead of - key there.

@MaxBlesch: This gives a wonderful roundtrip with yaml.safe_load() and yaml.safe_dump().

Maybe a special flag "abolished", e.g. for the Social Support pre Hartz reforms, maternity pay pre-Elterngeld etc.? I would think that this sort of stuff becomes tricky when entire functions change, not only parameters of similar functions.

With the policies you mentioned I see no problem. For the year or period these policies are not in effect or abolished they should not be calculated anyway.

Precisely. But ideally, we would have that information in the yaml file (=everything in one place) and make use of it inside the code, rather than hard-coding conditionals there.

We still could add a missing value or not defined flag starting at a specific date. In very rare cases it is also possible to set a parameter to zero (before a law is in force or after) if we absolutely want to keep the calculation procedure for that year and calculate the policy. Anyway, I have the impression that in both cases this goes more in the direction of quick and dirty hacks and should be avoided as much as possible.

My view is just the opposite: Eventually, I would love to see that we automatically choose which functions to call based on which rules are in effect.
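
Roughly what I have in mind, as a sketch only. The policy names, the dates, and the keys "in_effect_from" / "abolished_on" are made up for illustration; in practice this information would come from the yaml file:

```python
from datetime import date


def policy_a(household):
    return "policy_a computed"


def policy_b(household):
    return "policy_b computed"


# Hypothetical rule table; dates are placeholders, not actual law.
RULES = [
    {"func": policy_a, "in_effect_from": date(2000, 1, 1), "abolished_on": date(2010, 1, 1)},
    {"func": policy_b, "in_effect_from": date(2010, 1, 1), "abolished_on": None},
]


def active_functions(on_date):
    """Return the functions whose rules are in effect on the given date."""
    active = []
    for rule in RULES:
        started = rule["in_effect_from"] <= on_date
        ended = rule["abolished_on"] is not None and on_date >= rule["abolished_on"]
        if started and not ended:
            active.append(rule["func"])
    return active


# Only the functions returned here would be called for that date.
results = [func({}) for func in active_functions(date(2005, 6, 1))]
```

That way there are no hard-coded year conditionals inside the calculation functions themselves.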

Is it harder to export on a daily level?

not at all, but it will just add a -01-01 to the year.

If not, just use the Jan 1st for each year and, if possible, add a flag that the precise date is provisional.

yes, this would be one of the post conversion tasks.

Excellent.
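
For the export, something along these lines would do. This is a sketch under assumptions: the function name and the "date_provisional" key are hypothetical, and the numbers are placeholders:

```python
from datetime import date


def yearly_to_dated(values_by_year):
    """Turn {year: value} into dated entries, each flagged as provisional."""
    return [
        {"date": date(year, 1, 1), "value": value, "date_provisional": True}
        for year, value in sorted(values_by_year.items())
    ]


# Placeholder numbers, not actual parameter values.
entries = yearly_to_dated({2006: 2.0, 2005: 1.0})
```

The flag can then be set to False entry by entry once the precise date has been checked against the law.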

Very good. But let us please stick to Euro / DM and multiply by thousands or whatever if needed.

I think every monetary value in param.xls is currently in Euro; DM values are converted to Euro (which makes it just a little bit harder when comparing to the written law).

Perfect. I have no strong opinion on whether we want to convert those values now, but eventually DM would seem to make more sense to me.

Eric-Sommer commented 4 years ago

I think every monetary value in param.xls is currently in Euro; DM values are converted to Euro (which makes it just a little bit harder when comparing to the written law).

Perfect. I have no strong opinion on whether we want to convert those values now, but eventually DM would seem to make more sense to me.

There is one exception, namely the tax tariff parameters before 2002. The others are in Euro; initially because SOEP delivers all incomes in Euro as well. I'd prefer sticking to Euro all the way. If you do a long-term analysis, you'd report in Euro anyway. It's easier to convert incomes rather than parameters, as there are lots of parameters (shares) which must not be converted.

hmgaudecker commented 4 years ago

I think every monetary value in param.xls is currently in Euro; DM values are converted to Euro (which makes it just a little bit harder when comparing to the written law).

Perfect. I have no strong opinion on whether we want to convert those values now, but eventually DM would seem to make more sense to me.

There is one exception, namely the tax tariff parameters before 2002. The others are in Euro; initially because SOEP delivers all incomes in Euro as well. I'd prefer sticking to Euro all the way. If you do a long-term analysis, you'd report in Euro anyway. It's easier to convert incomes rather than parameters, as there are lots of parameters (shares) which must not be converted.

I think we are on the same page in the sense that internally, we will do everything in Euros.

I disagree on parameters being more difficult to convert. That may be the case in Stata if you have to store all these animals in macros, but here?

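def convert_dm_to_euro(dm):
    # Fixed official conversion rate: 1 Euro = 1.95583 DM.
    return dm / 1.95583

# raw_val and unit are read from the parameter entry.
if unit == "Euro":
    val = raw_val
elif unit == "DM":
    val = convert_dm_to_euro(raw_val)
else:
    raise ValueError(f"Monetary unit unknown: {unit}")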

So this is only about how to store the input parameters, and I think the idea of sticking close to the law is good there. As I wrote previously, I do not care much whether we do that right away or at some later point (in that case, @mjbloemer, please open an issue once we close this one, lest we forget).

What we want to support in terms of input data is a wholly different question.

mjbloemer commented 4 years ago

I changed the title to keep only the params aspect. This issue can be closed if e.g. #54 is merged.

If the datatype-of-test-data aspect is still relevant, someone should open a separate issue.

iza-institute-of-labor-economics / gettsim

Datatype of params #31