iza-institute-of-labor-economics / gettsim

The GErman Taxes and Transfers SIMulator
https://gettsim.readthedocs.io/
GNU Affero General Public License v3.0

Linking children's transfers to their parents. #676

Closed MImmesberger closed 9 months ago

MImmesberger commented 11 months ago

This PR is supposed to link transfers on the children level directly to their parents based on p_id_elternteil_1 and p_id_elternteil_2.

This affects the calculation of Kinderfreibetrag, Kindergeld, Erziehungsgeld and Elterngeld.

Additionally, take-up issues are relevant regarding the Kindergeld, Erziehungsgeld and Elterngeld (does p_id_elternteil_1 or p_id_elternteil_2 take up the transfer?).

MImmesberger commented 11 months ago

I thought about the implementation for some time and realised that having parents-children links would require larger changes to the grouping and aggregation code (at least _create_aggregation_functions and _create_derived_functions) as groups would be overlapping (because children belong to an "elternteil_1" and "elternteil_2" group and because children may have children themselves).

An alternative would be to handle p_id-specific operations across observations separately from the currently existing aggregation steps.

Have we already discussed how we want to implement this? @hmgaudecker @lars-reimann

hmgaudecker commented 11 months ago

Good points. My thinking has been somewhat sloppy indeed.

codecov[bot] commented 10 months ago

Codecov Report

Attention: 115 lines in your changes are missing coverage. Please review.

Comparison is base (f295be3) 91.54% compared to head (3aea9fa) 89.22%.

| Files | Patch % | Lines |
|---|---|---|
| src/_gettsim/functions_loader.py | 64.51% | 44 Missing :warning: |
| src/_gettsim/aggregation.py | 4.76% | 40 Missing :warning: |
| src/_gettsim/aggregation_numpy.py | 47.45% | 31 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #676      +/-   ##
==========================================
- Coverage   91.54%   89.22%   -2.33%
==========================================
  Files          51       51
  Lines        3443     3592     +149
==========================================
+ Hits         3152     3205      +53
- Misses        291      387      +96
```


MImmesberger commented 10 months ago

How we implement the required changes depends on how strongly we want to adhere to the current p_id_elternteil_x design. We need to discuss what the user inputs should look like.

Taking Kindergeld as an example we could do it this way:

  1. GETTSIM calculates the transfer that each person is eligible for in their own right (on the individual level). I think we already do this for Kindergeld and Erziehungsgeld. The target could be something like kindergeld_anspruch (float). We do this for every person in the dataset (so also for rows that are not eligible for Kindergeld).
  2. We use user-specified pointers to the p_id of the person who actually claims (and receives) the benefit. The basic input column here could be p_id_kindergeld_auszahlung, specified on the child level. We can also add defaults and checks on whether the user input is correct (i.e. whether the person who claims the benefit can actually do so).
  3. The benefit is aggregated on the basis of p_id_kindergeld_auszahlung and summarized in a variable kindergeld_m.
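A minimal sketch of the three steps above, in plain Python rather than GETTSIM's actual backend. The helper `aggregate_by_p_id`, the toy data, the monthly amount of 250.0, and the use of -1 for "no recipient" are all assumptions made for illustration.

```python
def aggregate_by_p_id(source, p_id_target, p_id):
    """Sum `source` over all rows whose pointer equals a given p_id."""
    totals = {p: 0.0 for p in p_id}
    for value, target in zip(source, p_id_target):
        if target in totals:  # pointers to nobody (here: -1) are skipped
            totals[target] += value
    return [totals[p] for p in p_id]


# Persons 0 and 1 are adults, persons 2 and 3 are children (toy data).
p_id = [0, 1, 2, 3]

# Step 1: individual-level entitlement for every row (0.0 if not eligible).
# The amount is made up.
kindergeld_anspruch = [0.0, 0.0, 250.0, 250.0]

# Step 2: user-specified pointer to the person who receives the benefit.
p_id_kindergeld_auszahlung = [-1, -1, 0, 0]

# Step 3: aggregate the entitlements at the recipient.
kindergeld_m = aggregate_by_p_id(
    kindergeld_anspruch, p_id_kindergeld_auszahlung, p_id
)
```

Because the pointer is an arbitrary p_id, the recipient does not have to be one of the child's parents.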

An alternative would be the following:

We utilise the existing p_id_elternteil_x columns. The procedure is the same as above, but in step 2, the user specifies a variable kindergeld_elternteil_auszahlung that determines whether parent 1 or 2 (or the child) receives the transfer. The downside is that we cannot represent transfers that should be paid out to someone else (e.g. think of grandparents receiving Kindergeld, but also Unterhalt from the parents).
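To make the limitation of this alternative concrete, here is a small sketch of how such a selector could be resolved into a recipient p_id. The encoding (0 = the child itself, 1 = parent 1, 2 = parent 2), the function name `resolve_recipient`, and the toy data are assumptions, not something fixed in this thread.

```python
def resolve_recipient(p_id, p_id_elternteil_1, p_id_elternteil_2, selector):
    """Translate a 0/1/2 selector into the p_id of the recipient."""
    recipients = []
    for own, parent_1, parent_2, choice in zip(
        p_id, p_id_elternteil_1, p_id_elternteil_2, selector
    ):
        if choice == 1:
            recipients.append(parent_1)
        elif choice == 2:
            recipients.append(parent_2)
        else:  # hypothetical code 0: the child itself
            recipients.append(own)
    return recipients


p_id = [0, 1, 2, 3]
p_id_elternteil_1 = [-1, -1, 0, 0]  # -1: no parent in the data (assumption)
p_id_elternteil_2 = [-1, -1, 1, 1]
kindergeld_elternteil_auszahlung = [0, 0, 1, 2]

recipients = resolve_recipient(
    p_id, p_id_elternteil_1, p_id_elternteil_2, kindergeld_elternteil_auszahlung
)
```

The result can only ever be the child or one of its two parents, which is exactly why a grandparent as recipient cannot be expressed this way.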

hmgaudecker commented 10 months ago

I am afraid I am unable to fully follow. The basic structure seems the same; the only difference is whether we add an explicit pointer such as p_id_kindergeldempfänger (that seems to be the "right" name to me)?

If that is the case, I think we should go for the explicit one.

kinderfreibetrag is likely the more intricate case. Typically each parent receives one, but there are cases where one parent gets both. So we might need something like p_id_kinderfreibetragempfänger_1 and p_id_kinderfreibetragempfänger_2?

hmgaudecker commented 10 months ago

In general, we always prefer the user specifying things explicitly (even if it creates a slightly higher burden on her) rather than assuming things which cover 90% of the use cases and are wrong for the remainder.

MImmesberger commented 10 months ago

I agree, that is what I had in mind as well.

Then we should add tests of the user input at some point (e.g. if both parents are specified via p_id_kinderfreibetragempfänger_1 and p_id_kinderfreibetragempfänger_2, one of those two persons (or the child itself) must receive the Kindergeld and no one else).
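One way such an input check could look, as a sketch only. The function name, the short argument names, and -1 as the marker for an unspecified pointer are assumptions; the rule itself is the one stated above.

```python
def invalid_kindergeld_recipients(p_id, kfb_1, kfb_2, kg_recipient, missing=-1):
    """Return the p_ids of children with an inconsistent Kindergeld recipient.

    kfb_1 / kfb_2 stand for p_id_kinderfreibetragempfänger_1 / _2, and
    kg_recipient for the Kindergeld pointer. Rows where any pointer is
    unspecified are not checked.
    """
    bad = []
    for child, r1, r2, kg in zip(p_id, kfb_1, kfb_2, kg_recipient):
        if missing in (r1, r2, kg):
            continue
        if kg not in (r1, r2, child):
            bad.append(child)
    return bad


p_id = [0, 1, 2, 3, 4]
kfb_1 = [-1, -1, -1, 0, 0]
kfb_2 = [-1, -1, -1, 1, 1]
# Child 3 is fine (recipient is parent 0); child 4 points at person 2,
# who is neither a Freibetrag recipient nor the child itself.
kg_recipient = [-1, -1, -1, 0, 2]

violations = invalid_kindergeld_recipients(p_id, kfb_1, kfb_2, kg_recipient)
```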

hmgaudecker commented 10 months ago

Yes, I am afraid that there will be plenty of data consistency checks in these. I do think, however, that we can add them as we go along. No need to research all of them before they crop up.

MImmesberger commented 10 months ago

I added some test files and renamed some functions to illustrate the changes necessary. I propose the following convention:

Take some tax or Freibetrag called x:

In case you have different naming suggestions let me know (I'm not too happy with x_eltern_m but I can't think of a better name that is short and applies to all children related transfers).

MImmesberger commented 9 months ago

To summarize the discussion so far, there are two steps:

Creating p_id_kinderfreibetrag_1 and p_id_kinderfreibetrag_2

(A side note: Having one parent that receives both Kinderfreibeträge is only possible if the other parent doesn't satisfy alimony requirements to some extent. I would prefer to leave that out for now.)

Aggregating the children-related transfers / the children-induced entitlements for Freibeträge to their parents.

The tests that I wrote should cover all the new features that are discussed here.

Does that make sense to everyone? @hmgaudecker @lars-reimann

lars-reimann commented 9 months ago

From a design point of view, this seems to be a reasonable solution.

hmgaudecker commented 9 months ago

Sounds great! Just one little clarification question:

> (A side note: Having one parent that receives both Kinderfreibeträge is only possible if the other parent doesn't satisfy alimony requirements to some extent. I would prefer to leave that out for now.)

Not sure what you mean by "leave out":

  1. Disallow the possibility that p_id_kinderfreibetrag_1 and p_id_kinderfreibetrag_2 point to the same person?
  2. Do not attempt to model the rules that lead to p_id_kinderfreibetrag_1 and p_id_kinderfreibetrag_2 pointing to the same person when the user does not specify these two p_id-variables explicitly?

I'd be all for 2. (modeling this is probably not even possible, as it seems to be decided on a case-by-case basis), but would prefer not to do 1., i.e. to keep allowing both pointers to refer to the same person. So users should be able to make it happen, but the typical use case will not cover it.

MImmesberger commented 9 months ago

Yes, I should have been clearer about this. That's exactly what I had in mind. We generally allow for double Freibeträge but we don't create them endogenously -- they have to be enforced by the user.
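As an illustration of "allowed but not created endogenously": if claims are simply counted over both pointer columns, a double Freibetrag appears exactly when the user sets both pointers to the same person. The helper `count_claims`, the toy data, and -1 for an unspecified pointer are assumptions.

```python
def count_claims(p_id, *pointer_cols):
    """Count how often each person is named across the pointer columns."""
    counts = {p: 0 for p in p_id}
    for col in pointer_cols:
        for target in col:
            if target in counts:  # -1 (nobody) is ignored
                counts[target] += 1
    return [counts[p] for p in p_id]


p_id = [0, 1, 2, 3]
# Child 2: one Freibetrag per parent (the typical case).
# Child 3: the user points both columns at parent 0 (double Freibetrag).
p_id_kinderfreibetrag_1 = [-1, -1, 0, 0]
p_id_kinderfreibetrag_2 = [-1, -1, 1, 0]

anz_ansprueche = count_claims(
    p_id, p_id_kinderfreibetrag_1, p_id_kinderfreibetrag_2
)
```

Parent 0 ends up with three claims (one for child 2, two for child 3) and parent 1 with one, without any rule about alimony being modeled.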

MImmesberger commented 9 months ago

I realized that parent-child links in the case of Kindergeld are more complex than for Freibeträge and Erziehungsgeld because in the past Kindergeld depended on the number of children.

This implies that the notion of "Kindergeld belongs to the child but is paid out to the parent" is wrong in this case; we cannot assign the amount of Kindergeld that the child is eligible for without knowing i) which parent claims Kindergeld and ii) for how many children this parent claims Kindergeld (think of couples with children from different partners).

This sounds like we need to treat Kindergeld very similarly to Kinderfreibeträge: It doesn't exist on the child level, but it may be paid out to the child instead of the parent in special cases. However, now I'm not sure whether we even want to cover those special cases?
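A sketch of why the amount cannot be fixed at the child level when rates are staggered by the number of children. The rate schedule below is made up purely for illustration (it is not any year's actual Kindergeld schedule), and the function names and -1 for "no recipient" are assumptions.

```python
# Hypothetical monthly rates for the 1st, 2nd and 3rd child of a claimant;
# every further child gets RATE_FURTHER.
RATES = (100.0, 100.0, 150.0)
RATE_FURTHER = 200.0


def kindergeld_for_claimant(n_children):
    """Total monthly amount for a claimant with `n_children` claimed children."""
    total = 0.0
    for i in range(n_children):
        total += RATES[i] if i < len(RATES) else RATE_FURTHER
    return total


def kindergeld_m(p_id, eligible, p_id_recipient):
    """Count eligible children per recipient, then apply the staggered rates."""
    counts = {p: 0 for p in p_id}
    for is_eligible, target in zip(eligible, p_id_recipient):
        if is_eligible and target in counts:
            counts[target] += 1
    return [kindergeld_for_claimant(counts[p]) for p in p_id]


p_id = [0, 1, 2, 3, 4]
eligible = [False, False, True, True, True]  # binary eligibility per child

all_to_parent_0 = kindergeld_m(p_id, eligible, [-1, -1, 0, 0, 0])
split_claims = kindergeld_m(p_id, eligible, [-1, -1, 0, 0, 1])
```

With these made-up rates the same three children yield 350.0 in total if one parent claims all of them, but only 300.0 if the claims are split two to one, so the amount only exists once the recipient is known.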

hmgaudecker commented 9 months ago

  1. Yes, I think it's fine to generally think in two steps: binary eligibility at the child level, with the amount calculated only where it is required, typically at the parent level.
  2. Special cases: nothing to be done in this PR. It should be possible to just point at the kids themselves some day, if necessary.

MImmesberger commented 9 months ago

I implemented the parent-child links via a dict in config.py. The implementation doesn't build on the parent-child logic but works via the pointers specified under the id_col key:

PARENT_CHILD_LINKED_TARGETS = {
    "eink_st_kinderfreib_anz_ansprüche": {
        "id_col": [
            "p_id_kinderfreib_empfänger_1",
            "p_id_kinderfreib_empfänger_2",
        ],
        "source_col": "kindergeld_anspruch",
    },
    "erziehungsgeld_eltern_m": {
        "id_col": "p_id_erziehgeld_empf",
        "source_col": "erziehungsgeld_kind_m",
    },
    "kindergeld_anz_anprüche": {
        "id_col": "p_id_kindergeld_empf",
        "source_col": "kindergeld_anspruch",
    },
}
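A rough sketch of how a specification of this shape could be evaluated. This is not the code in the PR: the helpers `sum_by_p_id` and `compute_linked_targets`, the dict-of-lists data layout, and -1 for an unspecified pointer are assumptions, and only two of the entries above are used.

```python
def sum_by_p_id(source, pointer_cols, p_id):
    """Sum `source` at every person named in any of the pointer columns."""
    zero = source[0] * 0  # 0 for bool/int sources, 0.0 for floats (non-empty input)
    totals = {p: zero for p in p_id}
    for col in pointer_cols:
        for value, target in zip(source, col):
            if target in totals:
                totals[target] += value
    return [totals[p] for p in p_id]


def compute_linked_targets(spec, data):
    """Evaluate every target in `spec`; id_col may be a string or a list."""
    results = {}
    for target, cfg in spec.items():
        id_cols = cfg["id_col"]
        if isinstance(id_cols, str):
            id_cols = [id_cols]
        results[target] = sum_by_p_id(
            data[cfg["source_col"]], [data[c] for c in id_cols], data["p_id"]
        )
    return results


spec = {
    "eink_st_kinderfreib_anz_ansprüche": {
        "id_col": ["p_id_kinderfreib_empfänger_1", "p_id_kinderfreib_empfänger_2"],
        "source_col": "kindergeld_anspruch",
    },
    "erziehungsgeld_eltern_m": {
        "id_col": "p_id_erziehgeld_empf",
        "source_col": "erziehungsgeld_kind_m",
    },
}
data = {  # toy data: persons 0 and 1 are the parents of person 2
    "p_id": [0, 1, 2],
    "kindergeld_anspruch": [False, False, True],
    "p_id_kinderfreib_empfänger_1": [-1, -1, 0],
    "p_id_kinderfreib_empfänger_2": [-1, -1, 1],
    "erziehungsgeld_kind_m": [0.0, 0.0, 300.0],
    "p_id_erziehgeld_empf": [-1, -1, 1],
}
results = compute_linked_targets(spec, data)
```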

@lars-reimann I have issues with the tests. On my machine, everything works fine (even when I set USE_JAX to True), but here they fail and I have no idea why. Could you have a look?

lars-reimann commented 9 months ago

> @lars-reimann I have issues with the tests. On my machine, everything works fine (even when I set USE_JAX to True), but here they fail and I have no idea why. Could you have a look?

@MImmesberger You could try making any change to the environment.yml file and pushing that, so a new environment gets created in CI. Currently, it uses a cached one.

lars-reimann commented 9 months ago

Hmm, the expected values are all floating point numbers (e.g. [right]: [9312.0, 4656.0, 0.0, 0.0]), while the returned numbers are ints (e.g. [left]: [9540, 4770, 0, 0]). Is there some unintended conversion happening along the way?

MImmesberger commented 9 months ago

> Hmm, the expected values are all floating point numbers

The error occurs because I used the wrong parameters. What is more interesting is why pytest didn't complain on my (and Lars') local machine. It complained after I changed the relevant parameters from floats to ints, so I suppose one weak spot in my implementation is that it matters whether a parameter is used in int or float format.

I don't know what the annotations exactly do, but could it be this line? @lars-reimann

annotations["returns"] = int if annotations["source_col"] in (int, bool) else float

Else, I don't know what could create such a behavior.
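If the quoted line does what it reads like, the result type simply follows the type of the source column. The sketch below mirrors that rule in plain Python; `return_type_for` and `sum_by_p_id_typed` are hypothetical names, the numbers are toy values, and whether this is really what happens inside GETTSIM is an assumption.

```python
def return_type_for(source_type):
    """Mirror of the quoted rule: int for int/bool sources, float otherwise."""
    return int if source_type in (int, bool) else float


def sum_by_p_id_typed(source, pointer, p_id, source_type):
    out_type = return_type_for(source_type)
    totals = {p: out_type(0) for p in p_id}
    for value, target in zip(source, pointer):
        if target in totals:
            totals[target] += out_type(value)
    return [totals[p] for p in p_id]


p_id = [0, 1, 2, 3]
pointer = [-1, -1, 0, 0]

# The same amounts, once stored as ints and once as floats.
from_ints = sum_by_p_id_typed([0, 0, 4770, 4770], pointer, p_id, int)
from_floats = sum_by_p_id_typed([0.0, 0.0, 4770.0, 4770.0], pointer, p_id, float)

same_values = from_ints == from_floats
same_types = [type(x) for x in from_ints] == [type(x) for x in from_floats]
```

The values agree but the types do not, so a comparison that also checks types would be sensitive to whether a parameter is stored as an int or a float.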

MImmesberger commented 9 months ago

On the other hand I am not able to reproduce the pytest-not-complaining-thing when jumping to the previous commits.

MImmesberger commented 9 months ago

Ah, I figured it out. I didn't know that the tests are running on the main branch. This PR wasn't up to date at that time. Sorry for the mess.

hmgaudecker commented 9 months ago

> Ah, I figured it out. I didn't know that the tests are running on the main branch. This PR wasn't up to date at that time. Sorry for the mess.

Not sure what you mean? They are not running on the main branch but on the branch associated with the PR. And locally on whatever branch you have checked out.

The "main" above refers to the workflow name specified here.

MImmesberger commented 9 months ago

> Not sure what you mean? They are not running on the main branch but on the branch associated with the PR. And locally on whatever branch you have checked out.

Then I misinterpreted an article, thanks!

MImmesberger commented 9 months ago

I'm done. Parent-child links are specified as dicts in the respective module, just as the aggregation specifications. The current implementation allows for specifications provided by the user (as with the aggregation functions).

Apparently, there is no codecov base report. We hit 82% of diff, but as far as I can tell this is largely due to lines that weren't touched in this PR.

MImmesberger commented 9 months ago

GEPs and documentation are updated! @hmgaudecker once you have reviewed it, we can merge. I would vote for getting rid of the other tus and correctly specifying the Kindergeld functions in another PR.

MImmesberger commented 9 months ago

Codecov complains because i) we didn't test (and implement) the new aggregation functions and ii) internal handling of aggregation_by_p_id is not tested (as discussed yesterday). Would vote for ignoring it.