Linking children's transfers to their parents.

MImmesberger commented 11 months ago

This PR is supposed to link transfers on the children level directly to their parents based on p_id_elternteil_1 and p_id_elternteil_2.

This affects the calculation of Kinderfreibetrag, Kindergeld, Erziehungsgeld and Elterngeld.

Additionally, take-up issues are relevant regarding the Kindergeld, Erziehungsgeld and Elterngeld (does p_id_elternteil_1 or p_id_elternteil_2 take up the transfer?).

MImmesberger commented 11 months ago

I thought about the implementation for some time and realised that having parents-children links would require larger changes to the grouping and aggregation code (at least _create_aggregation_functions and _create_derived_functions) as groups would be overlapping (because children belong to an "elternteil_1" and "elternteil_2" group and because children may have children themselves).

An alternative would be to handle p_id-specific operations across observations separately from the currently existing aggregation steps.

Have we already discussed how we want to implement his? @hmgaudecker @lars-reimann

hmgaudecker commented 11 months ago

Good points. My thinking has been somewhat sloppy indeed.

We may need a differentiation between variables
- Who is the reason for a claim (i.e., the kid)
- Who actually makes the claim / to whom it is paid out (Kindergeld) / who benefits from it (Kinderfreibetrag) (typically one or both parents, but for kids aged 18-24, it could actually be the kid itself, I believe)
We probably need three steps
1. Determining whether someone is reason for a claim (rules in BKGG §1, §2, store an indicator with the kid
2. Distribute Kindergeld across parents (not sure about institutions here, i.e., whether it makes sense to set half a claim for both and have a separate variable for payout, which is always to one person only; or to go with one payout only)
3. Distribute Kinderfreibeträge across parents (typically one each, but I think this can be overridden by choice)
For 2. and 3., the initial variables will be stored with the kid, then summed up by the each parent's p_id. Just need to make sure this is merged by p_id again, so the resulting vector has the correct length and order.

codecov[bot] commented 10 months ago

Codecov Report

Attention: 115 lines in your changes are missing coverage. Please review.

Comparison is base (f295be3) 91.54% compared to head (3aea9fa) 89.22%.

Files	Patch %	Lines
src/_gettsim/functions_loader.py	64.51%	44 Missing :warning:
src/_gettsim/aggregation.py	4.76%	40 Missing :warning:
src/_gettsim/aggregation_numpy.py	47.45%	31 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #676 +/- ## ========================================== - Coverage 91.54% 89.22% -2.33% ========================================== Files 51 51 Lines 3443 3592 +149 ========================================== + Hits 3152 3205 +53 - Misses 291 387 +96 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

MImmesberger commented 10 months ago

How we implement the required changes depends on how strong we want to adhere to the current p_id_elternteil_x design. We need to the discuss how the user inputs look like.

Taking Kindergeld as an example we could do it this way:

GETTSIM calculates the transfer the person in its own right is eligible for (on the individual level). We already do this for Kindergeld and Erziehungsgeld I think. The target could be something like kindergeld_anspruch (float). We do this for every person in the dataset (so also for rows that are not eligible for Kindergeld).
We use user-specified pointers to the p_id of the one who actually claims (and receives) the benefit. The basic input column could be here p_id_kindergeld_auszahlung. This is specified on the child level. We can also add defaults and checks whether the user input is correct (i.e. the person that claims the benefit can actually do so).
The benefit is aggregated on the basis of p_id_kindergeld_auszahlung and summarized in a variable kindergeld_m.

An alternative would be the following:

We utilise the existing p_id_elternteil_x columns. The procedure is the same as above, but in step 2, the user specifies a variable kindergeld_elternteil_auszahlung that determines whether parent 1 or 2 (or the child) receives the transfer. The downside is that we cannot represent transfers that should be payed out to someone else (e.g. think of grandparents receiving Kindergeld, but also Unterhalt from the parents).

hmgaudecker commented 10 months ago

I am afraid I am unable to fully follow. The basic structure seems the same, only whether we add an explicit pointer for p_id_kindergeldempfänger (that seems to be the "right" name to me)?

If that is the case, I think we should go for the explicit one.

kinderfreibetrag is likely the more intricate case. Typically each parent receives one, but there are cases where one parent gets both. So we might need something like p_id_kinderfreibetragempfänger_1 and p_id_kinderfreibetragempfänger_2?

hmgaudecker commented 10 months ago

In general, we always prefer the user specifying things explicitly (even if it creates a slightly higher burden on her) rather than assuming things which cover 90% of the use cases and are wrong for the remainder.

MImmesberger commented 10 months ago

I agree, that is what I had in mind as well.

Then we should add tests of the user input at some point (e.g. if both parents are specified via p_id_kinderfreibetragempfänger_1 and p_id_kinderfreibetragempfänger_2, one of those two persons (or the child itself) must receive the Kindergeld and no one else).

hmgaudecker commented 10 months ago

Yes, I am afraid that there will be plenty of data consistency checks in these. I do think, however, that we can add them as we go along. No need to research all of them before they crop up.

MImmesberger commented 10 months ago

I added some test files and renamed some functions to illustrate the changes necessary. I propose the following convention:

Take some tax or Freibetrag called x:

x_kind_m yields the transfer or Freibetrag the individual is eligible for in its own right. I changed the functions for Kindergeld, Erziehungsgeld and Kinderfreibetrag accordingly.
x_eltern_m aggregates the transfers that some person gets via a new input colum: p_id_x_empfänger. One exception: Kinderfreibeträge are distributed via p_id_kinderfreibetrag_empfänger_1 and p_id_kinderfreibetrag_empfänger_2. This is the step that needs to be implemented.
We rename x_eltern_m to x_m via a simple function. This is necessary because for some x we need to make checks on the parent level before the transfer can be payed out (e.g. Erziehungsgeld, check whether parents don't work too much).

In case you have different naming suggestions let me know (I'm not too happy with x_eltern_m but I can't think of a better name that is short and applies to all children related transfers).

MImmesberger commented 9 months ago

To summarize the discussion so far, there are two steps:

Creating p_id_kinderfreibetrag_1 and p_id_kinderfreibetrag_2

In general, this just mirrors p_id_elternteil_1 and p_id_elternteil_2.
If there are no parents, the user can set p_id_kinderfreibetrag_1 and p_id_kinderfreibetrag_2 herself, but we set the usual -1 default value.
If only one parent is specified via p_id_elternteil_1, p_id_kinderfreibetrag_1 points to that parent, but not p_id_kinderfreibetrag_2.

(A side note: Having one parent that receives both Kinderfreibeträge is only possible if the other parent doesn't satisfy alimony requirements to some extent. I would prefer to leave that out for now.)

Aggregating the children-related transfers / the children-induced entitlements for Freibeträge to their parents.

This is based on the pointers that are endogenously created (for the case of Kinderfreibetrag) or specified by the user (all transfers).
For $x\in$ {eink_st_kinderfreib, erziehungsgeld, kindergeld} we need to aggregate individual-level columns marked via x_kind (plus the usual aggregation suffixes) to a new column x_eltern based on p_id_x_empfänger (transfers) or p_id_x_empfänger_1/p_id_x_empfänger_2 (Freibeträge).

The tests that I wrote should cover all the new features that are discussed here.

Does that make sense to everyone? @hmgaudecker @lars-reimann

lars-reimann commented 9 months ago

From a design point of view, this seems to be a reasonable solution.

hmgaudecker commented 9 months ago

Sounds great! Just one little clarification question:

(A side note: Having one parent that receives both Kinderfreibeträge is only possible if the other parent doesn't satisfy alimony requirements to some extent. I would prefer to leave that out for now.)

Not sure what you mean by "leave out":

Disallow the possibility that p_id_kinderfreibetrag_1 and p_id_kinderfreibetrag_2 point to the same person?
Do not attempt to model the rules that lead to p_id_kinderfreibetrag_1 and p_id_kinderfreibetrag_2 pointing to the same person when the user does not specify these two p_id-variables explicitly?

I'd be all for 2. (probably not even possible as this seems to be decided on a case-by-case basis), but would prefer to allow for 1.. So users should be able to make it happen, but the typical use case will not cover it.

MImmesberger commented 9 months ago

Yes, should have been more clear about this. That's exactly what I had in mind. We generally allow for double Freibeträge but we don't create them endogenously -- they have to be enforced by the user.

MImmesberger commented 9 months ago

I realized that parent-child links in the case of Kindergeld are more complex than Freibeträge and Erziehungsgeld because in the past Kindergeld depended on the number of children.

This implies that the notion of "Kindergeld belongs to the child but is payed out to the parent" is wrong in this case; we cannot assign the amount of Kindergeld that the child is eligible for without knowing i) which parent claims Kindergeld and ii) for how many children this parent claims Kindergeld (think of couples with children from different partners).

This sounds like we need to treat Kindergeld very similar to Kinderfreibeträge: It doesn't exist on the child level, but it may be payed out to the child instead of the parent in special cases. However, now I'm not sure whether we want to even cover those special cases?

hmgaudecker commented 9 months ago

Yes, I think it's fine to generally think in two steps — binary eligibility at the child level and do the amount calculation only where it is required, typically parents.
Special cases: nothing to be done in this PR. Should be possible to just point at the kids itself some day, if necessary.

MImmesberger commented 9 months ago

I implemented the parent-child links via a dict in the config.py. The implementation doesn't build on the parent-child logic but works via the pointers specified under the id_col key:

PARENT_CHILD_LINKED_TARGETS = {
    "eink_st_kinderfreib_anz_ansprüche": {
        "id_col": [
            "p_id_kinderfreib_empfänger_1",
            "p_id_kinderfreib_empfänger_2",
        ],
        "source_col": "kindergeld_anspruch",
    },
    "erziehungsgeld_eltern_m": {
        "id_col": "p_id_erziehgeld_empf",
        "source_col": "erziehungsgeld_kind_m",
    },
    "kindergeld_anz_anprüche": {
        "id_col": "p_id_kindergeld_empf",
        "source_col": "kindergeld_anspruch",
    },
}

@lars-reimann I have issues with the tests. On my machine, everything works fine (even when I set USE_JAX to True), but here they fail and I have no idea why. Could you have a look?

lars-reimann commented 9 months ago

@lars-reimann I have issues with the tests. On my machine, everything works fine (even when I set USE_JAX to True), but here they fail and I have no idea why. Could you have a look?

The tests work on my system, too.
After updating the conda environment with conda env update --file environment.yml --prune, the tests still work.
The expected/actual values are relatively close to each other. Could be different floating point precision along the way?
Dependencies are not pinned, so different versions might be used locally and during CI. This might be the cause for the precision differences.

@MImmesberger You could try making any change to the environment.yml file and pushing that, so a new environment gets created in CI. Currently, it uses a cached one.

lars-reimann commented 9 months ago

Hmm, the expected values are all floating point numbers (e.g. [right]: [9312.0, 4656.0, 0.0, 0.0]), while the returned numbers are ints (e.g. [left]: [9540, 4770, 0, 0]). Is there some unintended conversion happening along the way?

MImmesberger commented 9 months ago

Hmm, the expected values are all floating point numbers

The error occurs because I used the wrong parameters. What is more interesting is why pytest didn't complain on my (and Lars') local machine. It complained after I changed the relevant parameters from floats to ints, so I suppose one weak spot in my implementation is that it matters whether a parameter is used in int or float format.

I don't know what the annotations exactly do, but could it be this line? @lars-reimann

annotations["returns"] = int if annotations["source_col"] in (int, bool) else float

Else, I don't know what could create such a behavior.

MImmesberger commented 9 months ago

On the other hand I am not able to reproduce the pytest-not-complaining-thing when jumping to the previous commits.

MImmesberger commented 9 months ago

Ah, I figured it out. I didn't know that the tests are running on the main branch. This PR wasn't up to date at that time. Sorry for the mess.

hmgaudecker commented 9 months ago

Ah, I figured it out. I didn't know that the tests are running on the main branch. This PR wasn't up to date at that time. Sorry for the mess.

Not sure what you mean? They are not running on the main branch but on the branch associated with the PR. And locally on whatever branch you have checked out.

The "main" above refers to the workflow name specified here.

MImmesberger commented 9 months ago

Not sure what you mean? They are not running on the main branch but on the branch associated with the PR. And locally on whatever branch you have checked out.

Then I misinterpreted at article, thanks!

MImmesberger commented 9 months ago

I'm done. Parent-child links are specified as dicts in the respective module, just as the aggregation specifications. The current implementation allows for specifications provided by the user (as with the aggregation functions).

Apparently, there is no codecov base report. We hit 82% of diff, but as far as I can tell this is largely due to lines that weren't touched in this PR.

MImmesberger commented 9 months ago

GEPs and documentation are updated! @hmgaudecker after you reviewed it, we can merge. I would vote for getting rid of the other tus and correctly specifying the Kindergeld functions in another PR.

MImmesberger commented 9 months ago

Codecov complains because i) we didn't test (and implement) the new aggregation functions and ii) internal handling of aggregation_by_p_id is not tested (as discussed yesterday). Would vote for ignoring it.

iza-institute-of-labor-economics / gettsim

Linking children's transfers to their parents. #676

Codecov Report