Closed MImmesberger closed 9 months ago
I thought about the implementation for some time and realised that having parents-children links would require larger changes to the grouping and aggregation code (at least _create_aggregation_functions
and _create_derived_functions
) as groups would be overlapping (because children belong to an "elternteil_1" and "elternteil_2" group and because children may have children themselves).
An alternative would be to handle p_id
-specific operations across observations separately from the currently existing aggregation steps.
Have we already discussed how we want to implement his? @hmgaudecker @lars-reimann
Good points. My thinking has been somewhat sloppy indeed.
p_id
. Just need to make sure this is merged by p_id again, so the resulting vector has the correct length and order.Attention: 115 lines
in your changes are missing coverage. Please review.
Comparison is base (
f295be3
) 91.54% compared to head (3aea9fa
) 89.22%.
Files | Patch % | Lines |
---|---|---|
src/_gettsim/functions_loader.py | 64.51% | 44 Missing :warning: |
src/_gettsim/aggregation.py | 4.76% | 40 Missing :warning: |
src/_gettsim/aggregation_numpy.py | 47.45% | 31 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
How we implement the required changes depends on how strong we want to adhere to the current p_id_elternteil_x
design. We need to the discuss how the user inputs look like.
Taking Kindergeld as an example we could do it this way:
kindergeld_anspruch
(float). We do this for every person in the dataset (so also for rows that are not eligible for Kindergeld).p_id
of the one who actually claims (and receives) the benefit. The basic input column could be here p_id_kindergeld_auszahlung
. This is specified on the child level. We can also add defaults and checks whether the user input is correct (i.e. the person that claims the benefit can actually do so).p_id_kindergeld_auszahlung
and summarized in a variable kindergeld_m
. An alternative would be the following:
We utilise the existing p_id_elternteil_x
columns. The procedure is the same as above, but in step 2, the user specifies a variable kindergeld_elternteil_auszahlung
that determines whether parent 1
or 2
(or the child) receives the transfer. The downside is that we cannot represent transfers that should be payed out to someone else (e.g. think of grandparents receiving Kindergeld, but also Unterhalt from the parents).
I am afraid I am unable to fully follow. The basic structure seems the same, only whether we add an explicit pointer for p_id_kindergeldempfänger
(that seems to be the "right" name to me)?
If that is the case, I think we should go for the explicit one.
kinderfreibetrag
is likely the more intricate case. Typically each parent receives one, but there are cases where one parent gets both. So we might need something like p_id_kinderfreibetragempfänger_1
and p_id_kinderfreibetragempfänger_2
?
In general, we always prefer the user specifying things explicitly (even if it creates a slightly higher burden on her) rather than assuming things which cover 90% of the use cases and are wrong for the remainder.
I agree, that is what I had in mind as well.
Then we should add tests of the user input at some point (e.g. if both parents are specified via p_id_kinderfreibetragempfänger_1
and p_id_kinderfreibetragempfänger_2
, one of those two persons (or the child itself) must receive the Kindergeld and no one else).
Yes, I am afraid that there will be plenty of data consistency checks in these. I do think, however, that we can add them as we go along. No need to research all of them before they crop up.
I added some test files and renamed some functions to illustrate the changes necessary. I propose the following convention:
Take some tax or Freibetrag called x
:
x_kind_m
yields the transfer or Freibetrag the individual is eligible for in its own right. I changed the functions for Kindergeld, Erziehungsgeld and Kinderfreibetrag accordingly.x_eltern_m
aggregates the transfers that some person gets via a new input colum: p_id_x_empfänger
. One exception: Kinderfreibeträge are distributed via p_id_kinderfreibetrag_empfänger_1
and p_id_kinderfreibetrag_empfänger_2
. This is the step that needs to be implemented.x_eltern_m
to x_m
via a simple function. This is necessary because for some x
we need to make checks on the parent level before the transfer can be payed out (e.g. Erziehungsgeld, check whether parents don't work too much).In case you have different naming suggestions let me know (I'm not too happy with x_eltern_m
but I can't think of a better name that is short and applies to all children related transfers).
To summarize the discussion so far, there are two steps:
Creating p_id_kinderfreibetrag_1
and p_id_kinderfreibetrag_2
p_id_elternteil_1
and p_id_elternteil_2
. p_id_kinderfreibetrag_1
and p_id_kinderfreibetrag_2
herself, but we set the usual -1
default value.p_id_elternteil_1
, p_id_kinderfreibetrag_1
points to that parent, but not p_id_kinderfreibetrag_2
.(A side note: Having one parent that receives both Kinderfreibeträge is only possible if the other parent doesn't satisfy alimony requirements to some extent. I would prefer to leave that out for now.)
Aggregating the children-related transfers / the children-induced entitlements for Freibeträge to their parents.
x_kind
(plus the usual aggregation suffixes) to a new column x_eltern
based on p_id_x_empfänger
(transfers) or p_id_x_empfänger_1
/p_id_x_empfänger_2
(Freibeträge).The tests that I wrote should cover all the new features that are discussed here.
Does that make sense to everyone? @hmgaudecker @lars-reimann
From a design point of view, this seems to be a reasonable solution.
Sounds great! Just one little clarification question:
(A side note: Having one parent that receives both Kinderfreibeträge is only possible if the other parent doesn't satisfy alimony requirements to some extent. I would prefer to leave that out for now.)
Not sure what you mean by "leave out":
p_id_kinderfreibetrag_1
and p_id_kinderfreibetrag_2
point to the same person?p_id_kinderfreibetrag_1
and p_id_kinderfreibetrag_2
pointing to the same person when the user does not specify these two p_id
-variables explicitly?I'd be all for 2. (probably not even possible as this seems to be decided on a case-by-case basis), but would prefer to allow for 1.. So users should be able to make it happen, but the typical use case will not cover it.
Yes, should have been more clear about this. That's exactly what I had in mind. We generally allow for double Freibeträge but we don't create them endogenously -- they have to be enforced by the user.
I realized that parent-child links in the case of Kindergeld are more complex than Freibeträge and Erziehungsgeld because in the past Kindergeld depended on the number of children.
This implies that the notion of "Kindergeld belongs to the child but is payed out to the parent" is wrong in this case; we cannot assign the amount of Kindergeld that the child is eligible for without knowing i) which parent claims Kindergeld and ii) for how many children this parent claims Kindergeld (think of couples with children from different partners).
This sounds like we need to treat Kindergeld very similar to Kinderfreibeträge: It doesn't exist on the child level, but it may be payed out to the child instead of the parent in special cases. However, now I'm not sure whether we want to even cover those special cases?
I implemented the parent-child links via a dict in the config.py
. The implementation doesn't build on the parent-child logic but works via the pointers specified under the id_col
key:
PARENT_CHILD_LINKED_TARGETS = {
"eink_st_kinderfreib_anz_ansprüche": {
"id_col": [
"p_id_kinderfreib_empfänger_1",
"p_id_kinderfreib_empfänger_2",
],
"source_col": "kindergeld_anspruch",
},
"erziehungsgeld_eltern_m": {
"id_col": "p_id_erziehgeld_empf",
"source_col": "erziehungsgeld_kind_m",
},
"kindergeld_anz_anprüche": {
"id_col": "p_id_kindergeld_empf",
"source_col": "kindergeld_anspruch",
},
}
@lars-reimann I have issues with the tests. On my machine, everything works fine (even when I set USE_JAX
to True), but here they fail and I have no idea why. Could you have a look?
@lars-reimann I have issues with the tests. On my machine, everything works fine (even when I set
USE_JAX
to True), but here they fail and I have no idea why. Could you have a look?
conda
environment with conda env update --file environment.yml --prune
, the tests still work.@MImmesberger You could try making any change to the environment.yml
file and pushing that, so a new environment gets created in CI. Currently, it uses a cached one.
Hmm, the expected values are all floating point numbers (e.g. [right]: [9312.0, 4656.0, 0.0, 0.0]
), while the returned numbers are ints (e.g. [left]: [9540, 4770, 0, 0]
). Is there some unintended conversion happening along the way?
Hmm, the expected values are all floating point numbers
The error occurs because I used the wrong parameters. What is more interesting is why pytest didn't complain on my (and Lars') local machine. It complained after I changed the relevant parameters from floats to ints, so I suppose one weak spot in my implementation is that it matters whether a parameter is used in int or float format.
I don't know what the annotations exactly do, but could it be this line? @lars-reimann
annotations["returns"] = int if annotations["source_col"] in (int, bool) else float
Else, I don't know what could create such a behavior.
On the other hand I am not able to reproduce the pytest-not-complaining-thing when jumping to the previous commits.
Ah, I figured it out. I didn't know that the tests are running on the main branch. This PR wasn't up to date at that time. Sorry for the mess.
Ah, I figured it out. I didn't know that the tests are running on the main branch. This PR wasn't up to date at that time. Sorry for the mess.
Not sure what you mean? They are not running on the main branch but on the branch associated with the PR. And locally on whatever branch you have checked out.
The "main" above refers to the workflow name specified here.
Not sure what you mean? They are not running on the main branch but on the branch associated with the PR. And locally on whatever branch you have checked out.
Then I misinterpreted at article, thanks!
I'm done. Parent-child links are specified as dicts in the respective module, just as the aggregation specifications. The current implementation allows for specifications provided by the user (as with the aggregation functions).
Apparently, there is no codecov base report. We hit 82% of diff, but as far as I can tell this is largely due to lines that weren't touched in this PR.
GEPs and documentation are updated! @hmgaudecker after you reviewed it, we can merge. I would vote for getting rid of the other tu
s and correctly specifying the Kindergeld functions in another PR.
Codecov complains because i) we didn't test (and implement) the new aggregation functions and ii) internal handling of aggregation_by_p_id is not tested (as discussed yesterday). Would vote for ignoring it.
This PR is supposed to link transfers on the children level directly to their parents based on
p_id_elternteil_1
andp_id_elternteil_2
.This affects the calculation of Kinderfreibetrag, Kindergeld, Erziehungsgeld and Elterngeld.
Additionally, take-up issues are relevant regarding the Kindergeld, Erziehungsgeld and Elterngeld (does
p_id_elternteil_1
orp_id_elternteil_2
take up the transfer?).