iza-institute-of-labor-economics / gettsim

The GErman Taxes and Transfers SIMulator
https://gettsim.readthedocs.io/
GNU Affero General Public License v3.0

BUG: `join_numpy` return if foreign key is < 0 #740

Closed MImmesberger closed 5 months ago

MImmesberger commented 5 months ago

In the Unterhaltsvorschuss module, we use the join_numpy function to determine whether the parent that receives Kindergeld for a specific child is a single parent:

```python
def parent_alleinerz(
    p_id_kindergeld_empf: np.ndarray[int],
    p_id: np.ndarray[int],
    alleinerz: np.ndarray[bool],
):
    return join_numpy(p_id_kindergeld_empf, p_id, alleinerz)
```

This returns True if p_id_kindergeld_empf is not determined, i.e. set to -1. True might be a bad default in this case.

Proposed solution

Allow for a default in the join_numpy function that is returned if the foreign key is not determined.

hmgaudecker commented 5 months ago

IIUC, it does not return True in all cases, but the last value of the array ([-1]).
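
To illustrate the `[-1]` point: NumPy counts a negative index from the end of the array, so a lookup that ends up indexing the target with a raw foreign key of -1 silently picks the last row. The snippet below is a minimal sketch of that effect with made-up data. It assumes the keys coincide with row positions and is not the actual `join_numpy` implementation.

```python
import numpy as np

# Made-up data; p_id coincides with the row position here for simplicity.
p_id = np.array([0, 1, 2, 3])
alleinerz = np.array([False, False, False, True])

# -1 marks "no Kindergeld recipient determined".
p_id_kindergeld_empf = np.array([1, -1, 2, -1])

# Naive positional lookup: each -1 wraps around to the last element.
naive = alleinerz[p_id_kindergeld_empf]
print(naive)
```

With this data the two undetermined rows come back as True only because the last person happens to be a single parent. With a different last row they would come back as False.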

I think the behaviour should special-case foreign keys below zero and return our default for missing values (-1 in case of ints, np.nan in case of floats, error (?) for bool).

We could also achieve that via an extra argument as @MImmesberger suggested, I'd be fine with both.
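
Both variants can be combined in one function. What follows is a rough sketch under stated assumptions, not gettsim's code: `join_with_default` and its `default` argument are hypothetical names, primary keys are assumed to be unique and non-empty, and foreign keys that match no primary key are treated like missing ones (a choice of this sketch, not something decided in this thread).

```python
import numpy as np


def join_with_default(foreign_key, primary_key, target, default=None):
    """Hypothetical sketch of a join that returns a default for missing keys."""
    foreign_key = np.asarray(foreign_key)
    primary_key = np.asarray(primary_key)
    target = np.asarray(target)

    if default is None:
        # Infer the missing-value default from the dtype of the target column.
        if target.dtype == np.bool_:
            raise ValueError("bool targets need an explicit default")
        if np.issubdtype(target.dtype, np.integer):
            default = -1
        elif np.issubdtype(target.dtype, np.floating):
            default = np.nan
        else:
            raise TypeError(f"no default known for dtype {target.dtype}")

    # Sort the primary keys once so each foreign key can be found by binary search.
    order = np.argsort(primary_key)
    sorted_keys = primary_key[order]
    pos = np.searchsorted(sorted_keys, foreign_key)
    pos = np.clip(pos, 0, len(sorted_keys) - 1)

    # A foreign key counts as resolved only if it is non-negative and present.
    found = (foreign_key >= 0) & (sorted_keys[pos] == foreign_key)
    return np.where(found, target[order[pos]], default)


# Made-up example data.
p_id = np.array([10, 11, 12])
alleinerz = np.array([False, True, True])
p_id_kindergeld_empf = np.array([11, -1, 10])

result = join_with_default(p_id_kindergeld_empf, p_id, alleinerz, default=False)
print(result)
```

For bool targets the caller has to pass `default` explicitly. For int and float targets the default falls back to -1 and `np.nan` respectively.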

MImmesberger commented 5 months ago

Inferring the default value from the target column data type sounds good to me. However, we don't want errors for bools (in my case, I use the parent ID as a foreign key, so there will always be missings). Maybe we can set the default explicitly for bools only? The correct default probably depends on the context.

hmgaudecker commented 5 months ago

If we need missings, we'll need to convert bools to ints at the column/function level for as long as JAX has no support for missing values.
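
One way to read this: a bool array has no spare value for "missing", so a three-state column would be stored as integers, e.g. 1/0 for True/False and -1 for missing. A minimal sketch with plain NumPy and made-up data follows. The variable names and the 1/0/-1 encoding are assumptions for illustration, not an agreed convention.

```python
import numpy as np

# Made-up joined values and a mask of rows whose foreign key was resolved.
values = np.array([False, True, True])
resolved = np.array([True, False, True])

# Encode True/False as 1/0 and use -1 for rows without a resolved key.
as_int = np.where(resolved, values.astype(np.int64), -1)
print(as_int)

# Downstream code then has to decide explicitly how to treat -1.
is_single_parent = as_int == 1
```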

MImmesberger commented 5 months ago

Just in case there is a misunderstanding: The missings that I was referring to are the -1s of p_id_elternteil_x. In my mind, the new column won't have missings because we set defaults if the foreign key is -1.

hmgaudecker commented 5 months ago

> Just in case there is a misunderstanding: The missings that I was referring to are the -1s of p_id_elternteil_x. In my mind, the new column won't have missings because we set defaults if the foreign key is -1.

Fair enough, the best thing is to be explicit indeed.