iza-institute-of-labor-economics / gettsim

The GErman Taxes and Transfers SIMulator
https://gettsim.readthedocs.io/
GNU Affero General Public License v3.0
54 stars 32 forks source link

ENH: Use pointers to make target columns available at source level #708

Closed MImmesberger closed 5 months ago

MImmesberger commented 6 months ago

Is your feature request related to a problem?

Currently, pointers allow us to aggregate variables from source individuals (e.g. children) to target individuals (e.g. parents). Sometimes it is necessary to make target columns available on the source level (e.g. in the case of Unterhaltsvorschuss).

For more details, see the discussion we had here.

MImmesberger commented 6 months ago

Once this is implemented, we might be able to get rid of the conditions in kinderzuschl_bruttoeink_eltern_m and similar functions (see this comment).

lars-reimann commented 6 months ago

Should this be yet another addition to the complex name convention for parameters, like _ or how do you picture a general solution for this? Or should this be solved on a case-by-case basis?

Also, can't this be solved by a simple DataFrame.join where this connection is needed?

hmgaudecker commented 6 months ago

In principle this is a simple join, pinky that we don't have dataframes but numpy/Jax arrays.

The naming convention you propose sounds great!

lars-reimann commented 6 months ago

The naming convention you propose sounds great!

Ok, this would also need

  1. A distinct separator between foreign_key_column and target_column (not an underscore),
  2. A distinct keyword to indicate the desired interpretation of this parameter.

Both requirements could be fulfilled at once by having a unique infix, like <foreign_key_column>_to_<target_column>. Though we should probably pick something less likely to occur in other parameter names than _to_.

That said, I find the current name convention already quite hard to work with due to all the "magic" it encodes (time conversion, aggregation, renaming via dates_active, ...). Because of this, I'd rather suggest adding a utility function that takes three inputs:

  1. Array for the foreign keys.
  2. Array for the primary keys.
  3. Array for the targets.

This function can then return another array with the targets that belong to the respective foreign keys. Where needed, this function can be called explicitly. This is essentially the same as specifying a parameter using some name convention to implicitly invoke this function. If caching of the result is desired, the call can be wrapped into another function like _unterhaltsvors_empf_einkommen_m.

hmgaudecker commented 6 months ago

I don't follow all the way yet, I am afraid. How would that work with the DAG? How would somebody who wanted to extend GETTSIM call that function?

I'd also be fine with doing something provisional right now and then re-consider when re-designing the interface after the move to a hierarchical scheme.

lars-reimann commented 6 months ago

Let's say we go with some name convention and add a parameter p_id_unterhaltsvors_empf_to_einkommen_m to some policy function:

  1. This would implicitly add dependencies to the columns p_id_unterhaltsvors (foreign key), p_id (primary key), and einkommen_m (target). This does not seem obvious to users and needs special treatment on the implementation side.
  2. No such function exists anywhere, so it needs to be created on-the-fly. The created function will internally call something like join(p_id_unterhaltsvors, p_id, einkommen_m).

So what I'm asking is why we don't make the dependency to the three columns and the join call explicit:

def p_id_unterhaltsvors_empf_to_einkommen_m(
    p_id_unterhaltsvors,
    p_id,
    einkommen_m,
):
    return join(p_id_unterhaltsvors, p_id, einkommen_m)

The function p_id_unterhaltsvors_empf_to_einkommen_m becomes part of the DAG as usual and can be "called" by adding a parameter with the same name to another function. Likewise, it can be overwritten by data columns with the same name. Now, this function can also have any name, so we could also call it, say, _unterhaltsvors_empf_einkommen_m.

In this case, we would only provide the join function but not augment the name convention.

hmgaudecker commented 6 months ago

That is fine, but we would need a mechanism of not calling jnp.vectorize (or whatever it is called) on these functions, right? So far, all functions in GETTSIM (except for those special ones like aggregation etc.) are built on the premise that they are written for individual rows of the data. Hence my confusion.

hmgaudecker commented 6 months ago

But again,

I'd also be fine with doing something provisional right now and then re-consider when re-designing the interface after the move to a hierarchical scheme.

lars-reimann commented 6 months ago

We could introduce another decorator to skip vectorization.

hmgaudecker commented 6 months ago

Fine, if you prefer that route, let's take it. When redesigning the interface, we should re-visit the whole structure and see what makes most sense across all these different functions.