uber / causalml

Uplift modeling and causal inference with machine learning algorithms
Other
4.87k stars 756 forks source link

Fix missing value when calculate cumulative lift #707

Closed jpansnap closed 7 months ago

jpansnap commented 7 months ago

Proposed changes

Noticed that if there are missing values for the outcome column when calculate cumulative lift, the current .cumsum() function default as skipna=False which will return NaN when there are missing values as a results the calculation of lift could be off

image

Propose change to .fillna(0).cumsum(skipna=True).astype(float) (change to float to avoid the division by zero error as well)

image

Types of changes

What types of changes does your code introduce to CausalML? Put an x in the boxes that apply

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

Further comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc. This PR template is adopted from appium.

ras44 commented 7 months ago

Just took a quick look at this: could the same be achieved using only the .fillna(0) element and skipping the skipna=True and .astype(float)?

jpansnap commented 7 months ago

Just took a quick look at this: could the same be achieved using only the .fillna(0) element and skipping the skipna=True and .astype(float)?

Just .fillna(0) will give the division by zero error for example the first record is missing - that's why used .asype(float)

Adding skipna=True just trying to be more explicit and also in case some weird cased that filling na didn't work

vincewu51 commented 7 months ago

We had a discussion this morning, just want to provide some of our discussion here:

  1. it seems to me fillna(0) should be a decision by the user. Should we consider to provide the option to the user what is the value they want to fill
  2. If there is NA in the data, shouldn't the user handle it at the beginning of the model instead of at the plotting stage? Maybe we should just throw an error and suggest user to handle NA by themselves.
  3. To the very least, we should let user know their data has NA and causalml filled it.

What do you think?

jpansnap commented 7 months ago

We had a discussion this morning, just want to provide some of our discussion here:

  1. it seems to me fillna(0) should be a decision by the user. Should we consider to provide the option to the user what is the value they want to fill
  2. If there is NA in the data, shouldn't the user handle it at the beginning of the model instead of at the plotting stage? Maybe we should just throw an error and suggest user to handle NA by themselves.
  3. To the very least, we should let user know their data has NA and causalml filled it.

What do you think?

This is a quick fix as the main issue right now for this function which will just return empty values - not only plotting but also the AUUC will be off.

For 2. For my use case, I didn't use CausalML for modeling but just plotting - I would assume we want to provide support for individual module as well. For 1 and 3 - we can modify the code to be more robust please let me know which way we'd prefer to.

vincewu51 commented 7 months ago

We had a discussion this morning, just want to provide some of our discussion here:

  1. it seems to me fillna(0) should be a decision by the user. Should we consider to provide the option to the user what is the value they want to fill
  2. If there is NA in the data, shouldn't the user handle it at the beginning of the model instead of at the plotting stage? Maybe we should just throw an error and suggest user to handle NA by themselves.
  3. To the very least, we should let user know their data has NA and causalml filled it.

What do you think?

This is a quick fix as the main issue right now for this function which will just return empty values - not only plotting but also the AUUC will be off.

For 2. For my use case, I didn't use CausalML for modeling but just plotting - I would assume we want to provide support for individual module as well. For 1 and 3 - we can modify the code to be more robust please let me know which way we'd prefer to.

Got it, how about we go with option 1(provide the option to the user)? It is more robust and gives users more flexibility.

jeongyoonlee commented 7 months ago

This is a quick fix as the main issue right now for this function which will just return empty values - not only plotting but also the AUUC will be off.

For 2. For my use case, I didn't use CausalML for modeling but just plotting - I would assume we want to provide support for individual module as well. For 1 and 3 - we can modify the code to be more robust please let me know which way we'd prefer to.

What about throwing an error with a message informing that there's NaN in the input instead of returning null? It's not a good practice to impute NaN silently. Another option is that we can add an input argument for users to specify whether to impute NaN or not with the default being no imputation.

jpansnap commented 7 months ago

This is a quick fix as the main issue right now for this function which will just return empty values - not only plotting but also the AUUC will be off. For 2. For my use case, I didn't use CausalML for modeling but just plotting - I would assume we want to provide support for individual module as well. For 1 and 3 - we can modify the code to be more robust please let me know which way we'd prefer to.

What about throwing an error with a message informing that there's NaN in the input instead of returning null? It's not a good practice to impute NaN silently. Another option is that we can add an input argument for users to specify whether to impute NaN or not with the default being no imputation.

Yeah agree, we could throw the error and add the input argument. Also saw the comment in https://github.com/uber/causalml/pull/720