tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Apache License 2.0

Sample numerical uplift #181

Status: Closed (vitorsrg closed 1 year ago)

vitorsrg commented 1 year ago

Hi! Do you have a working snippet of uplift mode? I couldn't find any.

I've tried this very simple implementation:

import tensorflow_decision_forests as tfdf
import pandas as pd

df = pd.DataFrame(
    [
        [0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6],
        [0.7, 0.8, 0.9],
    ],
    columns=list("abc"))
ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    df,
    label="a",
    task=tfdf.keras.Task.NUMERICAL_UPLIFT)
model = tfdf.keras.GradientBoostedTreesModel(
    task=tfdf.keras.Task.NUMERICAL_UPLIFT,
    uplift_treatment="b")
model.fit(ds)

However, it fails with the following output:

2023-06-20 01:54:38.103514: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Use /var/folders/wn/x8pf0vlx58j_291rhkkk1p640000gn/T/tmpq7bwoiel as temporary training directory
[WARNING 23-06-20 01:54:43.2767 -03 gradient_boosted_trees.cc:1797] "goss_alpha" set but "sampling_method" not equal to "GOSS".
[WARNING 23-06-20 01:54:43.2789 -03 gradient_boosted_trees.cc:1808] "goss_beta" set but "sampling_method" not equal to "GOSS".
[WARNING 23-06-20 01:54:43.2789 -03 gradient_boosted_trees.cc:1822] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
Reading training dataset...
2023-06-20 01:54:43.317147: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_2' with dtype double and shape [3]
         [[{{node Placeholder/_2}}]]
Training dataset read in 0:00:04.492206. Found 3 examples.
Training model...
2023-06-20 01:54:47.804256: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at kernel_long_process.cc:152 : UNKNOWN: TensorFlow: INVALID_ARGUMENT: No defined default loss for this combination of label type and task
Traceback (most recent call last):
  File "./tfdf.py", line 19, in <module>
    model.fit(ds)
  File ".../tensorflow_decision_forests/keras/core.py", line 1257, in fit
    return self._fit_implementation(
  File ".../tensorflow_decision_forests/keras/core.py", line 1614, in _fit_implementation
    self._train_model(cluster_coordinator=coordinator)
  File ".../tensorflow_decision_forests/keras/core.py", line 2090, in _train_model
    tf_core.train(
  File ".../tensorflow_decision_forests/tensorflow/core.py", line 568, in train
    training_op.SimpleMLCheckStatus(process_id=process_id) == 1
  File ".../tensorflow/python/util/tf_export.py", line 413, in wrapper
    return f(**kwargs)
  File "<string>", line 1371, in simple_ml_check_status
  File ".../tensorflow/python/framework/ops.py", line 7262, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.UnknownError: {{function_node __wrapped__SimpleMLCheckStatus_device_/job:localhost/replica:0/task:0/device:CPU:0}} TensorFlow: INVALID_ARGUMENT: No defined default loss for this combination of label type and task [Op:SimpleMLCheckStatus]

rstz commented 1 year ago

Just as a quick update: a tutorial is in the works, and we've also worked on improving the error messages for the next version.

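Your example does not work because

- The uplift task is not supported by GradientBoostedTreesModel (hence the "No defined default loss" error); use RandomForestModel instead.
- The treatment column needs to be a 0 or 1 variable (does not have treatment or has treatment).

Your example would therefore work when changing it to: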

import tensorflow_decision_forests as tfdf
import pandas as pd

# Column a: numerical outcome (label); column b: treatment (0 or 1); column c: feature.
df = pd.DataFrame(
    [
        [0.1, 1, 0.3],
        [0.4, 1, 0.6],
        [0.7, 0, 0.9],
        [1.0, 0, 0.9],
    ],
    columns=list("abc"))
ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    df,
    label="a",
    task=tfdf.keras.Task.NUMERICAL_UPLIFT)
# Uses a Random Forest; the original attempt used GradientBoostedTreesModel.
model = tfdf.keras.RandomForestModel(
    task=tfdf.keras.Task.NUMERICAL_UPLIFT,
    uplift_treatment="b")
model.fit(ds)

But stay tuned for a full tutorial :)

vitorsrg commented 1 year ago

Hi, thanks for the reply

> The treatment column needs to be a 0 or 1 variable (does not have treatment or has treatment)

I was expecting NUMERICAL and CATEGORICAL uplift to have continuous and categorical/discrete treatment respectively. It would be helpful to have their differences and use cases made explicit in the tutorial, then.
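
Editorial sketch: the 0-or-1 requirement on the treatment column quoted above can be checked before calling `model.fit`. The helper below is hypothetical (`check_binary_treatment` is not part of TF-DF) and only assumes the treatment values can be iterated over, for example a plain list or a pandas column such as `df["b"]`.

```python
def check_binary_treatment(values):
    """Raise ValueError unless every treatment value is 0 or 1.

    Hypothetical helper for illustration; it is not a TF-DF function.
    """
    # Collect the distinct values that are neither 0 nor 1 (1.0 == 1 passes).
    unexpected = sorted(v for v in set(values) if v not in (0, 1))
    if unexpected:
        raise ValueError(
            f"treatment must be 0 or 1, found other values: {unexpected}")


# Treatment column "b" from the original report: continuous values, rejected.
try:
    check_binary_treatment([0.2, 0.5, 0.8])
    original_ok = True
except ValueError as err:
    original_ok = False
    print(f"original data rejected: {err}")

# Treatment column "b" from the corrected example: accepted.
check_binary_treatment([1, 1, 0, 0])
print("corrected data accepted")
```

Running such a check first gives a readable message instead of the lower-level training error shown in the traceback.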

rstz commented 1 year ago

> I was expecting NUMERICAL and CATEGORICAL uplift to have continuous and categorical/discrete treatment respectively

That's good to know. I've added a paragraph to the tutorial about the difference: Numerical and Categorical specify the type of outcome, not the type of treatment, in the problem.

The tutorial is now available in documentation/tutorials/uplift_colab.ipynb and will be available on the TensorFlow website once that's updated (in a few days?). Happy to hear feedback!
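
Editorial sketch: to make the outcome-versus-treatment distinction from this thread concrete, the snippet below computes a plain difference in mean outcome between treated and control rows. This is only an illustration of what "uplift" measures, not how TF-DF's Random Forest produces its per-example predictions. The function name `naive_uplift` and the binary outcome values are assumptions made up for this example. In both cases the treatment stays 0/1; only the outcome type changes.

```python
from statistics import mean


def naive_uplift(outcomes, treatments):
    """Mean outcome of treated rows (1) minus mean outcome of control rows (0)."""
    treated = [y for y, t in zip(outcomes, treatments) if t == 1]
    control = [y for y, t in zip(outcomes, treatments) if t == 0]
    return mean(treated) - mean(control)


# Binary treatment, as in column "b" of the corrected example.
treatments = [1, 1, 0, 0]

# Numerical outcome: column "a" of the corrected example (NUMERICAL_UPLIFT case).
numerical_outcome = [0.1, 0.4, 0.7, 1.0]
numerical_uplift = naive_uplift(numerical_outcome, treatments)

# Binary outcome: made-up values for illustration (CATEGORICAL_UPLIFT case).
binary_outcome = [1, 0, 0, 0]
categorical_uplift = naive_uplift(binary_outcome, treatments)

print(numerical_uplift, categorical_uplift)
```

With a numerical outcome the result is a difference in averages; with a binary outcome it is a difference in positive rates.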