py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/
Other
3.84k stars 718 forks source link

DeepIV Value Overflow #60

Open ginward opened 5 years ago

ginward commented 5 years ago

The following is a slightly modification to the DeepIV notebook, but it no longer works. I guess it is because of the fact that I increased the max value of the X and T, which caused the overflow somewhere:

from econml.deepiv import DeepIVEstimator
from econml.bootstrap import BootstrapEstimator
import keras
import numpy as np
import matplotlib.pyplot as plt

get_ipython().run_line_magic('matplotlib', 'inline')

# ## Synthetic data
# 
# To demonstrate the Deep IV approach, we'll construct a synthetic dataset obeying the requirements set out above.  In this case, we'll take `X`, `Z`, `T` to come from the following distribution: 

# In[ ]:

# Initialize exogenous variables; normal errors, uniformly distributed covariates and instruments
e = np.random.normal(size=(445,1))
x = np.random.uniform(low=0.0, high=1000, size=(445,1))
z = np.random.uniform(low=0.0, high=1000, size=(445,1))

# Initialize treatment variable
t = np.sqrt((x+2) * z) + e

# Show the marginal distribution of t
plt.hist(t)
plt.xlabel("t")
plt.show()

plt.scatter(z[x < 1], t[x < 1], label='low X')
plt.scatter(z[(x > 4.5) * (x < 5.5)], t[(x > 4.5) * (x < 5.5)], label='moderate X')
plt.scatter(z[x > 9], t[x > 9], label='high X')
plt.legend()
plt.xlabel("z")
plt.ylabel("t")
plt.show()

# Here, we'll imagine that `Z` and `X` are causally affecting `T`; as you can see in the plot above, low or high values of `Z` drive moderate values of `T` and moderate values of `Z` cause `T` to have a bi-modal distribution when `X` is high, but a unimodal distribution centered on 0 when `X` is low.  The instrument is positively correlated with the treatment and treatments tend to be bigger at high values of x. The instrument has higher power at higher values of x 

# In[ ]:

# Outcome equation 
y = t*t / 10 - x*t / 10 + e

# The endogeneity problem is clear, the latent error enters both treatment and outcome equally
plt.scatter(t,z, label ='raw data')
tticks = np.arange(-2,12)
yticks2 = tticks*tticks/10 - 0.2 * tticks
yticks5 = tticks*tticks/10 - 0.5 * tticks
yticks8 = tticks*tticks/10 - 0.8 * tticks
plt.plot(tticks,yticks2, 'r--', label = 'truth, x=2')
plt.plot(tticks,yticks5, 'g--', label = 'truth, x=5')
plt.plot(tticks,yticks8, 'y--', label = 'truth, x=8')
plt.xlabel("t")
plt.ylabel("y")
plt.legend()
plt.show()

# `Y` is a non-linear function of `T` and `X` with no direct dependence on `Z` plus additive noise (as required).  We want to estimate the effect of particular `T` and `X` values on `Y`.
# 
# The plot makes it clear that looking at the raw data is highly misleading as to the treatment effect. Moreover the treatment effects are both non-linear and heterogeneous in x, so this is a hard problem!
# 
# ## Defining the neural network models
# 
# Now we'll define simple treatment and response models using the Keras `Sequential` model built up of a series of layers.  Each model will have an `input_shape` of 2 (to match the sums of the dimensions of `X` plus `Z` in the treatment case and `T` plus `X` in the response case).

# In[ ]:

treatment_model = keras.Sequential([keras.layers.Dense(128, activation='relu', input_shape=(2,)),
                                    keras.layers.Dropout(0.17),
                                    keras.layers.Dense(64, activation='relu'),
                                    keras.layers.Dropout(0.17),
                                    keras.layers.Dense(32, activation='relu'),
                                    keras.layers.Dropout(0.17)])

response_model = keras.Sequential([keras.layers.Dense(128, activation='relu', input_shape=(2,)),
                                   keras.layers.Dropout(0.17),
                                   keras.layers.Dense(64, activation='relu'),
                                   keras.layers.Dropout(0.17),
                                   keras.layers.Dense(32, activation='relu'),
                                   keras.layers.Dropout(0.17),
                                   keras.layers.Dense(1)])

# Now we'll instantiate the `DeepIVEstimator` class using these models.  Defining the response model *outside* of the lambda passed into constructor is important, because (depending on the settings for the loss) it can be used multiple times in the second stage and we want the same weights to be used every time.

# In[ ]:

keras_fit_options = { "epochs": 30,
                      "validation_split": 0.1,
                      "callbacks": [keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)]}

deepIvEst = DeepIVEstimator(n_components = 10, # number of gaussians in our mixture density network
                            m = lambda z, x : treatment_model(keras.layers.concatenate([z,x])), # treatment model
                            h = lambda t, x : response_model(keras.layers.concatenate([t,x])),  # response model
                            n_samples = 1, # number of samples to use to estimate the response
                            use_upper_bound_loss = False, # whether to use an approximation to the true loss
                            n_gradient_samples = 1, # number of samples to use in second estimate of the response (to make loss estimate unbiased)
                            optimizer='adam', # Keras optimizer to use for training - see https://keras.io/optimizers/ 
                            first_stage_options = keras_fit_options, # options for training treatment model
                            second_stage_options = keras_fit_options) # options for training response model

# ## Fitting and predicting using the model
# Now we can fit our model to the data:

# In[ ]:

boot_est = BootstrapEstimator(deepIvEst, n_bootstrap_samples=2, n_jobs=1)
boot_est.fit(Y=y,T=t,X=x,Z=z)

# And now we can create a new set of data and see whether our predicted effect matches the true effect `T*T-X*X`:

# In[ ]:

n_test = 500
for i, x_enum in enumerate([2, 5, 8]):
    t = np.linspace(0,10,num = 100)
    y_true = t*t / 10 - x_enum*t/10
    y_pred = boot_est.predict(t, np.full_like(t, x_enum))
    plt.plot(t, y_true, label='true y, x={0}'.format(x_enum),color='C'+str(i))
    plt.plot(t, y_pred, label='pred y, x={0}'.format(x_enum),color='C'+str(i),ls='--')
plt.xlabel('t')
plt.ylabel('y')
plt.legend()
plt.show()

# You can see that despite the fact that the response surface varies with x, our model was able to fit the data reasonably well. Where is does worst is where the instrument has the least power, which is in the low x case.  There it fits a straight line rather than a quadratic, which suggests that the regularization at least is perfoming well.  

# ## Estimating the Confidence Interval of the Model

# In[ ]:

treatment_effect_interval = boot_est.marginal_effect_interval(X=x, T=t, lower=1, upper=99)

Could you help and investigate why?

ginward commented 5 years ago

The Error output is:

WARNING:tensorflow:From /Users/jinhuawang/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 400 samples, validate on 45 samples
Epoch 1/30
400/400 [==============================] - 1s 1ms/step - loss: nan - val_loss: nan
Epoch 2/30
400/400 [==============================] - 0s 57us/step - loss: nan - val_loss: nan
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-9c6d6ed20ebe> in <module>
      1 boot_est = BootstrapEstimator(deepIvEst, n_bootstrap_samples=2, n_jobs=1)
----> 2 boot_est.fit(Y=y,T=t,X=x,Z=z)

~/Dropbox (Cambridge University)/EconML/econml/bootstrap.py in fit(self, *args, **named_args)
     56         Parallel(n_jobs=self._n_jobs, prefer='threads', verbose=3)(
     57             (obj.fit, [arg[inds] for arg in args], {arg: named_args[arg][inds] for arg in named_args})
---> 58             for obj, inds in zip(self._instances, indices)
     59         )
     60         return self

~/anaconda3/lib/python3.6/site-packages/joblib-0.13.2-py3.6.egg/joblib/parallel.py in __call__(self, iterable)
    919             # remaining jobs.
    920             self._iterating = False
--> 921             if self.dispatch_one_batch(iterator):
    922                 self._iterating = self._original_iterator is not None
    923 

~/anaconda3/lib/python3.6/site-packages/joblib-0.13.2-py3.6.egg/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

~/anaconda3/lib/python3.6/site-packages/joblib-0.13.2-py3.6.egg/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

~/anaconda3/lib/python3.6/site-packages/joblib-0.13.2-py3.6.egg/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

~/anaconda3/lib/python3.6/site-packages/joblib-0.13.2-py3.6.egg/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

~/anaconda3/lib/python3.6/site-packages/joblib-0.13.2-py3.6.egg/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

~/anaconda3/lib/python3.6/site-packages/joblib-0.13.2-py3.6.egg/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

~/Dropbox (Cambridge University)/EconML/econml/deepiv.py in fit(self, Y, T, X, Z)
    329         model.compile(self._optimizer)
    330         # TODO: do we need to give the user more control over other arguments to fit?
--> 331         model.fit([Z, X, T], [], **self._first_stage_options)
    332 
    333         lm = response_loss_model(lambda t, x: self._h(t, x),

~/anaconda3/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1037                                         initial_epoch=initial_epoch,
   1038                                         steps_per_epoch=steps_per_epoch,
-> 1039                                         validation_steps=validation_steps)
   1040 
   1041     def evaluate(self, x=None, y=None,

~/anaconda3/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
    215                         for l, o in zip(out_labels, val_outs):
    216                             epoch_logs['val_' + l] = o
--> 217         callbacks.on_epoch_end(epoch, epoch_logs)
    218         if callback_model.stop_training:
    219             break

~/anaconda3/lib/python3.6/site-packages/keras/callbacks.py in on_epoch_end(self, epoch, logs)
     77         logs = logs or {}
     78         for callback in self.callbacks:
---> 79             callback.on_epoch_end(epoch, logs)
     80 
     81     def on_batch_begin(self, batch, logs=None):

~/anaconda3/lib/python3.6/site-packages/keras/callbacks.py in on_epoch_end(self, epoch, logs)
    555                         print('Restoring model weights from the end of '
    556                               'the best epoch')
--> 557                     self.model.set_weights(self.best_weights)
    558 
    559     def on_train_end(self, logs=None):

~/anaconda3/lib/python3.6/site-packages/keras/engine/network.py in set_weights(self, weights)
    502         for layer in self.layers:
    503             num_param = len(layer.weights)
--> 504             layer_weights = weights[:num_param]
    505             for sw, w in zip(layer.weights, layer_weights):
    506                 tuples.append((sw, w))

TypeError: 'NoneType' object is not subscriptable
ginward commented 5 years ago

It will work if I set:

restore_best_weights=False

ginward commented 5 years ago

But it will give NaN value if I set restore_best_weights=True, which I guess it is not technically working ...

ginward commented 5 years ago

It seems that scaling down the value will work:

x = np.random.uniform(low=0.0, high=1000.0, size=(445,1))
x=x/100
z = np.random.uniform(low=0.0, high=1000.0, size=(445,1))
z=z/100
ginward commented 5 years ago

So I guess there is a limit on the maximum of the value for the Deep IV module?

kbattocchi commented 5 years ago

Thanks for reporting this - I'll try to take a closer look tomorrow.

yl3832 commented 4 years ago

I'm also getting NaNs, has this issues been resolved?

kbattocchi commented 4 years ago

@yl3832 When I investigated this previously, it appeared to be dependent on the the network architecture and initialization (e.g. the weights need to have lower variance as additional layers are added). Unfortunately I don't see any easy way to avoid this since we let the user specify the network and some choices may lead to issues like exploding gradients.

yl3832 commented 4 years ago

@kbattocchi Thank you Keith, I did tried different architectures in my use-case and it sometimes works. Also, normalization always helps.