pytorch / captum

Model interpretability and understanding for PyTorch
https://captum.ai
BSD 3-Clause "New" or "Revised" License

Migrating from captum 0.1.0 to 0.2.0: expand_additional_forward_args #375

Closed tsKenneth closed 4 years ago

tsKenneth commented 4 years ago

Hi all, I recently ran into trouble while migrating from Captum 0.1.0 to 0.2.0. When calling the attribution function, I pass a sparse tensor as an additional forward argument.

However, in 0.2.0 this tensor gets expanded (in _expand_additional_forward_args). My argument, originally of size [64, 64], becomes [128, 64], which causes my forward function to fail because it expects a [64, 64] tensor for matrix multiplication. I see that the same expansion is applied to the input tensor as well. Is there a way to disable this behavior, or is there a workaround? Thanks for your help.

vivekmig commented 4 years ago

Hi @tsKenneth, you should be able to work around this issue with a wrapper function along these lines:

# Pass the sparse tensor inside a dict so that Captum does not expand it.
def wrapper_model(inp, add_arg_dict):
    return original_model(inp, add_arg_dict["sparse_arg"])

Attribution with a method like Integrated Gradients would then be something like:

from captum.attr import IntegratedGradients

ig = IntegratedGradients(wrapper_model)
attr = ig.attribute(inp, additional_forward_args={"sparse_arg": sparse_inp})

For DeepLift, the forward function must be a model (nn.Module) itself so that hooks can be applied to its submodules, so you would need to do something similar with a wrapper model instead:

import torch.nn as nn

class WrapperModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = OriginalModel()

    def forward(self, inp, add_arg_dict):
        # The sparse tensor travels inside the dict, so it is not expanded.
        return self.model(inp, add_arg_dict["sparse_arg"])
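
A minimal usage sketch under the same assumptions (inp and sparse_inp are the placeholder tensors from above):

from captum.attr import DeepLift

wrapped_model = WrapperModel()
dl = DeepLift(wrapped_model)
attr = dl.attribute(inp, additional_forward_args={"sparse_arg": sparse_inp})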

This solution avoids the expansion of additional forward args by providing them within a dictionary: only Tensor additional arguments get expanded, while any other Python object passed as an additional arg is left unchanged (note that tensors inside the dictionary are also not expanded).

I think DeepLift is the only method whose behavior changed in this respect between Captum 0.1.0 and 0.2.0, so I assume you're using DeepLift? If so, the functionality has switched to passing inputs and baselines concatenated in a single batch rather than processing them as two separate batches. Because of this, it would be good to ensure that the output of your model when the inputs and baselines are concatenated into one batch (with the additional sparse tensor) matches the concatenation of the outputs of processing the inputs and baselines separately.
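
As a rough check, a sketch using the placeholder names from above (baseline being the zero tensor used as the DeepLift baseline) would be:

import torch

# The concatenated batch should produce the same outputs as processing
# the inputs and baselines separately.
out_concat = wrapped_model(torch.cat([inp, baseline]), {"sparse_arg": sparse_inp})
out_separate = torch.cat([
    wrapped_model(inp, {"sparse_arg": sparse_inp}),
    wrapped_model(baseline, {"sparse_arg": sparse_inp}),
])
assert torch.allclose(out_concat, out_separate)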

tsKenneth commented 4 years ago

@vivekmig Yes, I am using DeepLIFT. I don't fully understand how this functionality should be used; could I trouble you to elaborate? I am currently using a zero tensor as the baseline for DeepLIFT.

I have tried the workaround with the wrapper functions, but Captum still expands my input tensor, so that does not work in my case.

Thank you for your help.

vivekmig commented 4 years ago

Hi @tsKenneth, here's an example of the difference in functionality. Consider a model that takes an input of size Nx3 and produces an output of size Nx2 (batch size N). With a simple 2x3 input, when running DeepLift with the approach implemented in 0.1.0, the model was first run on the input, and the layer activations were stored:

input = [[1.0, 2.0, 3.0],    ->    output = [[0.9, 0.1],
         [4.0, 5.0, 6.0]]                    [0.2, 0.8]]

The model was then processed again on the baseline (defaults to zero):

input = [[0.0, 0.0, 0.0],    ->    output = [[0.5, 0.5],
         [0.0, 0.0, 0.0]]                    [0.5, 0.5]]

The activation differences are used to compute the DeepLift attributions.

With the current 0.2.0 implementation, the inputs and baselines are concatenated into one batch and processed as follows:

input = [[1.0, 2.0, 3.0],    ->    output = [[0.9, 0.1],
         [4.0, 5.0, 6.0],                    [0.2, 0.8],
         [0.0, 0.0, 0.0],                    [0.5, 0.5],
         [0.0, 0.0, 0.0]]                    [0.5, 0.5]]

The intermediate layer activations are internally split into two halves corresponding to the inputs and baselines. This change was made to more easily support DataParallel models. Because the batch size is now double the original, additional forward args that are tensors are expanded by default to double their size to match the larger batch. This can be avoided by putting the additional forward args inside a non-tensor object, which is not modified.
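
For illustration (a conceptual sketch, not Captum's internal code), the expansion effectively repeats a tensor argument along the batch dimension:

import torch

# A [64, 64] tensor additional arg ends up as [128, 64] once repeated
# to match the doubled (inputs + baselines) batch.
arg = torch.randn(64, 64)
expanded = torch.cat([arg, arg], dim=0)
print(expanded.shape)  # torch.Size([128, 64])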

Still, to work with the new approach, the output of the concatenated batch shown here should be the same as concatenating the outputs of processing the inputs and baselines separately, as shown above.

Hope this example helps! If you can provide more details on the input structure of your model or a code example / Colab notebook, we can try to help further.

tsKenneth commented 4 years ago

Hi @vivekmig. Thank you for the detailed explanation.

I am currently running a graph convolutional neural network, which predicts whether a graph belongs to class 0 or class 1.

The graph convolutional neural network takes an input feature map X of size n x m, where n is the number of nodes and m is the feature dimension per node. Most of my datasets only have node labels, which are processed into a one-hot embedding to act as the node features; hence m is the number of unique node labels.

My additional forward argument is an adjacency matrix used to perform the graph convolution; it is a sparse matrix of size n x n. Given the current implementation of DeepLIFT, appending a zero tensor to my feature map X along dim=0 would work, but simply expanding the adjacency matrix along dim=0 into a sparse matrix of size 2n x n would not, because the inputs and baselines are technically two separate graphs. As a compromise, I would have to expand the adjacency matrix diagonally into a sparse matrix of size 2n x 2n, as sketched below.
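
A rough sketch of that diagonal expansion (assuming adj is a sparse COO tensor of size n x n; the helper name is made up):

import torch

def block_diag_sparse(adj):
    # Place two copies of the n x n adjacency matrix on the diagonal of a
    # 2n x 2n sparse matrix, keeping the input and baseline graphs disconnected.
    adj = adj.coalesce()
    n = adj.size(0)
    indices = torch.cat([adj.indices(), adj.indices() + n], dim=1)
    values = torch.cat([adj.values(), adj.values()])
    return torch.sparse_coo_tensor(indices, values, (2 * n, 2 * n))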

Moreover, the rest of the model would treat the appended zero tensor as part of the original graph for which I am calculating attribution scores; the layer before the softmax output still produces logits of size 1 x number_of_classes instead of the 2 x number_of_classes that Captum requires. I also have additional novel layers that sort all nodes according to their node features, which could be misleading if we simply append the zero tensor.

I don't see an easy workaround for DeepLIFT in 0.2.0, but I require some of the additional features 0.2.0 provides, such as Layer GradCAM. Hence, I think the best way forward would be to switch between versions?

vivekmig commented 4 years ago

Hi @tsKenneth, that makes sense; I see why the new DeepLift implementation structure doesn't work for your model. For now, using the implementation from 0.1.0 is likely the best option. We will consider whether there's a way to still support the original approach in future versions.

To avoid version switching for Layer GradCAM, you could install Captum from GitHub at a commit that still contains the old DeepLift approach but already includes Layer GradCAM. In particular, I think commit c829ee5 is the last one before the DeepLift change. The main Captum README has instructions for installing Captum from GitHub source; the same steps should work if you check out this commit before installing.

Also, your use case would likely be supported by algorithms that have the internal_batch_size argument, such as Integrated Gradients, even though they also expand the input internally. Setting this argument equal to the original input batch size (n) would chunk the expanded input appropriately. We currently don't have this for DeepLift, but if you're interested in testing with those algorithms, it may be helpful.
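
For example, a rough sketch with the wrapper function from above (inp of size n x m and sparse_inp of size n x n are placeholders):

from captum.attr import IntegratedGradients

ig = IntegratedGradients(wrapper_model)
attr = ig.attribute(
    inp,
    additional_forward_args={"sparse_arg": sparse_inp},
    internal_batch_size=inp.shape[0],  # n, the original batch (node) size
)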

tsKenneth commented 4 years ago

I have overcome this obstacle by switching versions, since the repository I am using Captum with needs to be easy to set up via pip. I look forward to support for the internal_batch_size argument in the future. Thank you @vivekmig for your help.