Closed durrantmm closed 4 years ago
Hi @durrantmm,
Do you think it makes sense to shuffle all your input modes, or just shuffle some input modes while keeping others fixed at their original value? The former would highlight important positions in all the input modes, while the latter would only highlight important positions in the input modes that are shuffled.
Second, is your model architecture compatible with the original DeepLIFT implementation, or would you likely have to use DeepSHAP (an extension of DeepLIFT implemented in the shap repo that is compatible with more architectures)? If you are willing to paste your architecture here, I can tell you the answer.
Hi Avanti, thanks for the quick response. Shuffling some inputs and not others is an interesting idea that I would have to think about, but for now, I think shuffling all three inputs should suffice.
Happy to share my architecture:
When I try to load it using the kc.convert_model_from_saved_files function, I get the error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-20-889e5f2d21de> in <module>()
10 h5_file=keras_model_weights,
11 json_file=keras_model_json,
---> 12 nonlinear_mxts_mode=nonlinear_mxts_mode)
21 frames
/usr/local/lib/python3.6/dist-packages/deeplift/layers/core.py in _compute_shape(self, input_shape)
611 assert len(set(lengths_for_that_dim))==1,\
612 "lengths for dim "+str(dim_idx)\
--> 613 +" should be the same, got: "+str(lengths_for_that_dim)
614 shape.append(lengths_for_that_dim[0])
615 else:
AssertionError: lengths for dim 1 should be the same, got: [2176, 704, 88]
Any updates on this?
Yeah, I am planning to send you some sample code but haven’t found the bandwidth to put it together yet. I plan to get to it over the weekend. Thanks for the reminder!
@durrantmm here's an example notebook showing how to get explanations using the DeepSHAP implementation of DeepLIFT and your model architecture: https://github.com/AvantiShri/colab_notebooks/blob/master/misc_examples/Example_of_Multiple_Sequence_Input_Modes_With_DeepSHAP.ipynb
Let me know if it satisfies your needs. I've put comments in to explain the less intuitive parts of the notebook. The core idea is that you have to define a function that will generate the set of references to use for each example.
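For a rough idea of what such a reference-generating function can look like, here is a minimal sketch for a single example. The function name, the 20-reference count, and the use of a plain per-position shuffle (instead of the dinucleotide shuffle the notebook uses to preserve dinucleotide frequencies) are all illustrative assumptions, not the notebook's exact code:

```python
import numpy as np

def shuffle_several_times(input_modes, n_refs=20, seed=0):
    # input_modes: list with one (length, 4) one-hot array per input mode,
    # for a single example. Returns a list of (n_refs, length, 4) arrays of
    # independently shuffled references, one array per mode.
    # NOTE: this is a plain row-permutation shuffle; the notebook uses a
    # dinucleotide shuffle to preserve dinucleotide frequencies.
    rng = np.random.RandomState(seed)
    refs = []
    for seq in input_modes:
        refs.append(np.array([seq[rng.permutation(len(seq))]
                              for _ in range(n_refs)]))
    return refs

# Example: one 153-bp one-hot sequence mode
seq = np.eye(4)[np.random.RandomState(1).choice(4, 153)]
refs = shuffle_several_times([seq])
print(refs[0].shape)  # (20, 153, 4)
```

Each shuffled reference is a permutation of the original rows, so the base composition of every reference matches the original sequence.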
DeepSHAP does not implement the RevealCancel rule (it just uses DeepLIFT's Rescale rule for all the nonlinearities in your architecture) - however, in my internal benchmarking on genomic data, I actually don't see a large difference from using the RevealCancel rule when shuffled sequences are used for the reference (RevealCancel seems to help when using a constant reference, but the constant reference tends to do worse than a shuffled reference).
One downside of the DeepSHAP implementation of DeepLIFT is that every batch only consists of the references for one example, so the batching might not be the most efficient. The dinucleotide shuffling might also be a bottleneck - I put some comments in the notebook with thoughts of how it might be sped up.
Let me know if anything is unclear!
Fantastic, thank you!
I got it working with my model and data, again thank you very much for this.
Do you have advice for using this with TFModisco? Do you suspect I'll be able to figure that out pretty easily?
Hi @durrantmm, usually people are able to figure out TF-MoDISco by following this notebook: https://github.com/kundajelab/tfmodisco/blob/master/examples/simulated_TAL_GATA_deeplearning/TF_MoDISco_TAL_GATA.ipynb
But do let me know if you run into any issues!
Hi, thanks for the information. I have successfully calculated contribution scores, it's working great! I've been trying to get hypothetical scores for my data. I have been using this notebook in your fork of the shap repository as a guide.
It doesn't seem like it's compatible with the dimensions of my data. Here is the function that I have been trying to modify:
#This combine_mult_and_diffref function can be used to generate hypothetical
# importance scores for one-hot encoded sequence.
#Hypothetical scores can be thought of as quick estimates of what the
# contribution *would have been* if a different base were present. Hypothetical
# scores are used as input to the importance score clustering algorithm
# TF-MoDISco (https://github.com/kundajelab/tfmodisco)
# Hypothetical importance scores are discussed more in this pull request:
# https://github.com/kundajelab/deeplift/pull/36
import numpy as np

def combine_mult_and_diffref(mult, orig_inp, bg_data):
    to_return = []
    for l in range(len(mult)):
        projected_hypothetical_contribs = np.zeros_like(bg_data[l]).astype("float")
        assert len(orig_inp[l].shape)==2
        #At each position in the input sequence, we iterate over the one-hot encoding
        # possibilities (eg: for genomic sequence, this is ACGT i.e.
        # 1000, 0100, 0010 and 0001) and compute the hypothetical
        # difference-from-reference in each case. We then multiply the hypothetical
        # differences-from-reference with the multipliers to get the hypothetical contributions.
        #For each of the one-hot encoding possibilities,
        # the hypothetical contributions are then summed across the ACGT axis to estimate
        # the total hypothetical contribution of each position. This per-position hypothetical
        # contribution is then assigned ("projected") onto whichever base was present in the
        # hypothetical sequence.
        #The reason this is a fast estimate of what the importance scores *would* look
        # like if different bases were present in the underlying sequence is that
        # the multipliers are computed once using the original sequence, and are not
        # computed again for each hypothetical sequence.
        for i in range(orig_inp[l].shape[-1]):
            hypothetical_input = np.zeros_like(orig_inp[l]).astype("float")
            hypothetical_input[:,i] = 1.0
            hypothetical_difference_from_reference = (hypothetical_input[None,:,:]-bg_data[l])
            hypothetical_contribs = hypothetical_difference_from_reference*mult[l]
            projected_hypothetical_contribs[:,:,i] = np.sum(hypothetical_contribs,axis=-1)
        to_return.append(np.mean(projected_hypothetical_contribs,axis=0))
    return to_return
When I use this function in explainer.shap_values with my multi-input model, mult is a list of three arrays for my three input modes. If I just look at the first input mode, I have a 20x153x4 array, where 20 corresponds to the 20 shuffled references. These dimensions also hold for the bg_data parameter. But where things start to get confusing is with my orig_inp parameter, which is a list of three inputs for the three input modes, and the first entry is an array with dimension 153x4, the original input sequence.
The code as it is written above assumes that mult and orig_inp are the same shape, as with the lines:
for l in range(len(mult)):
    ...
    for i in range(orig_inp[l].shape[-1]):
        ...
    ...
But this is simply not the case with my model. I've been trying to modify the code to fit my own model, but I run into issues stemming from my poor understanding of how hypothetical contributions are calculated. I can keep working on this, but if you have some quick advice that would be much appreciated.
Hi, so l is an index into the input mode, i.e. len(mult) and len(orig_inp) should both be 3. I believe that’s the only thing assumed by the lines you excerpted. Can you let me know the error message you encounter when you try to use the function as-is?
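To make that shape contract concrete, here's a toy check using hypothetical data matching the shapes described above (3 input modes, 20 shuffled references per example); the function body is a condensed copy of the one from the thread with the comments stripped:

```python
import numpy as np

def combine_mult_and_diffref(mult, orig_inp, bg_data):
    # Condensed copy of the thread's function (comments removed).
    to_return = []
    for l in range(len(mult)):
        projected = np.zeros_like(bg_data[l]).astype("float")
        assert len(orig_inp[l].shape) == 2
        for i in range(orig_inp[l].shape[-1]):
            hyp_input = np.zeros_like(orig_inp[l]).astype("float")
            hyp_input[:, i] = 1.0
            hyp_diff = hyp_input[None, :, :] - bg_data[l]
            projected[:, :, i] = np.sum(hyp_diff * mult[l], axis=-1)
        to_return.append(np.mean(projected, axis=0))
    return to_return

# Toy data: 3 input modes, 20 shuffled references, made-up lengths
rng = np.random.RandomState(0)
lengths = [153, 50, 50]
mult = [rng.randn(20, L, 4) for L in lengths]           # multipliers
orig = [np.eye(4)[rng.choice(4, L)] for L in lengths]   # one-hot inputs
bg   = [np.array([o[rng.permutation(L)] for _ in range(20)])
        for o, L in zip(orig, lengths)]                 # shuffled references
hyp = combine_mult_and_diffref(mult, orig, bg)
# One (length, 4) array of hypothetical scores per input mode:
print([h.shape for h in hyp])  # [(153, 4), (50, 4), (50, 4)]
```

Note that mult and bg_data carry an extra leading references axis while orig_inp does not; the function only assumes the three lists have the same length.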
Oh wow, yeah it worked great, that makes much more sense, thank you!
Thank you for the notebook (: It helped me a lot as well. I have a follow-up question: my model has four inputs, three sequences and one integer input. The integer is concatenated to the sequences after they go through conv layers:
Layer (type) Output Shape Param # Connected to
input_2 (InputLayer) (None, 50, 4) 0
input_1 (InputLayer) (None, 2338, 4) 0
input_3 (InputLayer) (None, 50, 4) 0
conv1d_2 (Conv1D) (None, 50, 128) 3200 input_2[0][0]
conv1d_1 (Conv1D) (None, 2338, 256) 12544 input_1[0][0]
conv1d_3 (Conv1D) (None, 50, 128) 3200 input_3[0][0]
max_pooling1d_2 (MaxPooling1D) (None, 1, 128) 0 conv1d_2[0][0]
max_pooling1d_1 (MaxPooling1D) (None, 1, 256) 0 conv1d_1[0][0]
max_pooling1d_3 (MaxPooling1D) (None, 1, 128) 0 conv1d_3[0][0]
lambda_2 (Lambda) (None, 128) 0 max_pooling1d_2[0][0]
lambda_1 (Lambda) (None, 256) 0 max_pooling1d_1[0][0]
lambda_3 (Lambda) (None, 128) 0 max_pooling1d_3[0][0]
input_4 (InputLayer) (None, 1) 0
concatenate_1 (Concatenate) (None, 513) 0 lambda_2[0][0]
lambda_1[0][0]
lambda_3[0][0]
input_4[0][0]
dense_1 (Dense) (None, 32) 16448 concatenate_1[0][0]
activation_1 (Activation) (None, 32) 0 dense_1[0][0]
dense_2 (Dense) (None, 1) 33 activation_1[0][0]
activation_2 (Activation) (None, 1) 0 dense_2[0][0]
The unprojected contributions I receive for the integer input are all zeroes. Is there an explanation for this? Do the scores indicate something about the relationship between the inputs?
@Mirabar "The unprojected contributions I receive for the integer input are all zeroes" - is the reference/baseline value for the integer input set to be something different from its actual value? If the reference value is set to the actual value, then the contribution would be 0 by definition.
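Put differently, a DeepLIFT-style contribution is multiplier × (input − reference), so a reference equal to the actual input yields zero regardless of the multiplier. A toy illustration (all numbers made up):

```python
multiplier = 0.5   # hypothetical multiplier from the network
actual     = 5.0   # the integer input's actual value
# Reference equal to the actual value -> contribution is zero by definition
contribution_same_ref = multiplier * (actual - 5.0)
# Reference set to e.g. a dataset mean -> nonzero contribution
contribution_mean_ref = multiplier * (actual - 2.0)
print(contribution_same_ref, contribution_mean_ref)  # 0.0 1.5
```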
I've just realized that this is exactly what I was doing. Any suggestion on choosing a reference for an integer input that will work with providing shuffled references for the sequence inputs?
Hi @Mirabar, answering this question requires familiarity with the domain. The network will compute attribution scores for how deviations from this baseline impact the output, so think about what kinds of deviations are of interest in your application. Some suggestions: you might consider the mean value of the integer to be a good baseline, or the mean across some subset of "control" examples. Whatever value you choose for the baseline, as a sanity check you should look at the output the network gives when supplied that baseline and make sure that the output corresponds to what you would think of as a good "baseline" output. Does that help?
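One way to combine the two kinds of reference is sketched below: shuffled references for the sequence modes and a fixed baseline (e.g. a training-set mean) repeated for the integer mode. The function name, shapes, and the plain per-position shuffle (rather than a dinucleotide shuffle) are all illustrative assumptions:

```python
import numpy as np

def make_references(example, integer_baseline, n_refs=20, seed=0):
    # example: [seq1, seq2, seq3, integer_value] for a single example,
    # where each seq is a (length, 4) one-hot array.
    # Sequence modes get shuffled references; the integer mode gets the
    # chosen baseline (e.g. a training-set mean) repeated n_refs times.
    rng = np.random.RandomState(seed)
    refs = [np.array([seq[rng.permutation(len(seq))] for _ in range(n_refs)])
            for seq in example[:3]]
    refs.append(np.full((n_refs, 1), integer_baseline, dtype=float))
    return refs

# Hypothetical example matching the architecture's input shapes above
example = [np.eye(4)[np.random.RandomState(2).choice(4, L)]
           for L in (2338, 50, 50)] + [np.array([7.0])]
refs = make_references(example, integer_baseline=3.2)
print([r.shape for r in refs])
# [(20, 2338, 4), (20, 50, 4), (20, 50, 4), (20, 1)]
```

Per the sanity check suggested above, it's worth feeding this baseline combination through the network and confirming the output looks like a sensible "baseline" prediction.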
Thank you so much for your response! Gave me a lot to think about (:
Hello, I have trained a multi-input (all inputs are DNA sequences) model using Keras that I would like to analyze with DeepLIFT. I have > 200,000 examples, and I would like to use DeepLIFT on all of them. What do I need to do to analyze my model with DeepLIFT? What would be a good strategy for applying DeepLIFT to all of my examples with a shuffled reference? I would also like to search for motifs with TF-MoDISco at a future stage in my analysis if possible.