kundajelab / tfmodisco

TF MOtif Discovery from Importance SCOres
MIT License

Mean-normalize hypothetical contributions #113

Open FelixWaern opened 8 months ago

FelixWaern commented 8 months ago

I'm currently using TF-MoDISco and am considering mean-normalization, since one of the examples in the GitHub repository mentions that it is suspected to improve the results. I was just wondering in what way mean-normalization would improve the results, and why?

AvantiShri commented 8 months ago

Hi Felix,

The short answer is that if the hypothetical importance scores are not mean-normalized, it implies that a given position in the sequence carries net positive or net negative importance irrespective of which base may be present at that position. This could be considered misleading, because it implies that the importance scoring algorithm is treating the position like a "bias" term, assigning some baseline contribution to that position regardless of what base is there.

This "bias" effect is most obvious if using something like the gradients of a deep learning model as the algorithm for scoring the hypothetical importances; if a model has placed a negative gradient on all possible bases at a given position, it is literally using the position like a bias term, because it knows that at least one of the bases must be non-zero (assuming one-hot encoded input), and thus there will be some negative contribution from that position irrespective of which base is present. If you want to get rid of these bias-like effects, it's important to mean-normalize the gradients at that position before treating them like importance scores.

Which importance scoring algorithm are you using? I'm guessing it's gkmexplain, since I think that was the notebook in which I did the mean normalization? Importance scoring algorithms differ in how susceptible they are to showing these bias-like effects. In particular, deeplift with multiple shuffled references shouldn't show strong effects like this, provided that the scores have been projected onto the bases correctly. I bring this up because during my PhD I saw that it was very common for people to mess up the projecting-onto-bases step when using an algorithm like deeplift/deepshap, so I'm happy to give more details on that if you like. Here is a description from the current writeup of the tfmodisco paper (this is not directly answering your question about the mean normalization, but rather it discusses the importance of projecting the scores onto the bases correctly, which can be a source of poor-quality scores; if not you, then very likely some reader of this issue may benefit):

Calculation of “actual” importance scores

For a sequence of length L (where L is allowed to vary between the sequences), the "actual" importance scores are an array of shape L x 4, where the second dimension corresponds to the ACGT axis and at most one of the four elements is nonzero at each of the L positions (the element that is allowed to be nonzero corresponds to the base present at that position in the sequence; it is assumed that sequences containing ambiguous bases have been excluded for the purposes of motif discovery). Each nonzero entry represents the contribution of the corresponding base to the output: positive scores indicate the base appeared to be pushing the output higher, and negative scores indicate the base was pushing the output lower.

What are some methods that can be used to obtain this L x 4 importance score array? One of the simplest is to compute the gradient of the output with respect to the one-hot encoded input, and then multiply the gradients with the one-hot encoded input in order to mask out the gradients at bases that were not present in the original sequence. This is the "gradient x input" method, and to our knowledge it was first described in Shrikumar et al., 2016. Such an approach can be considered a special case of a family of methods that computes contribution scores of the form "multiplier x difference-of-input-from-reference"; in the case of "gradient x input", the "multiplier" is simply the gradient, and the "reference" is the all-zero input, which means that the difference-of-input-from-reference term becomes the same as the input. While easy to implement, this approach has limitations (as discussed in Shrikumar et al., 2017 and Prakash et al., 2022), and so other choices of the multiplier and/or reference are often desirable.
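As a rough illustration of "gradient x input" (a sketch only, not the tfmodisco API; it assumes a PyTorch model that maps a batch of one-hot encoded sequences of shape (N, L, 4) to one scalar output per sequence):

```python
import torch

def grad_times_input(model, onehot):
    # onehot: float tensor of shape (N, L, 4); model(onehot): tensor of shape (N,)
    onehot = onehot.clone().requires_grad_(True)
    model(onehot).sum().backward()   # gradients of the output w.r.t. the input
    grads = onehot.grad              # the "multipliers"; the reference here is all-zero
    # Multiplying by the one-hot input masks out gradients at bases that are not
    # present, leaving at most one nonzero entry per position.
    return (grads * onehot).detach()
```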

Computing the L x 4 actual importance score array when using a non-zero reference

When using an importance scoring method that computes contribution scores in terms of the difference of the input from a reference input, some care is needed when employing a reference that is not all-zero (which one may often wish to do for reasons highlighted in Prakash et al. 2022). The reason is that the contribution scores in such cases may be nonzero for more than one base at a given position. To understand why, note first that this situation does not occur when using an all-zero reference because the difference-from-reference when using an all-zero reference is automatically zero at bases encoded as zero. By contrast, if we suppose the actual one-hot encoding at a particular position is [1,0,0,0] (representing the base “A”), and the reference sequence at that position is [0,1,0,0] (representing the base “C”; if the use of the concrete base “C” for the reference appears odd to the reader, note that when non-zero references are used, importance scores are typically computed using many different reference sequences and averaged), then the difference-from-reference will be [1, -1, 0, 0], corresponding to the presence of “A” and the absence of “C”. Thus, the -1 representing the “absence of C” can be assigned a contribution score, just as the 1 indicating the presence of A can be assigned a contribution score.

How can we then meaningfully convert an L x 4 array that may have multiple non-zero entries at each of the L positions into an L x 4 array that is non-zero at no more than one of the four elements per position? In this work, we take the approach of simply adding up the contribution scores of all four elements at each position and projecting this sum onto the single base that is actually present in the sequence at that position. The justification for this approach, to build on the example above, is that the presence of "A" at a given position automatically entails the absence of "C". Concretely, if the importance-scoring method assigned the multipliers [a, b, c, d] to the position discussed above, the raw contribution scores (computed as multipliers x difference-of-input-from-reference) would be [a, -b, 0, 0], and we would represent them as [a - b, 0, 0, 0] to achieve our desired representation, which is zero at all but one of the four elements.
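A minimal sketch of this summing-and-projection step (assuming `raw_contribs` holds multipliers x difference-of-input-from-reference and `onehot` is the one-hot encoded sequence, both L x 4 NumPy arrays; the names are illustrative):

```python
import numpy as np

def project_onto_bases(raw_contribs, onehot):
    # Sum the contributions of all four elements at each position...
    summed = raw_contribs.sum(axis=-1, keepdims=True)   # shape (L, 1)
    # ...and place that sum on the single base actually present in the sequence.
    return summed * onehot                               # shape (L, 4)

# Worked example from the text: multipliers [a, b, c, d], reference [0, 1, 0, 0],
# input [1, 0, 0, 0] -> raw contributions [a, -b, 0, 0] -> projected [a - b, 0, 0, 0].
```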

Calculation of hypothetical importance scores

The purpose of the hypothetical importance scores is to provide additional information on what type of patterns the model is looking for when it scans the sequence. For a sequence of length L, the hypothetical importance scores are provided as an array of shape L x 4 where all entries are allowed to be nonzero.

What are good candidates to use for hypothetical importance scores? If the “actual” importance scores were computed using the “gradient x input” method (a special case of the “multiplier x difference-of-input-from-reference” family where the multipliers are the gradients and the reference is all-zero), then “hypothetical” importance scores could be obtained by simply using the gradient w.r.t. the one-hot encoded input. Concretely, imagine we have a prediction task where a motif of the form GAT[T/A]A is relevant, and imagine the 5-mer “GATAA” is present in the input sequence and is recognized as being likely bound by the model; we would expect the gradients to highlight both the “T” and the “A” at the fourth position in the 5-mer, even though the actual importance scores would only highlight “A” (as “A” is the base that is actually present in the sequence). During clustering, this would help group GATTA instances with GATAA instances rather than splitting them into two separate motifs. It is also possible to obtain hypothetical importance scores when using gapped-k-mer support vector machines as the model, as described in section 5.3 of Shrikumar et al., 2019.
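As a toy numerical illustration of that example (the gradient values below are made up; the axis ordering is A, C, G, T), the unmasked gradient serves as the hypothetical score while the masked gradient gives the actual score at the fourth position of "GATAA":

```python
import numpy as np

# Made-up gradients at the fourth position of "GATAA" for a model that
# recognizes the motif GAT[T/A]A: both A and T receive positive gradients.
grads_at_pos = np.array([0.8, -0.1, -0.2, 0.9])    # A, C, G, T
onehot_at_pos = np.array([1.0, 0.0, 0.0, 0.0])     # the "A" actually present

hyp_score = grads_at_pos                     # highlights both A and T
actual_score = grads_at_pos * onehot_at_pos  # highlights only the A that is present
```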

If a user does not have access to hypothetical importance scores, it is possible to simply provide the “actual” importance scores in place of the hypothetical importance scores, but the user should anticipate that this can result in motif clusters that are redundant in the sense that they are bound by the same transcription factor but differ at a single base.

Computing the L x 4 hypothetical importance score array when using a non-zero reference

In the more general case where the reference used is not all-zero, computing the hypothetical importance scores is not a simple matter of taking the multiplier term, for reasons analogous to those discussed above in the section "computing the L x 4 actual importance score array when using a non-zero reference". This is because the difference-of-input-from-reference term does not automatically reduce to zero at positions that have a value of zero in the input. Given that the purpose of the hypothetical importance scores is to provide additional information on which patterns the model was looking for - akin to an "autocomplete" of the motifs - we can estimate the hypothetical importance scores by (1) keeping the multipliers fixed to those that were computed for the original input, and then (2) calculating what the contribution of a base would have been if the "hypothetical" difference-of-input-from-reference corresponding to the presence of that base were used instead of the actual difference-of-input-from-reference.

We illustrate this procedure by revisiting our example from earlier, where the reference was [0, 1, 0, 0] and the multipliers were [a, b, c, d]. For the actual input of [1, 0, 0, 0], the difference-of-input-from-reference was computed as [1, -1, 0, 0], yielding raw contribution scores of [a, -b, 0, 0] that were ultimately represented as [a - b, 0, 0, 0]. By contrast, if we consider the hypothetical inputs of [0, 1, 0, 0], [0, 0, 1, 0] and [0, 0, 0, 1], the corresponding hypothetical differences-of-input-from-reference are [0, 0, 0, 0], [0, -1, 1, 0] and [0, -1, 0, 1], yielding hypothetical raw contribution scores of [0, 0, 0, 0], [0, -b, c, 0] and [0, -b, 0, d], which would ultimately be represented as [0, 0, 0, 0], [0, 0, c-b, 0] and [0, 0, 0, d-b]. Combining these representations, we find that the hypothetical contribution scores for a position with multipliers [a, b, c, d] and reference [0, 1, 0, 0] are [a-b, 0, c-b, d-b]. If the 0 at the second position appears odd to the reader, note that when non-zero references are used, importance scores are typically computed using many different reference sequences and averaged; thus, the resulting hypothetical importance scores are unlikely to contain zeros at any position unless that position itself has been deemed irrelevant to the output.
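A minimal sketch of this calculation (assuming `multipliers` and `reference` are L x 4 NumPy arrays computed elsewhere; the names are illustrative, not the tfmodisco API):

```python
import numpy as np

def hypothetical_contribs(multipliers, reference):
    # For each of the four bases, substitute the hypothetical
    # difference-of-input-from-reference corresponding to that base being
    # present, keep the multipliers fixed, and project the summed
    # contribution onto that base.
    L = multipliers.shape[0]
    hyp = np.zeros((L, 4))
    for b in range(4):
        hyp_onehot = np.zeros((L, 4))
        hyp_onehot[:, b] = 1.0
        hyp_diff = hyp_onehot - reference                 # hypothetical diff-from-ref
        hyp[:, b] = (multipliers * hyp_diff).sum(axis=-1)
    return hyp

# For multipliers [a, b, c, d] and reference [0, 1, 0, 0] at a given position,
# this yields [a-b, 0, c-b, d-b], matching the worked example above.
```

As the text notes, in practice one would typically repeat this computation for each of several reference sequences (using the multipliers computed against that reference) and average the results.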

Note that the approach described above keeps the multipliers the same as those for the original sequence even when considering hypothetical differences-of-input-from-reference. Given that the goal of the hypothetical importance scores is to reveal which patterns the model is searching for when it sees the original sequence, this is the appropriate approach to take, as it is more informative of the patterns that the model sees in the original sequence (a single-base substitution can radically change the motifs that appear to be present, so if we were to recompute the multipliers for the hypothetical inputs, they may start showing different motifs compared to those present in the original sequence).