Closed: AlexKnowsIt closed this issue 2 years ago
There was a similar issue #111 so feel free to check it out.
This AttentionCollapse
layer was mostly a result of me playing around, so there is no reference paper. It is not the notorious full-blown self-attention that is used in transformers.
First of all, it is useful to think of the input tensor x
of shape (n_channels, lookback, n_assets)
as a list of length (lookback * n_assets)
where each element is a 1D tensor of shape (n_channels,)
. The elements of this list are the so-called queries (q1, q2, ...).
We want to learn a key tensor (k) of shape (n_channels,)
. One can think of it as "the most interesting direction". We then take all our queries and check how similar they are to this key: the more similar they are to k, the more attention we want to pay to them. In other words, it measures how interesting and noteworthy a given (lookback, asset)
pair is. Finally, we average over the lookback dimension, giving higher weights to lookbacks that were interesting.
https://github.com/jankrepl/deepdow/blob/eb6c85845c45f89e0743b8e8c29ddb69cb78da4f/deepdow/layers/collapse.py#L49
Here we take a tensor of shape (n_assets, lookback, n_channels)
and transform it with a linear mapping. The transformed
tensor represents our queries.
Here we effectively compute how similar the queries are to the context_vector.weight
key (the forward pass of a torch.nn.Linear
without a bias is just a dot product) and we normalize these similarities with a softmax (making sure that if we fix an asset, summing over the lookback axis always gives 1)
https://github.com/jankrepl/deepdow/blob/eb6c85845c45f89e0743b8e8c29ddb69cb78da4f/deepdow/layers/collapse.py#L52
And finally, here we just get rid of the lookback
dimension with a weighted average, using the weights computed above.
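To make the three steps concrete, here is a minimal sketch of the idea for a single sample (a simplified reimplementation, not the exact deepdow source; the affine and context_vector names follow the linked code, everything else is illustrative):

```python
import torch
import torch.nn as nn

n_channels, lookback, n_assets = 4, 10, 3

# Learnable mappings: `affine` transforms the raw queries,
# `context_vector` holds the key k in its .weight attribute.
affine = nn.Linear(n_channels, n_channels)
context_vector = nn.Linear(n_channels, 1, bias=False)

x = torch.randn(n_channels, lookback, n_assets)  # one sample

res = x.permute(2, 1, 0)                 # (n_assets, lookback, n_channels)
queries = affine(res)                    # transformed queries, same shape
scores = context_vector(queries)         # dot products with k, (n_assets, lookback, 1)
attn = torch.softmax(scores, dim=1)      # weights sum to 1 over the lookback axis
collapsed = (attn * res).sum(dim=1)      # weighted average, (n_assets, n_channels)

print(collapsed.shape)  # torch.Size([3, 4])
```

Note that for a fixed asset the attention weights always sum to 1, so the collapse is a proper weighted average over the lookback dimension.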
Thanks a lot for giving me such a detailed explanation! I now understand way better how you worked with the dimensionality of the tensor. Where I still have problems is line 50 of this code sample. How is the key tensor (k) derived by putting the tensor of shape (n_assets, lookback, 1)
into a fully connected linear layer? There is no gradient or any other optimization technique used to derive how important changes to the value pairs would be with regard to the loss function. Or, generally asked: how is it computed what "the most interesting direction" is? Does it just extract the highest value pairs in the (lookback * n_assets)
tensor? And how can I imagine this interesting direction in general? In the case of the BachelierNet
, does this show where the neurons of the LSTM layer extracted the most salient features?
IMO the code is written in a very unclear way! I was struggling to read it myself :D
Anyway...
https://github.com/jankrepl/deepdow/blob/eb6c85845c45f89e0743b8e8c29ddb69cb78da4f/deepdow/layers/collapse.py#L50
First of all, the shape in the comments always denotes the shape after running that line of code
The linear layer context_vector
has an attribute weight
(context_vector.weight
) which is the key and it is a learnable parameter. By running the forward pass context_vector(x)
we are essentially computing the dot product between context_vector.weight
and x
. Now the x
can have any shape as long as the last dimension is equal to n_channels
. That is x.shape = (whatever1, whatever2, ..., n_channels)
. When we do context_vector(x)
torch
automatically applies the same linear mapping along the 1st, 2nd,...(n-1)th dimension. Since our linear mapping context_vector
has in_features=n_channels
and out_features=1
then we will have
x.shape == (whatever1, whatever2, ...., n_channels)
context_vector(x).shape == (whatever1, whatever2, ..., 1)
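A quick sanity check of this broadcasting behavior (the leading shapes here are arbitrary):

```python
import torch
import torch.nn as nn

n_channels = 4
context_vector = nn.Linear(n_channels, 1, bias=False)

x = torch.randn(7, 5, n_channels)   # (whatever1, whatever2, n_channels)
out = context_vector(x)             # same mapping applied along all leading dims

print(out.shape)  # torch.Size([7, 5, 1])

# Without a bias, the forward pass is just a dot product with the weight:
manual = x @ context_vector.weight.squeeze()   # (7, 5)
assert torch.allclose(out.squeeze(-1), manual)
```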
Anyway, the context_vector.weight
will be trained based on the data. For example, let's assume that after training it will be weight = (1, -1) and the queries are
query_1 = (0, 0)
query_2 = (1, 1)
query_3 = (0.9, -1)
query_4 = (-1, 1)
See below the dot products (similarities)
w_1 = weight * query_1 = 0
w_2 = weight * query_2 = 0
w_3 = weight * query_3 = 1.9
w_4 = weight * query_4 = -2
And you could run it through softmax to get
w_1_scaled = 0.1134
w_2_scaled = 0.1134
w_3_scaled = 0.7579
w_4_scaled = 0.0153
So the 3rd query will get 76% of the attention.
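The numbers above can be reproduced directly:

```python
import torch

weight = torch.tensor([1.0, -1.0])
queries = torch.tensor([[0.0, 0.0],
                        [1.0, 1.0],
                        [0.9, -1.0],
                        [-1.0, 1.0]])

similarities = queries @ weight            # dot products: 0, 0, 1.9, -2
scaled = torch.softmax(similarities, dim=0)

print(similarities)  # tensor([ 0.0000,  0.0000,  1.9000, -2.0000])
print(scaled)        # tensor([0.1134, 0.1134, 0.7579, 0.0153])
```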
Anyway, the principal goal of the AttentionCollapse
is to get rid of the lookback dimension. However, nothing prevents you from using different collapsing strategies, or even coming up with new ones. There are a lot of ways to do the collapsing :)
The way you explain it makes me feel I understand it, but I think I still don't understand what k is and how it is trained.
I'll try to summarise in my own words what I understood so far:
We start with a tensor of shape (n_samples, n_channels, lookback, n_assets)
We form queries for the lookback, asset
pairs: https://github.com/jankrepl/deepdow/blob/eb6c85845c45f89e0743b8e8c29ddb69cb78da4f/deepdow/layers/collapse.py#L49
We compare them to context_vector.weight/k
(?): https://github.com/jankrepl/deepdow/blob/eb6c85845c45f89e0743b8e8c29ddb69cb78da4f/deepdow/layers/collapse.py#L50
We use the resulting weights to collapse the lookback dimension: https://github.com/jankrepl/deepdow/blob/eb6c85845c45f89e0743b8e8c29ddb69cb78da4f/deepdow/layers/collapse.py#L52
If this is correct so far: What does context_vector.weight/k
learn? I understood we calculate the similarity between it and the queries, but I don't understand what context_vector.weight/k
is itself. How does the parameter learn without any optimization technique like gradient descent and without any loss function that would get optimized? How does it calculate this attention with just the forward pass? And how can I imagine "the most interesting direction"? Is it calculating which days in my lookback influence the timestamp of my sample the most?
Now the x
can have any shape as long as the last dimension is equal to n_channels
. That is x.shape = (whatever1, whatever2, ..., n_channels)
. When we do context_vector(x)
, torch
automatically applies the same linear mapping along the 1st, 2nd, ..., (n-1)th dimension. Since our linear mapping context_vector
has in_features=n_channels
and out_features=1
then we will have
x.shape == (whatever1, whatever2, ..., n_channels)
context_vector(x).shape == (whatever1, whatever2, ..., 1)
I think this is probably the part I don't fully comprehend. In the example of the BachelierNet
we would have x.shape = (lookback, n_assets, n_channels)
. So do we gain the queries from the context_vector
or from the affine
? https://github.com/jankrepl/deepdow/blob/eb6c85845c45f89e0743b8e8c29ddb69cb78da4f/deepdow/layers/collapse.py#L49-L50
If the comment describes how the tensor looks after execution of the line, I don't understand what the first affine layer is doing if the context vector is creating the lookback, asset
pairs.
If this is correct so far: What does context_vector.weight/k learn? I understood we calculate the similarity between it and the queries but I don't understand what context_vector.weight/k is itself?
IMO one does not necessarily need to try to "interpret" it and hope that, if we look at it after training, it will make some "financial" sense. Especially if the input tensor to this layer is the output of some black-box pipeline.
And how can I imagine "the most interesting direction". Is it calculating which days in my lookback influence the timestamp of my sample the most?
Yeah, that is literally how I think about it. We want to take a time series and then remove the time dimension from it by computing a weighted average. And the only "tricky" task is to use reasonable weights that can change dynamically based on the input tensor.
How does the parameter learn without any optimization technique like gradient descent and without any loss function that would get optimized?
I think this is the part you are getting wrong. Our goal is to learn context_vector.weight
. At the beginning, it will be just randomly initialized. As long as we make the AttentionCollapse
a part of our network and attach a loss to it then it will be updated during training.
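A toy illustration of this point (not the deepdow training loop): as soon as any loss is attached downstream of the collapse, a single backward pass populates the gradient of the key, so a standard optimizer can update it like any other parameter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
context_vector = nn.Linear(4, 1, bias=False)   # .weight is the key k, randomly initialized

x = torch.randn(10, 5, 4)                       # (batch, "lookback", n_channels)
weights = torch.softmax(context_vector(x), dim=1)
collapsed = (weights * x).sum(dim=1)            # attention-weighted average over dim 1

loss = collapsed.pow(2).mean()                  # any loss attached downstream
loss.backward()

print(context_vector.weight.grad is not None)   # True
```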
I think this is probably the part I don't fully comprehend. In the example of the BachelierNet we would have x.shape = (lookback, n_assets, n_channels). So do we gain the queries from the context_vector or from the affine?
The goal of affine
is to transform the "raw queries" into a new space. Effectively, this makes it more powerful. And it is in this new space that we compare the queries to the context_vector.weight
.
Again, I am by no means claiming that this is the only way to set things up. In general, the attention mechanism can have a lot of different flavors. I don't want you to think that the one implemented in AttentionCollapse
is one of the canonical ones. I would definitely recommend reading "Attention Is All You Need", since the attention mechanism introduced in that paper is prevalent in deep learning nowadays. Actually, one could even use it inside of deepdow
quite easily. The paper describes an attention mechanism that takes in a tensor of shape (batch_size, n_tokens, hidden_dim)
and spits out a new tensor of the same shape. In deepdow
it would be possible to flatten the (batch_size, n_channels, lookback, n_assets)
in some way (see below some ideas)
(batch_size, n_tokens=lookback * n_assets, hidden_dim=n_channels)
(batch_size, n_tokens=n_assets, hidden_dim=n_channels * lookback)
(batch_size, n_tokens=lookback, hidden_dim=n_channels * n_assets)
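For illustration, the first flattening option could be wired to torch.nn.MultiheadAttention like this (just a sketch of the reshaping; the other two options only permute/reshape differently, and none of this is something deepdow ships):

```python
import torch
import torch.nn as nn

batch_size, n_channels, lookback, n_assets = 2, 8, 10, 5

x = torch.randn(batch_size, n_channels, lookback, n_assets)

# Option 1: tokens = (lookback, asset) pairs, hidden_dim = n_channels
tokens = x.permute(0, 2, 3, 1).reshape(batch_size, lookback * n_assets, n_channels)

mha = nn.MultiheadAttention(embed_dim=n_channels, num_heads=2, batch_first=True)
out, _ = mha(tokens, tokens, tokens)   # self-attention, output has the same shape

print(out.shape)  # torch.Size([2, 50, 8])
```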
I think this is the part you are getting wrong. Our goal is to learn context_vector.weight. At the beginning, it will be just randomly initialized. As long as we make the AttentionCollapse a part of our network and attach a loss to it, then it will be updated during training.
You are 100% right. I was not aware that there are learnable weights in this network besides the transformation layer. I also dug deeper into the Linear layer of PyTorch and realised that I was missing some basics there as well. So if somebody comes across this issue and realizes they don't understand this at first, I can recommend the last part of this article: https://deeplizard.com/learn/video/stWU37L91Yc
I will also take a look at your linked paper, but it seems that this will take me some time. Thanks for taking the time to help me understand this!
I have an understanding question about the collapse layers used in the BachelierNet (but also about the use of collapse layers in general). I can't figure out what exactly the
AttentionCollapse
is doing. The documentation says that it's a layer that turns (n_samples, hidden_size, lookback, n_assets) into (n_samples, hidden_size, n_assets) by assigning each timestep in the lookback dimension a weight and then performing a weighted average.
How does it generate these weights? It seems to me that first we change the dimensions of the tensor x and feed it into its own little feed-forward net with two layers, and then multiply the output of this little net with the tensor x with changed dimensionality. Why is this necessary and why is this called attention? The second collapse in the BachelierNet, the AverageCollapse
, is simply an arithmetic mean over one dimension. Why is it possible to reduce the dimensionality from (n_samples, hidden_size, n_assets)
to (n_samples, n_assets)
with a normal average, and why do we need the AttentionCollapse
for the collapse of (n_samples, hidden_size, lookback, n_assets)
into (n_samples, hidden_size, n_assets)
? If there are some explanations (papers, websites) where I can dig deeper into this, I would be really interested to learn more.