Have you had a chance to check out this how-do-I section: https://github.com/Microsoft/CNTK/wiki/Implement-an-attention-mechanism
Yes, but this example is simple, because the variables `question` and `answer` can use the same dynamic axis. In the case of a seq2seq model the attention mechanism depends on the decoder sequence, not only the encoder sequence. Going back to the example, `zq` would depend on the other sequence, e.g. the previous hidden state of the decoder.

@sayanpa, do you know how to deal with that in CNTK from Python?
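To make the axis mismatch concrete, here is a minimal sketch (the variable names and the dimension 128 are illustrative assumptions, not from this thread) of two sequences created on distinct dynamic axes with the CNTK v2 Python API:

```python
# Hypothetical setup: encoder and decoder sequences on separate dynamic axes.
import cntk as C

enc_axis = C.Axis.new_unique_dynamic_axis('encoder')
dec_axis = C.Axis.new_unique_dynamic_axis('decoder')

encoder_hidden = C.sequence.input_variable(128, sequence_axis=enc_axis)  # encoder states
decoder_hidden = C.sequence.input_variable(128, sequence_axis=dec_axis)  # decoder states

# An attention score must combine both, but an elementwise op across two
# different dynamic axes is rejected by CNTK:
# scores = encoder_hidden * decoder_hidden   # -> dynamic-axis mismatch error
```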
Hi, the current approach to get this done in CNTK is to create a "past value window", where you map the elements along a dynamic axis to a fixed-length static axis, with padding where necessary (this is how it's done in all toolkits, but generally you already have things along a static axis and you've done the padding as part of your pre-processing). One way to do this is the following:
```python
# imports assume the CNTK v2 Python API
from cntk.ops import constant, past_value, splice, times
from cntk.ops.sequence import last

# Create a function which returns a static, maskable view for N past steps over a sequence along the given 'axis'.
# It returns two records: a value matrix, shape=(N,dim), and a valid window, shape=(N,1).
def past_value_window(N, x, axis=0):
    # this is to create 1's along the same dynamic axis as `x`
    ones_like_input = times(x, constant(0, shape=(x.shape[0], 1))) + 1

    last_value = []
    last_valid = []
    value = None
    valid = None

    for t in range(N):
        if t == 0:
            value = x
            valid = ones_like_input
        else:
            value = past_value(x, time_step=t)
            valid = past_value(ones_like_input, time_step=t)
        last_value.append(last(value))
        last_valid.append(last(valid))

    # stack rows 'beside' each other, so axis=axis-2 (create a new static axis that doesn't exist)
    value = splice(*last_value, axis=axis-2, name='value')
    valid = splice(*last_valid, axis=axis-2, name='valid')

    # value[t] = value of t steps in the past; valid[t] = true if there was a value t steps in the past
    return (value, valid)
```
In the example above we would pass in, say, the hidden states of an LSTM encoder as `x`, and `N` would be the attention span of our attention model.
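For illustration, here is a hedged sketch of how the returned `value`/`valid` pair could feed a masked softmax over the window. The scoring vector `u`, the window size 20, and the input setup are assumptions, and a real attention model would also condition the scores on the decoder state:

```python
import cntk as C

dim = 128
x = C.sequence.input_variable(dim)          # e.g. LSTM encoder hidden states
value, valid = past_value_window(20, x)     # value: (20, dim), valid: (20, 1)

# toy scoring: one learned projection per window slot
u = C.parameter(shape=(dim, 1), init=C.glorot_uniform())
scores = C.times(value, u)                  # (20, 1)

# push padded slots toward -inf so the softmax gives them ~zero weight
masked = scores + (valid - 1) * 1e9         # valid slot: +0, padded slot: -1e9
weights = C.softmax(masked, axis=0)         # normalize over the 20 window slots
context = C.reduce_sum(weights * value, axis=0)   # weighted sum, shape (1, dim)
```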
Further, the above functionality will soon be available as a layer in the layers library, called `PastValueWindow()`, and that will land in master in the next week or so. In the meantime, you can take a look at an example that uses the above approach to implement a seq2seq model with attention in a private branch, wdarling/lstmaux, in the location
@wdarling, thank you! I see, you just take the `last` vector of each `past_value` sequence. It looks like the BrainScript version. Could you answer a couple more questions: can something like `last(value, -10)` be used to get a value at an offset from the end, and what causes the error in `ReconcileDynamicAxisNode`?
This error happens if you implement the attention mechanism with different dynamic axes for the encoder and the decoder, because `nestedNodes` contains, for some reason, both a `FutureValue` and a `PastValue` node simultaneously. @frankseide, could you please give some hints on how to implement the attention mechanism with the Python interface, if it's possible?
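One possible workaround, sketched here as an assumption rather than a confirmed fix, reuses `past_value_window` from above: collapse the encoder sequence into the static window first, then replicate that window along the decoder's axis with `sequence.broadcast_as`, so no single op ever mixes the two dynamic axes:

```python
import cntk as C

dim = 128
enc_axis = C.Axis.new_unique_dynamic_axis('encoder')
dec_axis = C.Axis.new_unique_dynamic_axis('decoder')
encoder_hidden = C.sequence.input_variable(dim, sequence_axis=enc_axis)
decoder_hidden = C.sequence.input_variable(dim, sequence_axis=dec_axis)

# collapse the encoder sequence to a static (20, dim) window (no sequence axis)
value, valid = past_value_window(20, encoder_hidden)

# replicate the window along the decoder axis; every decoder step now sees
# the same encoder window, on the decoder's own dynamic axis
value_b = C.sequence.broadcast_as(value, decoder_hidden)
valid_b = C.sequence.broadcast_as(valid, decoder_hidden)

# e.g. dot-product scores of each window slot against the decoder state
scores = C.times(value_b, decoder_hidden)   # shape (20,) per decoder step
# mask with valid_b and apply a softmax as in the window sketch above
```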