microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

seq2seq + attention in python #1436

Closed · 5vision closed this issue 7 years ago

5vision commented 7 years ago

This error happens when the attention mechanism is implemented with different dynamic axes for the encoder and decoder, because nestedNodes, for some reason, contains both a FutureValue and a PastValue node simultaneously.

@frankseide could you please give some hints on how to implement the attention mechanism with the Python interface, if that's possible?

sayanpa commented 7 years ago

Have you had a chance to check out this how-do-I section: https://github.com/Microsoft/CNTK/wiki/Implement-an-attention-mechanism

5vision commented 7 years ago

Yes, but that example is simple because the variables question and answer can use the same dynamic axis. In a seq2seq model, the attention mechanism depends on the decoder sequence, not only on the encoder sequence. Going back to the example, zq would depend on the other sequence, e.g. on the previous hidden state of the decoder.
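To make the issue concrete, here is a minimal sketch of the setup being described (the dimensions and axis names are illustrative assumptions, not from this thread): the encoder and decoder sequences are declared over different dynamic axes, so any operation that mixes per-step vectors from both, such as an attention score, cannot combine them directly.

import cntk as C

# Two sequences over *different* dynamic axes, as in a seq2seq model.
# The dimensions (100, 50) and axis names are assumptions for illustration.
input_axis = C.Axis('inputAxis')
label_axis = C.Axis('labelAxis')

encoder_seq = C.sequence.input_variable(shape=(100,), sequence_axis=input_axis)
decoder_seq = C.sequence.input_variable(shape=(50,), sequence_axis=label_axis)

# Anything that zips the two step by step, e.g. encoder_seq + decoder_seq,
# fails because the dynamic axes differ.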

@sayanpa do you know how to deal with that in CNTK from python?

wdarling commented 7 years ago

Hi, the current approach to getting this done in CNTK is to create a "past value window", where you map the elements along a dynamic axis to a fixed-length static axis, with padding where necessary. (This is how it's done in all toolkits, but generally you already have things along a static axis because you did the padding as part of your pre-processing.) One way to do this is the following:

# Imports so the snippet runs standalone (CNTK 2.x API):
from cntk.ops import constant, past_value, splice, times
from cntk.ops.sequence import last

# Create a function which returns a static, maskable view for N past steps over a sequence along the given 'axis'.
# It returns two outputs: a value matrix, shape=(N,dim), and a valid window, shape=(N,1)
def past_value_window(N, x, axis=0):

    # this creates 1's along the same dynamic axis as `x`
    ones_like_input = times(x, constant(0, shape=(x.shape[0], 1))) + 1

    last_value = []
    last_valid = []
    value = None
    valid = None

    for t in range(N):
        if t == 0:
            value = x
            valid = ones_like_input
        else:
            value = past_value(x, time_step=t)
            valid = past_value(ones_like_input, time_step=t)            

        last_value.append(last(value))
        last_valid.append(last(valid))

    # stack rows 'beside' each other, so axis=axis-2 (create a new static axis that doesn't exist)
    value = splice(*last_value, axis=axis-2, name='value')
    valid = splice(*last_valid, axis=axis-2, name='valid')

    # value[t] = value of t steps in the past; valid[t] = true if there was a value t steps in the past
    return (value, valid)

In the example above, we would pass in, say, the hidden states of an LSTM encoder as x, and N would be the attention span of our attention model.
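To show how this connects back to the decoder, here is a hedged usage sketch (h_enc, h_dec, hidden_dim, and the span of 20 are illustrative assumptions, not from this thread): because last() strips the dynamic axis, the returned window is a static tensor, so it can be broadcast along the decoder's dynamic axis with C.sequence.broadcast_as and combined with the decoder state step by step.

import cntk as C

hidden_dim = 128       # assumed encoder hidden size
attention_span = 20    # assumed window length N

# encoder hidden states, and a decoder-side sequence over its own axis
h_enc = C.sequence.input_variable(shape=(hidden_dim,), name='h_enc')
h_dec = C.sequence.input_variable(shape=(hidden_dim,),
                                  sequence_axis=C.Axis('decoderAxis'),
                                  name='h_dec')

value, valid = past_value_window(attention_span, h_enc)

# value/valid carry no dynamic axis anymore, so they can be broadcast
# along the decoder axis and used per decoder step, e.g. for attention scores
value_b = C.sequence.broadcast_as(value, h_dec)
valid_b = C.sequence.broadcast_as(valid, h_dec)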

Further, the above functionality will soon be available as a layer from the layers library called PastValueWindow(), which should land in master in the next week or so. In the meantime, you can take a look at an example that uses the above approach to implement a seq2seq model with attention in a private branch: wdarling/lstmaux, under /Examples/SequenceToSequence/CMUDict/Python.
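For reference, a hedged sketch of what using that layer could look like once it lands; the signature PastValueWindow(window_size, axis) and the two-output convention shown here are assumptions based on later CNTK releases, mirroring the helper above.

from cntk.layers import PastValueWindow

# the layer factory yields a function with two outputs,
# value (N, dim) and valid (N, 1), like past_value_window above
value, valid = PastValueWindow(window_size=20, axis=0)(h_enc).outputs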

5vision commented 7 years ago

@wdarling, thank you! I see, you just take the last vector of each past_value sequence. It looks like the BrainScript version. Could you answer a couple more questions:

  1. Is it possible, or would it be easy, to extend this version to sliding-window attention? For example, to take a shifted vector like last(value, -10).
  2. From my point of view, it would be more efficient to broadcast a previous hidden state of the decoder rather than the last values of the encoder. Why is this not done in the CNTK core, for example somewhere inside the ReconcileDynamicAxisNode?

GuntaButya commented 5 years ago

Try: https://docs.microsoft.com/en-us/cognitive-toolkit/How-do-I-Express-Things-In-Python#implement-an-attention-mechanism

The documentation is on a new site now.