Open varisd opened 6 years ago
I think it is possible with the current implementation. The recurrent encoder can have multiple input sequences (called factors in our code), which must be sequences of the same length. They are concatenated and fed into an RNN. Is that what you need? Factored decoding is not implemented.
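As a rough illustration of what factor concatenation means here (a minimal NumPy sketch with made-up shapes and random values, not the actual Neural Monkey implementation):

```python
import numpy as np

# Two factor sequences of the same length (e.g. tokens and POS tags),
# each already mapped to its own embedding space.
batch_size, max_seq_len = 2, 5
emb_tokens = np.random.rand(batch_size, max_seq_len, 8)  # token embeddings
emb_tags = np.random.rand(batch_size, max_seq_len, 4)    # tag embeddings

# The factors are concatenated along the embedding dimension; the result
# is what the RNN encoder consumes at each time step.
rnn_input = np.concatenate([emb_tokens, emb_tags], axis=-1)
print(rnn_input.shape)  # (2, 5, 12)
```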
If you implement the character-level representation layer inheriting from TemporalStateful, you can plug it in as an additional factor in the encoder. Recently, I did something similar for a different project, so I can help you with that.
Here's an idea that has been on my mind for some time:
Currently NMonkey supports encoding and decoding of sequences of tokens of shape [batch_size, max_seq_len], which are then represented in embedding space as [batch_size, max_seq_len, emb_size]. Let's call these 1D sequences.
However, it would be nice to be able to represent "multidimensional" sequences.
Example: I want to encode a sentence where each word embedding is created from its embedded characters using a separate encoder (the embedding of the word is the final state of the encoder output). The input in this case is a matrix of size [batch_size, max_seq_len, max_word_len] (max_seq_len is the length of the sentence measured in number of words). We then create a representation in two steps: first, a character-level encoder turns each word's character embeddings into a word embedding; second, a word-level encoder processes the resulting sequence of word embeddings as usual.
Let's call this a 2D sequence. This can of course be expanded to nD sequences.
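The two-step 2D encoding above can be sketched in NumPy as follows (a toy mean-over-time function stands in for a learned RNN encoder; all names and shapes are illustrative, not Neural Monkey code):

```python
import numpy as np

# Toy stand-in encoder: the mean over the time axis, in place of an RNN
# whose final state would serve as the learned representation.
def encode(seq):  # seq: [n, time, emb] -> [n, emb]
    return seq.mean(axis=1)

batch_size, max_seq_len, max_word_len, char_emb = 2, 4, 6, 8
# Character-embedded input: one row of character embeddings per word.
chars = np.random.rand(batch_size, max_seq_len, max_word_len, char_emb)

# Step 1: fold batch and sentence dims together, encode each word's
# characters, then unfold back into word embeddings.
flat = chars.reshape(batch_size * max_seq_len, max_word_len, char_emb)
word_embs = encode(flat).reshape(batch_size, max_seq_len, char_emb)

# Step 2: [batch_size, max_seq_len, emb] is now an ordinary 1D sequence
# and can be fed to a sentence-level encoder in the classic way.
sentence_states = encode(word_embs)
print(sentence_states.shape)  # (2, 8)
```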
Technically, this can also be expanded to decoding (first we decode word representations and then decode sequences of characters based on these representations). However, I think this should be left as a separate issue.
What I would like to discuss here is a reasonable approach towards implementing this. My ideas so far:
A class derived from Vocabulary (e.g. MultiDimVocabulary):
-- creates the multidimensional representations of the sentences (based on the input separators)
-- handles padding in each dimension
A Sequence class derivation (e.g. RecursiveSequence):
-- has an encoder as an argument (probably a different encoder/learnable encoder parameters for each layer)
-- in each "representation layer", reshapes [batch_size, dim1, dim2, ..., dimN, emb_size] to [batch_size x dim1 x dim2 x ... x dimN-1, dimN, emb_size1], computes embeddings over the N-th dimension, and returns the reshaped [batch_size, dim1, dim2, ..., dimN-1, emb_size2]; emb_size1 and emb_size2 may differ. The recursion stops when we get to the level of [batch_size, dim1, emb_size]; then we can use an encoder in the classic way.
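One such representation layer could be sketched like this (a hypothetical NumPy sketch; `encode_innermost` is a made-up name, and a mean over time again stands in for a learned encoder):

```python
import numpy as np

# One "representation layer": fold all but the last sequence dimension
# into the batch, encode over the last dimension, and unfold again.
def encode_innermost(x, encoder=lambda s: s.mean(axis=1)):
    # x: [batch, dim1, ..., dimN, emb1] -> [batch, dim1, ..., dimN-1, emb2]
    *outer, dim_n, emb = x.shape
    flat = x.reshape(-1, dim_n, emb)  # [batch*dim1*...*dimN-1, dimN, emb1]
    enc = encoder(flat)               # [batch*dim1*...*dimN-1, emb2]
    return enc.reshape(*outer, enc.shape[-1])

# A 3D sequence: [batch_size, dim1, dim2, dim3, emb_size].
x = np.random.rand(2, 3, 4, 5, 8)
while x.ndim > 3:          # recursion stops at [batch_size, dim1, emb_size]
    x = encode_innermost(x)
print(x.shape)  # (2, 3, 8)
```

In a real implementation each layer would carry its own parameters rather than share one `encoder` callable, as suggested above.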
I am currently not sure whether to perform the "recursive" encoding in the sequence itself or rather define a separate encoder class for this. I am also aware that some changes to the sentence reader will be necessary.
Before I start working on this, I would like to have a discussion on the topic to get some insights. Feel free to contribute any ideas.