mila-iqia / blocks

A Theano framework for building and training neural networks

SequenceGenerator and intractable distributions #760

Open rizar opened 9 years ago

rizar commented 9 years ago

In the current version of the sequence generator framework it is assumed that it is always possible to emit the next token given the contexts and the previously generated tokens. The readout.emit method is supposed to return the respective computation graph.
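The assumed per-step contract can be sketched roughly as follows. This is illustrative only, not the real Blocks signatures: the point is that the emitter is expected to turn readout scores into a tractable distribution and sample from it at every step.

```python
import math
import random

random.seed(0)

def emit(readouts):
    """Sample the next token index from a softmax over readout scores.

    This presumes the per-step distribution is tractable, i.e. that the
    normalizer over the vocabulary can actually be computed.
    """
    m = max(readouts)                       # subtract max for stability
    weights = [math.exp(r - m) for r in readouts]
    z = sum(weights)
    probs = [w / z for w in weights]
    return random.choices(range(len(readouts)), weights=probs, k=1)[0]
```

The issue below is about models where this normalizer cannot be computed, so no such emit can be written.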

The truth is that this is not always possible. For some sequence generation methods only the cost of generating the whole sequence can be defined. This is what we hit in speech recognition, where the cost of a transcript is defined as log P(W|X) + beta * log Q(W) for the whole sequence, making all per-token probabilities intractable.

However, most of sequence_generators.py and search.py could be reused in such cases. This ticket stands for a revision of the SequenceGenerator interface that would make generative semantics optional.

@janchorowski, this is a major ticket on our way to using pure Blocks master in fully-neural-lvsr.

janchorowski commented 9 years ago

Unfortunately I don't see your point. Everything is Markovian, so the probability of the next state/emission always depends only on the current state/input.

The fact that in some models there are states which disallow certain symbols does not violate the Markovian assumption. And technically, you can always assume that they do allow all outputs, but that some have an infinitesimally small probability.

rizar commented 9 years ago

I am not saying that P(y_t|s_1, ..., s_t) is not P(y_t|s_t). What I am saying is that P(y_t|s_t) is sometimes not available.

An example: suppose we are combining P(W|X) computed by a neural net with Q(W) from a language model in a multiplicative way: COST(W, X) = P(W|X)Q(W). In such cases we typically minimize log COST = log P(W_1|X) + log P(W_2|W_1, X) + ... + log Q(W_1) + log Q(W_2|W_1) + ... This way additive scores are defined for each character W_i, but the resulting probability of W_i given W_1, ..., W_{i-1} under the joint model is in fact intractable (computing it would require normalization over all possible W_i).
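The additive decomposition can be made concrete with a toy example. The per-token log-scores and the LM weight below are made-up numbers, purely for illustration: the sequence-level cost is a plain sum of per-token scores, yet no individual summand is a normalized log-probability under the joint model.

```python
# Hypothetical per-token log-scores for a 3-token sequence W.
# log_p[i] ~ log P(W_i | W_<i, X) from the neural net,
# log_q[i] ~ log Q(W_i | W_<i)    from the language model.
log_p = [-0.5, -1.2, -0.3]
log_q = [-0.9, -0.4, -1.1]
beta = 0.5  # LM weight (illustrative)

# Additive score for each token under the joint model:
per_token = [lp + beta * lq for lp, lq in zip(log_p, log_q)]

# The whole-sequence cost is just their sum:
log_cost = sum(per_token)

# But exp(per_token[i]) is NOT a probability of W_i given the prefix:
# normalizing it would require summing exp(score) over the entire
# vocabulary at step i under the joint model, which is intractable.
```

So the cost is perfectly well defined for a full sequence, while the per-step conditional distribution the current Emitter interface assumes is not.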

On the other hand, the current code and documentation of SequenceGenerator assume that this conditional probability is always available and that one can always sample from the distribution defined by the SequenceGenerator. What I propose here is that in its most generic form a SequenceGenerator should be just a formula for computing COST(W, X), without assuming that this cost is always a log-likelihood. The generate method should become optional, and the emit method of the Emitter interface should become optional.
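The proposed split could look roughly like this. This is a sketch of the idea, not Blocks' actual class hierarchy; all names and signatures here are hypothetical. The generic contract requires only a sequence-level cost, while sampling stays available for models whose per-step distributions are tractable.

```python
from abc import ABC, abstractmethod

class AbstractSequenceGenerator(ABC):
    """Generic contract: only a whole-sequence cost is required."""

    @abstractmethod
    def cost(self, outputs, **contexts):
        """Return COST(W, X) for a full output sequence W given contexts X."""

    def generate(self, **contexts):
        """Optional: only meaningful when the cost is a log-likelihood
        with tractable per-step conditionals."""
        raise NotImplementedError("this generator cannot sample")

class AbstractEmitter(ABC):
    """Emitter counterpart: per-step additive scores are mandatory,
    emission is optional."""

    @abstractmethod
    def cost(self, readouts, outputs):
        """Additive per-step score; always defined."""

    def emit(self, readouts):
        """Optional: sample the next token from the readouts."""
        raise NotImplementedError("emission is intractable for this model")
```

A model combining a network with a language model multiplicatively would then subclass these, implement only cost, and still reuse the shared training and beam-search machinery.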