mila-iqia / blocks

A Theano framework for building and training neural networks

Monitoring of non-Theano channels. #349

Closed rizar closed 9 years ago

rizar commented 9 years ago

A good example of when this is necessary: BLEU score or WER on the validation set.

The main challenge is that we again need aggregation schemes, and I do not think there is any chance of sharing code with the aggregation schemes for Theano variables without making it very hard to read.

So I think about:

class MonitoredQuantity(object):
    def initialize(self):
        pass

    def accumulate(self, batch):
        pass

    def readout(self):
        pass

just as we have for Theano variables, except that there is no longer any need for separate Aggregator and AggregationScheme classes. In a way, the class is the scheme and the object is the aggregator.

In most cases the user will simply pass a callback, and their callback will be wrapped into a default MonitoredQuantity descendant that does simple averaging. I would put this logic into DataStreamMonitoring, whereas the DatasetEvaluator would simply call the initialize, accumulate and readout methods.
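
A minimal sketch of such a default wrapper, assuming the callback maps a batch to a single number (the name AveragedCallback and its details are illustrative, not an existing Blocks class):

class AveragedCallback(MonitoredQuantity):
    """Wrap a plain callback and average its per-batch values.

    Illustrative sketch only; not an existing Blocks class.
    """
    def __init__(self, callback):
        self.callback = callback

    def initialize(self):
        self.total, self.batches_seen = 0.0, 0

    def accumulate(self, batch):
        # Assumes the callback returns one number per batch.
        self.total += self.callback(batch)
        self.batches_seen += 1

    def readout(self):
        return self.total / self.batches_seen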

If you guys like this plan, I think I will implement it pretty soon.

bartvm commented 9 years ago

@mohammadpz was asked to implement this for speech recognition, and I already discussed this with him. I proposed a simpler idea though, considering you said you were against implementing any kind of aggregation logic.

My idea then was to simply pass a callback function (basically the accumulate part only) and add a list of values to the log which you would have to post-process/aggregate manually afterwards.

If you want to add aggregation, that's fine by me, but two things about your pseudo-code: your accumulate function needs not only the batch; it might also need the outputs of some of your Theano functions. You'll need to be able to define which ones it expects. You will also want to be able to define which sources of the batch it should be given; otherwise you will need to rewrite your monitoring channel each time (e.g. for the supervised and unsupervised case) if your model has different input sources.

So something like this: DatasetEvaluator(callback=(BLEU, ('features',), [beam_search_output])) would make accumulate expect the input sentence and the predicted sentence as an output.

Maybe this is also a good moment to lift the requirement that our data stream provides exactly the same sources as the model inputs. It's plausible that you will want to monitor something here that takes data as an input that doesn't need to go into the model.

Lastly, thinking about it, all of this logic is almost too general just for monitoring, and we might want to call it "callback" instead. For example, this would be a nice place to save the samples of a denoising model to a figure. If you force that into a separate extension you're forced to reimplement the logic for compiling the Theano function, iterating over the validation stream, etc. which would all be handled here.

rizar commented 9 years ago

On 25.02.2015 13:32, Bart van Merriënboer wrote:

@mohammadpz was asked to implement this for speech recognition, and I already discussed this with him. I proposed a simpler idea though, considering you said you were against implementing any kind of aggregation logic.

While it is true that I was against it, it now seems not so difficult, under the assumption that we do not try to integrate with the aggregation of Theano variables. That would be hell.

My idea then was to simply pass a callback function (basically the accumulate part only) and add a list of values to the log which you would have to post-process/aggregate manually afterwards.

Hmm, that would make the log look quite ugly. Currently the printing extension prints all channels that do not start with an underscore; I would not like it to print these lists.

If you want to add aggregation, that's fine by me, but two things about your pseudo-code: your accumulate function needs not only the batch; it might also need the outputs of some of your Theano functions.

The same for the callback you propose, right?

You'll need to be able to define which ones it expects. You will also want to be able to define which sources of the batch it should be given; otherwise you will need to rewrite your monitoring channel each time (e.g. for the supervised and unsupervised case) if your model has different input sources.

The data format with which DatasetEvaluator operates is a dictionary of data from different sources. Fortunately, the PyLearn2-reminiscent problem you anticipate above does not seem to apply.

So something like this: DatasetEvaluator(callback=(BLEU, ('features',), [beam_search_output])) would make accumulate expect the input sentence and the predicted sentence as an output.

I am not sure I understand this code, and also the beam search output is not a Theano variable: it is computed mostly in numpy.

Maybe this is also a good moment to lift the requirement that our data stream provides exactly the same sources as the model inputs. It's plausible that you will want to monitor something here that takes data as an input that doesn't need to go into the model.

That's quite a good point. But this requirement is a very good protection from non-trivial bugs... I would make it switchable and on by default.

Lastly, thinking about it, all of this logic is almost too general just for monitoring, and we might want to call it "callback" instead. For example, this would be a nice place to save the samples of a denoising model to a figure. If you force that into a separate extension you're forced to reimplement the logic for compiling the Theano function, iterating over the validation stream, etc. which would all be handled here.

No problem, it can be used for other stuff, but I think it is better that, terminology-wise, we stick to the very clear use case of BLEU and WER. That way we will have readable code with no extra abstraction layer.


bartvm commented 9 years ago

For printing I figured we could do something like print str(value)[:25] + '...' if len(str(value)) > 25 else value. But let's go with aggregation; I like that much better. That way you can at least see the actual BLEU score instead of only calculating it afterwards (allowing you to e.g. do early stopping on it).
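
As a sketch, that truncation rule could live in a small helper (the name format_log_value is made up for illustration):

def format_log_value(value, width=25):
    """Truncate long values so list-valued channels don't flood the log."""
    text = str(value)
    return text[:width] + '...' if len(text) > width else text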

The same for the callback you propose, right?

My point is just the following: any monitored quantity should be able to specify a list of data stream sources it needs, as well as a list of Theano variables whose values it wants. These Theano variables can then be compiled together with the rest of the DatasetEvaluator. Let's say I want to calculate the WER: I will need the target sentence (from the data stream) and the output sentence (let's say without beam search, so a Theano variable). I don't see the point of monitoring on the batches of the data stream alone. Consider this example:

class LevenshteinDistance(MonitoredQuantity):
    """Edit distance between language model predictions and targets.

    Assumes on-line learning (i.e. only one example per batch).

    """
    def __init__(self, vocab):
        self.vocab = vocab

    def initialize(self):
        self.total_distance, self.examples_seen = 0, 0

    def accumulate(self, batch, outputs):
        target_words, = batch
        predicted_words, = outputs
        self.total_distance += levenshtein_distance(
            ' '.join(self.vocab[word] for word in target_words),
            ' '.join(self.vocab[word] for word in predicted_words)
        )
        self.examples_seen += 1

    def readout(self):
        return self.total_distance / float(self.examples_seen)
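
The levenshtein_distance helper is left undefined here; a minimal word-level implementation via the classic dynamic program could look like this (a sketch, not part of the proposal):

def levenshtein_distance(source, target):
    """Word-level edit distance between two sentences (strings)."""
    source, target = source.split(), target.split()
    # distances[j] is the edit distance between the current prefix of
    # source and the first j words of target.
    distances = list(range(len(target) + 1))
    for i, source_word in enumerate(source, start=1):
        previous, distances = distances, [i]
        for j, target_word in enumerate(target, start=1):
            distances.append(min(
                previous[j] + 1,                                # deletion
                distances[j - 1] + 1,                           # insertion
                previous[j - 1] + (source_word != target_word)  # substitution
            ))
    return distances[-1]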

This would then be invoked by something like:

DatasetEvaluator(callbacks=[(LevenshteinDistance, ('targets',), [tensor.argmax(probs, axis=1)], vocab)])

Telling the DatasetEvaluator to compile the expression tensor.argmax(probs, axis=1) with the rest of the Theano variables. It then calls LevenshteinDistance.accumulate each iteration with the requested data from the data stream ('targets', e.g. the next 3 words which the language model had to predict) and the outputs of the requested Theano variables (the predictions in this case).

mpezeshki commented 9 years ago

@bartvm @rizar, thank you guys. I'll try to understand your ideas and get back to you asap.

rizar commented 9 years ago

Okay, I like your idea with outputs, it is crystal clear. I still think that requesting sources is redundant and complicates the picture. Why not give accumulate the dictionary for the whole batch?

I would make the requested Theano variables part of the MonitoredQuantity interface. That is:

DatasetEvaluator(quantities=[
    # supported now
    a_variable, 
    # the innovation
    LevenshteinDistance(requires=[tensor.argmax(probs, axis=1)]),
    # a shortcut for creating simple monitored quantities
    a_callback,  
])
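
For concreteness, a rough sketch of the evaluation loop this interface implies, assuming every entry is a MonitoredQuantity with a requires list of Theano variables, that the whole batch dictionary is passed to accumulate, and that stream iteration follows the usual get_epoch_iterator(as_dict=True) convention (class and attribute names are illustrative, not the actual Blocks implementation):

import theano

class SketchDatasetEvaluator(object):
    """Illustrative sketch of how monitored quantities could be driven."""
    def __init__(self, quantities, model_inputs):
        self.quantities = quantities
        self.model_inputs = model_inputs
        # Compile all requested variables into a single Theano function.
        requested = [var for q in quantities for var in q.requires]
        self.function = theano.function(model_inputs, requested)

    def evaluate(self, data_stream):
        for quantity in self.quantities:
            quantity.initialize()
        for batch in data_stream.get_epoch_iterator(as_dict=True):
            # Match model inputs to stream sources by variable name.
            outputs = self.function(
                *[batch[var.name] for var in self.model_inputs])
            # Hand each quantity the whole batch dict plus its own
            # slice of the computed outputs.
            start = 0
            for quantity in self.quantities:
                stop = start + len(quantity.requires)
                quantity.accumulate(batch, outputs[start:stop])
                start = stop
        return [quantity.readout() for quantity in self.quantities]
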
bartvm commented 9 years ago

Because you don't know what the names of the sources are. What might be called features in one dataset could be called input_words or features0 in another. You can't rely on a particular naming scheme for sources inside of the accumulate function, so you need to tell the evaluator somehow which ones to pass and in what order, so that the accumulate function knows how to consume them.

The alternative is to use a kind of RenameStream, but that gets really messy, because then the user also needs to change the names of his variables, just to please a monitoring channel.

rizar commented 9 years ago

If you do not want to rely on source names, you can add the input variables of your computation graph as requests. I just do not want two different ways of requesting data for one poor little class.

bartvm commented 9 years ago

Two things:

First, that removes the option of requesting sources that the model doesn't use, which I think is important: there might be some sort of metadata that my model is not aware of but that I want to use for monitoring. As a random example, let's say my data comes from different domains and I want to check the difference in BLEU score between these domains. I could easily add a data source with this information and feed it to the extension, but there is no Theano variable for it.

Secondly, how intelligent is Theano when it comes to outputting input variables? I just want to make sure we're not doubling the memory used, or worse, transferring data to the GPU and immediately transferring it back.

I understand that it seems overly complicated, but I gave it a bit of thought yesterday, and so far I don't see a simpler solution that is flexible enough. This is a pretty important callback, which should be able to do all sorts of crazy monitoring things, so I don't like the idea of hardcoding source names, or limiting it to the same data the model consumes.

rizar commented 9 years ago

First: well, you can create a stand-alone variable with the same name as the source you need, and it will be matched with the corresponding data somewhere in DatasetEvaluator. This case is not an exception to the general scheme.

Second: if Theano is inefficient, we can handle input variables differently under the hood. The user would keep using variables only, which is very Block-ish. The data-source abstraction would be prevented from spreading all over the Blocks code.

bartvm commented 9 years ago

Not sure I find it Block-ish to create fake Theano variables that aren't part of any computation graph, but I guess it could keep the interface a bit cleaner.

I think that if we do that, we should still ease the restriction that the sources and input variables match exactly. I would find it annoying if I removed a MonitoredQuantity and suddenly got an error because my data stream now provides a source that isn't used anymore. I think we should just raise a warning when there are unused sources, something along the lines of: "WARNING: Your data stream is providing sources that are not being used by the model; it might be more efficient to not load these sources at all."
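
A sketch of that check, assuming the evaluator can see the stream's source names and the model's input variables (the helper name is made up):

import warnings

def warn_about_unused_sources(stream_sources, model_inputs):
    """Warn, rather than error, when the stream provides extra sources."""
    unused = set(stream_sources) - set(var.name for var in model_inputs)
    if unused:
        warnings.warn(
            "Your data stream is providing sources that are not being "
            "used by the model ({}); it might be more efficient to not "
            "load these sources at all.".format(', '.join(sorted(unused))))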

bartvm commented 9 years ago

@mohammadpz I think we've converged on a first concept :) Also, @rizar's interface as proposed in https://github.com/bartvm/blocks/issues/349#issuecomment-75962309 is nicer than my idea of passing a tuple to callbacks=, so let's stick with that. Let me know if you have any questions, this has been a lengthy discussion, like usual :p

rizar commented 9 years ago

In DatasetEvaluator we do not have such a restriction at all, so removing your quantity will not change anything. But I do not mind switching to a warning there anyway.

Regarding the implementation: it will be way easier if in the first draft we do not try to integrate with the AggregationBuffer, but instead compile a separate function to provide inputs for these monitored quantities, even though this can be twice as slow in cases where everything has to be recomputed. After that we can try to have all computations done in one Theano function.

mpezeshki commented 9 years ago

@bartvm , sure :)

bartvm commented 9 years ago

Just had a look, and it does seem we need to handle input variables for sources that are not part of the graph separately. If you do theano.function([x], [x]) it results in a DeepCopyOp, which is clearly overkill. However, maybe we can leave this optimization for a later ticket (let's just not forget to make the new issue after completing this one).
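
For reference, this is easy to verify with a standard Theano install:

import theano
import theano.tensor as tensor

x = tensor.matrix('x')
f = theano.function([x], [x])
# Printing the compiled graph shows the output wrapped in a
# DeepCopyOp instead of being returned as-is.
theano.printing.debugprint(f)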

dwf commented 9 years ago

Addressed in #524.