mila-iqia / fuel

A data pipeline framework for machine learning
MIT License

Bug in Transformer.get_epoch_iterator #106

Open vdumoulin opened 9 years ago

vdumoulin commented 9 years ago

Here's a minimal example reproducing the bug:

import numpy

from fuel.datasets import IndexableDataset
from fuel.streams import DataStream
from fuel.schemes import SequentialScheme
from fuel.transformers import Mapping

def run(bug=True):
    # default_stream applies the dataset's default_transformers, so the
    # resulting stream is a Mapping transformer; a plain DataStream is not.
    constructor = DataStream.default_stream if bug else DataStream
    dataset = IndexableDataset(numpy.arange(100))
    dataset.default_transformers = ((Mapping, [lambda d: (2 * d[0],)], dict()),)
    data_stream = constructor(dataset, iteration_scheme=SequentialScheme(100, 10))
    # Two epoch iterators over the same stream: creating the second one
    # replaces the transformer's child_epoch_iterator, which the first one
    # also depends on.
    iterator_1 = data_stream.get_epoch_iterator()
    iterator_2 = data_stream.get_epoch_iterator()
    for batch in iterator_2:
        pass
    next(iterator_1)  # crashes when bug=True

The problem is that Transformer sets its child_epoch_iterator attribute every time get_epoch_iterator is called. Even though the two iterators are distinct objects, the data stream they point to holds a single, shared child epoch iterator.
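For context, the relevant part of Transformer.get_epoch_iterator looks roughly like the sketch below (simplified; the real method has more around it): every call installs a new child iterator on the transformer itself, so epoch iterators created earlier silently start reading from whichever child iterator was installed last.

from fuel.streams import AbstractDataStream


class Transformer(AbstractDataStream):
    # Simplified sketch of the part of Fuel's Transformer relevant here.
    def get_epoch_iterator(self, **kwargs):
        # Every call replaces the transformer's single child iterator,
        # no matter how many epoch iterators over it are still in use.
        self.child_epoch_iterator = self.data_stream.get_epoch_iterator()
        return super(Transformer, self).get_epoch_iterator(**kwargs)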

This is an issue that can arise, e.g., in Blocks when someone reuses the same data stream for monitoring and for training. The obvious answer is "don't ever use a data stream for more than one purpose", but it took me quite some time to figure out what the problem was when I was made aware of the crash. It would be good if we could find a better long-term solution.

rizar commented 9 years ago

Data streams in Fuel are indeed stateful, but that is by design. Perhaps we should raise a warning when get_epoch_iterator overrides a non-empty child_epoch_iterator?

vdumoulin commented 9 years ago

How would you define non-empty? That next(child_epoch_iterator) does not raise StopIteration?

rizar commented 9 years ago

This seems to be the only possible solution, although it looks sort of ugly.
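For illustration, checking "non-empty" on a plain Python iterator can only be done destructively; a hypothetical helper (not Fuel API) would have to pull one item and stitch it back on with itertools.chain, which is where the ugliness comes from:

import itertools
import warnings


def warn_if_not_exhausted(child_iterator):
    # Hypothetical helper: probe the iterator; if it still yields data,
    # warn and put the probed item back in front of the remaining ones.
    try:
        head = next(child_iterator)
    except StopIteration:
        return child_iterator
    warnings.warn("get_epoch_iterator called before the previous child "
                  "epoch iterator was exhausted")
    return itertools.chain([head], child_iterator)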

In fact, a data stream is something like a file, so maybe we should emulate the file behaviour at this point:

In [2]: from StringIO import StringIO

In [3]: s = StringIO('a\nb\nc\nd')

In [4]: s
Out[4]: <StringIO.StringIO instance at 0x7efdeab5ac68>

In [5]: it = iter(s)

In [6]: next(it)
Out[6]: 'a\n'

In [7]: it2 = iter(s)

In [8]: next(it2)
Out[8]: 'b\n'

In [9]: next(it)
Out[9]: 'c\n'

That is, do not reset the child_epoch_iterator. What do you think? It still leaves it to the user not to reuse a data stream, though.
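A rough sketch of how that could look inside Transformer (hypothetical, not the current Fuel code): create a child iterator only when none exists yet, so successive epoch iterators keep consuming the same underlying position, much like iterating a file twice; only an explicit reset would start over.

class Transformer(AbstractDataStream):
    # Hypothetical file-like variant of the method sketched earlier.
    def get_epoch_iterator(self, **kwargs):
        # Reuse the existing child iterator instead of replacing it on
        # every call; only an explicit reset() would clear it.
        if getattr(self, 'child_epoch_iterator', None) is None:
            self.child_epoch_iterator = self.data_stream.get_epoch_iterator()
        return super(Transformer, self).get_epoch_iterator(**kwargs)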

vdumoulin commented 9 years ago

Sounds good. I'd document that thoroughly, though. How would you go about implementing that in practice?

rizar commented 9 years ago

After sleeping on it, this doesn't sound so good anymore. The main loop relies on the fact that get_epoch_iterator does reset the data stream; more specifically, that it switches the data stream to the next epoch, which for all currently available datasets is equivalent to calling reset (although there is some capacity in the core for epochs that differ).
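For reference, the pattern in question looks something like the generic loop below (a sketch, not Blocks' actual main loop; n_epochs and train_step are placeholders): each call to get_epoch_iterator is expected to start a fresh pass over the data.

for epoch in range(n_epochs):
    # Each call is expected to rewind the stream and start a new epoch.
    for batch in data_stream.get_epoch_iterator(as_dict=True):
        train_step(batch)  # placeholder for one training update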

That being said, I think this is a documentation issue and no changes to the code should be made.