sebastien-j closed this issue 9 years ago
Oh, that one we already have. It needs three Fuel transformers: `Batch` creates big batches of, say, 1000 elements; `Mapping` with `SortMapping` sorts the examples within each batch; and `Unpack` transforms the stream of batches back into a stream of examples. Snippet from my phoneme recognition scripts:
```python
if sort_k_batches:
    assert batch_size
    stream = Batch(stream,
                   iteration_scheme=ConstantScheme(
                       batch_size * sort_k_batches))
    stream = Mapping(stream, SortMapping(_length))
    stream = Unpack(stream)
```
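To make the effect of the three transformers concrete, here is a plain-Python sketch of the same batch → sort-within-batch → unpack idea on a toy stream of variable-length examples (this deliberately avoids Fuel itself; the three helper functions are hypothetical stand-ins, not Fuel APIs):

```python
def batch(stream, size):
    # Group consecutive examples into lists of `size` (like Fuel's Batch).
    buf = []
    for example in stream:
        buf.append(example)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

def sort_within(batches, key):
    # Sort each batch internally (like Mapping with SortMapping).
    for b in batches:
        yield sorted(b, key=key)

def unpack(batches):
    # Flatten the batches back into a stream of examples (like Unpack).
    for b in batches:
        for example in b:
            yield example

# Toy stream of "sentences" of varying length.
stream = [[0] * n for n in (5, 2, 9, 1, 7, 3)]
out = list(unpack(sort_within(batch(iter(stream), 3), key=len)))
# Within each group of 3, examples are now ordered by length:
# lengths (5, 2, 9) -> (2, 5, 9) and (1, 7, 3) -> (1, 3, 7).
```

The stream as a whole keeps its original granularity (one example at a time), but neighbouring examples now have similar lengths, which is what makes the later re-batching efficient.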
I've adapted the above snippet for MT experiments as follows:

```python
stream = Filter(stream, predicate=too_long)
stream = Mapping(stream, oov_to_unk)
stream = Batch(stream, iteration_scheme=ConstantScheme(80 * 12))
stream = Mapping(stream, SortMapping(_length))
stream = Unpack(stream)
stream = Batch(stream, iteration_scheme=ConstantScheme(80))
masked_stream = Padding(stream)
```
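For context, the point of sorting before the final `Batch(…, ConstantScheme(80))` is that each batch of 80 then contains sentences of similar length, so the `Padding` step wastes very little computation on padding tokens. A minimal plain-Python sketch of what padding with a mask looks like (`pad_batch` is a hypothetical helper, not Fuel's `Padding` transformer itself):

```python
def pad_batch(batch, pad_value=0):
    # Pad every sequence in the batch to the length of the longest one,
    # and return a parallel 0/1 mask marking the real (non-padding) positions.
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in batch]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask

# A length-homogeneous batch needs almost no padding:
sorted_batch = [[1, 2], [3, 4, 5]]
padded, mask = pad_batch(sorted_batch)
# padded == [[1, 2, 0], [3, 4, 5]]
# mask   == [[1, 1, 0], [1, 1, 1]]
```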
But this raises a pickling error because of the last `Unpack` and `Batch` pair, which somehow implicitly generates a `<type 'iterator'>`. Any clues or workaround suggestions?
There was an issue with `picklable_itertools`, see https://github.com/bartvm/fuel/issues/44. Try updating it and that should help.
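For reference, the reason an implicit `<type 'iterator'>` in the stream breaks checkpointing is that ordinary Python generators and iterators cannot be serialized by the standard library's `pickle`, which is exactly what `picklable_itertools` works around. A quick demonstration:

```python
import pickle

gen = (x * x for x in range(10))  # a plain generator object
try:
    pickle.dumps(gen)
    picklable = True
except TypeError:
    # Generators (and many built-in iterators) raise TypeError under pickle.
    picklable = False
```

Any transformer that stores such an object as internal state makes the whole stream unpicklable.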
As was done previously in Groundhog, it would be beneficial to merge a few batches together, order the sentences by length, and create new homogeneous batches.
Related to issue #4 (large vocabulary)