kyunghyuncho / NMT


Update data iterator #6

Closed: sebastien-j closed this issue 9 years ago

sebastien-j commented 9 years ago

As was done previously in Groundhog, it would be beneficial to merge a few batches together, sort the sentences by length, and create new, length-homogeneous batches.

Related to issue #4 (large vocabulary)
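
A framework-free sketch of the idea, with illustrative names and sizes that are not part of the actual iterator, could look like this:

from itertools import islice

def sorted_batches(sentences, batch_size, k):
    # Illustrative only: read k * batch_size sentences, sort them by
    # length, and re-slice them into length-homogeneous batches.
    iterator = iter(sentences)
    while True:
        chunk = list(islice(iterator, batch_size * k))
        if not chunk:
            break
        chunk.sort(key=len)
        for i in range(0, len(chunk), batch_size):
            yield chunk[i:i + batch_size]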

rizar commented 9 years ago

Oh, we already have that one. It needs three Fuel transformers: Batch creates big batches of, say, 1000 examples, Mapping with SortMapping sorts the examples within each such batch, and Unpack turns the stream of batches back into a stream of examples. Snippet from my phoneme recognition scripts:

if sort_k_batches:
    assert batch_size
    # Collect sort_k_batches * batch_size examples into one big batch.
    stream = Batch(stream,
                   iteration_scheme=ConstantScheme(
                       batch_size * sort_k_batches))
    # Sort the examples within the big batch by length.
    stream = Mapping(stream, SortMapping(_length))
    # Turn the stream of sorted batches back into a stream of examples.
    stream = Unpack(stream)
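
For reference, the same pattern might look like this end to end on toy data; the IterableDataset, its 'words' source, and the _length helper below are illustrative stand-ins, not taken from this repository:

from fuel.datasets import IterableDataset
from fuel.schemes import ConstantScheme
from fuel.streams import DataStream
from fuel.transformers import Batch, Mapping, SortMapping, Unpack

# Toy dataset with a single source of variable-length sequences.
dataset = IterableDataset({'words': [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]]})
stream = DataStream(dataset)

def _length(example):
    # Sort key: length of the (single) source in each example tuple.
    return len(example[0])

stream = Batch(stream, iteration_scheme=ConstantScheme(4))  # gather 4 examples
stream = Mapping(stream, SortMapping(_length))              # sort them by length
stream = Unpack(stream)                                     # back to single examples
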
orhanf commented 9 years ago

I've adapted the above snippet for MT experiments as follows:

stream = Filter(stream, predicate=too_long)               # drop sentence pairs that are too long
stream = Mapping(stream, oov_to_unk)                      # map out-of-vocabulary ids to <UNK>
stream = Batch(stream, iteration_scheme=ConstantScheme(80 * 12))  # merge 12 batches of 80
stream = Mapping(stream, SortMapping(_length))            # sort the merged batch by length
stream = Unpack(stream)                                   # back to a stream of examples
stream = Batch(stream, iteration_scheme=ConstantScheme(80))       # re-batch into batches of 80
masked_stream = Padding(stream)                           # pad sequences and add mask sources
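
For anyone reading along, too_long, oov_to_unk and _length are defined elsewhere in the script; they might look roughly like the following sketch, where the length limit, vocabulary sizes, and <UNK> id are illustrative assumptions:

# Illustrative sketch only; the real helpers live elsewhere in the script.
SEQ_LEN = 50            # assumed maximum sentence length
SRC_VOCAB_SIZE = 30000  # assumed source vocabulary size
TRG_VOCAB_SIZE = 30000  # assumed target vocabulary size
UNK_ID = 1              # assumed id reserved for <UNK>

def too_long(sentence_pair):
    # Keep a (source, target) pair only if both sides fit the length limit.
    return all(len(sentence) <= SEQ_LEN for sentence in sentence_pair)

def oov_to_unk(sentence_pair):
    # Replace out-of-vocabulary word ids by the <UNK> id on both sides.
    source, target = sentence_pair
    return ([w if w < SRC_VOCAB_SIZE else UNK_ID for w in source],
            [w if w < TRG_VOCAB_SIZE else UNK_ID for w in target])

def _length(sentence_pair):
    # Sort key: length of the target sentence.
    return len(sentence_pair[-1])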

But this raises a pickling error because of the last Unpack and Batch pair, which implicitly generates a <type 'iterator'> somehow. Any clues or workaround suggestions?

rizar commented 9 years ago

There was an issue with picklable_itertools; see https://github.com/bartvm/fuel/issues/44. Try updating it, and that should help.

orhanf commented 9 years ago

Done here with the latest commit