Open varisd opened 6 years ago
I guess a good solution would be to add a new implementation of Dataset
that keeps the same interface but does the batching differently.
We briefly discussed the issue with @tomkocmi, and support for t2t-datasets might not be necessary. However, "dynamic" batching (and possibly bucketing) would still be a nice extension for the Dataset.
Add a reader for the dataset files prepared by t2t-dataset.
Motivation: t2t-dataset preprocesses files in a way that allows more efficient batching, which should lead to more efficient training. The implementation will also require modifying the current batching scheme: "dynamic" batch sizes, where the number of sentences in a batch depends on the lengths of the sentences in it.
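
To make the "dynamic" batching idea concrete, here is a rough sketch in plain Python. It is not the existing Dataset code and not how t2t-dataset does it. The function name `dynamic_batches` and the `max_tokens` budget are hypothetical, and sorting by length is used only as a crude stand-in for bucketing. The point is that the number of sentences per batch shrinks as the sentences get longer, so the padded batch size stays roughly constant.

```python
from typing import Iterable, Iterator, List, Sequence


def dynamic_batches(
    sentences: Iterable[Sequence[str]],
    max_tokens: int,
) -> Iterator[List[Sequence[str]]]:
    """Yield batches whose padded size stays within max_tokens.

    Padded size = (number of sentences) * (length of the longest sentence).
    `dynamic_batches` and `max_tokens` are hypothetical names for this sketch.
    """
    # Sorting by length puts similar sentences next to each other, which
    # keeps padding low (a simple substitute for real bucketing).
    ordered = sorted(sentences, key=len)

    batch: List[Sequence[str]] = []
    for sentence in ordered:
        # The input is sorted in ascending order, so the current sentence
        # would be the longest one in the batch if we added it.
        padded_size = (len(batch) + 1) * len(sentence)
        if batch and padded_size > max_tokens:
            yield batch
            batch = []
        batch.append(sentence)

    if batch:
        yield batch


# Toy corpus with sentence lengths 5, 2, 9, 3 and 2 tokens.
corpus = [["tok"] * n for n in (5, 2, 9, 3, 2)]
batch_sizes = [len(b) for b in dynamic_batches(corpus, max_tokens=10)]
print(batch_sizes)  # [3, 1, 1]
```

Note that a sentence longer than `max_tokens` still ends up in a batch of its own in this sketch, and that a real implementation would probably shuffle the batches afterwards so that training does not always see short sentences first.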