Reader for t2t-datasets

ufal / neuralmonkey

An open-source tool for sequence learning in NLP built on TensorFlow.

BSD 3-Clause "New" or "Revised" License


Reader for t2t-datasets #681

Open varisd opened 6 years ago

varisd commented 6 years ago

Add a reader for the dataset files prepared by t2t-dataset.

Motivation: t2t-dataset preprocesses files in a way that allows more efficient batching, which should make training more efficient. The implementation will also require changing the current batching scheme to "dynamic" batch sizes, where the number of sentences in a batch depends on the lengths of the sentences in that batch.
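To make the "dynamic" batch size idea concrete, here is a minimal sketch, assuming a batch is limited by a token budget rather than by a fixed sentence count. The function name `token_budget_batches` and the `max_tokens` parameter are hypothetical and are not part of the Neural Monkey or t2t APIs.

```python
from typing import Iterable, Iterator, List


def token_budget_batches(
    sentences: Iterable[List[str]], max_tokens: int
) -> Iterator[List[List[str]]]:
    """Yield batches whose padded size stays within max_tokens.

    Padded size = (length of the longest sentence) * (number of sentences).
    A single sentence longer than max_tokens is still emitted, alone.
    """
    batch: List[List[str]] = []
    longest = 0
    for sent in sentences:
        new_longest = max(longest, len(sent))
        # Close the current batch if adding this sentence would exceed the budget.
        if batch and new_longest * (len(batch) + 1) > max_tokens:
            yield batch
            batch, new_longest = [], len(sent)
        batch.append(sent)
        longest = new_longest
    if batch:
        yield batch


if __name__ == "__main__":
    data = [["tok"] * n for n in [2, 2, 2, 5, 5, 1]]
    for b in token_budget_batches(data, max_tokens=8):
        print(len(b), [len(s) for s in b])
```

With this scheme, short sentences end up in batches with many sentences and long sentences in batches with few, so the amount of padded input per batch stays roughly constant.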

jlibovicky commented 6 years ago

I guess a good solution would be a new implementation of Dataset that has the same interface but does the batching differently.

varisd commented 6 years ago

We briefly discussed the issue with @tomkocmi, and support for t2t-datasets might not be necessary. However, "dynamic" batching (and possibly bucketing) would still be a nice extension of the Dataset.
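For reference, bucketing could look roughly like the sketch below, assuming sentences are grouped by length into fixed-width buckets and each bucket gets its own batch size derived from a token budget. All names here (`bucket_by_length`, `bucketed_batches`, `bucket_width`, `max_tokens`) are hypothetical and only illustrate the idea; this is not the existing Dataset interface.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def bucket_by_length(
    sentences: Iterable[List[str]], bucket_width: int
) -> Dict[int, List[List[str]]]:
    """Group sentences so that bucket k holds lengths k*width+1 .. (k+1)*width."""
    buckets: Dict[int, List[List[str]]] = defaultdict(list)
    for sent in sentences:
        # Empty sentences fall into bucket 0 together with lengths 1..width.
        bucket_id = (max(len(sent), 1) - 1) // bucket_width
        buckets[bucket_id].append(sent)
    return dict(buckets)


def bucketed_batches(
    sentences: Iterable[List[str]], bucket_width: int, max_tokens: int
) -> Iterator[List[List[str]]]:
    """Yield batches per bucket; longer buckets get fewer sentences per batch."""
    buckets = bucket_by_length(sentences, bucket_width)
    for bucket_id in sorted(buckets):
        upper_bound = (bucket_id + 1) * bucket_width
        # At least one sentence per batch, even if it exceeds the budget.
        batch_size = max(1, max_tokens // upper_bound)
        items = buckets[bucket_id]
        for start in range(0, len(items), batch_size):
            yield items[start:start + batch_size]
```

Compared with closing batches on the fly, bucketing fixes the batch size per length range up front, which keeps padding low at the cost of reordering the data.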