This is motivated by the fact that, for training stability, you can choose bucketed batching with batch_size specified as the number of tokens per batch.
This can create batches containing a small number of very long examples, which may later be truncated (e.g. in Sequence, Labeler, etc.) by a max_len flag. Truncation effectively shrinks such a batch well below the intended token budget, which can destabilize the training process.
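The interaction can be sketched as follows. This is a hypothetical, self-contained illustration, not the real batching code: bucket_by_tokens and effective_tokens are made-up helper names, and the greedy packing is only one possible bucketing strategy.

```python
def bucket_by_tokens(lengths, batch_tokens):
    """Greedily pack examples so each batch holds at most batch_tokens tokens.

    lengths: per-example lengths in tokens; batch_tokens: token budget per batch.
    """
    batches, current, current_tokens = [], [], 0
    for length in lengths:
        if current and current_tokens + length > batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches


def effective_tokens(batch, max_len):
    """Token count of a batch after each example is truncated to max_len."""
    return sum(min(length, max_len) for length in batch)


# A few very long examples dominate the token budget per batch...
lengths = [900, 60, 50, 800, 700]
batches = bucket_by_tokens(lengths, batch_tokens=1000)

# ...but after max_len truncation each batch carries far fewer tokens
# than the budget the batch size was chosen for.
for batch in batches:
    print(batch, "budgeted:", sum(batch), "after truncation:",
          effective_tokens(batch, max_len=128))
```

For instance, the batch [900, 60] is packed against a 1000-token budget but carries only 188 tokens after truncation to max_len=128, so its effective size is a fraction of what the batch_size setting intended.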