This is motivated by the fact that, for training stability, you can choose bucketed batching with batch_size specified as the number of tokens per batch.
This can create batches containing a small number of very long examples, which may later be truncated (e.g. in Sequence, Labeler, etc.) by a max_len flag. Truncation effectively shrinks such a batch well below the intended token budget, which can destabilize the training process.
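The interaction can be sketched as follows. This is a hypothetical, self-contained illustration, not the real batching code: bucket_by_tokens and effective_tokens are made-up helper names, and the greedy packing is only one possible bucketing strategy.

```python
def bucket_by_tokens(lengths, batch_tokens):
    """Greedily pack examples so each batch holds at most batch_tokens tokens.

    lengths: per-example lengths in tokens; batch_tokens: token budget per batch.
    """
    batches, current, current_tokens = [], [], 0
    for length in lengths:
        if current and current_tokens + length > batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches


def effective_tokens(batch, max_len):
    """Token count of a batch after each example is truncated to max_len."""
    return sum(min(length, max_len) for length in batch)


# A few very long examples dominate the token budget per batch...
lengths = [900, 60, 50, 800, 700]
batches = bucket_by_tokens(lengths, batch_tokens=1000)

# ...but after max_len truncation each batch carries far fewer tokens
# than the budget the batch size was chosen for.
for batch in batches:
    print(batch, "budgeted:", sum(batch), "after truncation:",
          effective_tokens(batch, max_len=128))
```

For instance, the batch [900, 60] is packed against a 1000-token budget but carries only 188 tokens after truncation to max_len=128, so its effective size is a fraction of what the batch_size setting intended.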