The current implementation does not properly use `batch_size` across lines in a file, since every line is processed individually: https://github.com/rewicks/ersatz/blob/e5ed3ebbc64ac5993093ee42bca3a282d45e556e/ersatz/split.py#L169

This means that if we have a file that contains multiple paragraphs like:

> This is a paragraph. It contains multiple sentences. We want to split these.
> The second paragraph. This should also be in the same batch.

it will take two forward passes regardless of `batch_size`, leaving GPU resources heavily under-utilized. If `batch_size` is set large enough, we should be able to process both paragraphs in a single forward pass.
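One possible shape of a fix, sketched below, is to collect candidate windows from *all* lines up front and only then slice them into `batch_size`-sized batches, so each forward pass can mix windows from different lines. This is a rough illustration, not ersatz's actual API: `collect_windows` is a hypothetical stand-in for the real per-line window extraction, and the word-split placeholder inside it is purely for demonstration.

```python
from typing import Iterable, List


def batched(items: List[str], batch_size: int) -> Iterable[List[str]]:
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def collect_windows(lines: Iterable[str]) -> List[str]:
    """Hypothetical stand-in for per-line window extraction: gather
    windows from every line before batching, instead of batching
    within each line separately."""
    windows = []
    for line in lines:
        windows.extend(line.split())  # placeholder for real window logic
    return windows


lines = [
    "This is a paragraph. It contains multiple sentences.",
    "The second paragraph. This should also be in the same batch.",
]

windows = collect_windows(lines)
# One forward pass per batch over windows from all lines combined,
# rather than at least one pass per line:
num_passes = sum(1 for _ in batched(windows, batch_size=64))
# → 1 pass for both paragraphs, since 19 windows fit in one batch of 64
```

With per-line batching, the same input would cost two forward passes no matter how large `batch_size` is; pooling windows first lets short lines share a batch.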