rewicks / ersatz

Apache License 2.0

`batch_size` does not work across lines #12

Open SirRob1997 opened 1 year ago


The current implementation does not apply `batch_size` across lines in a file, because each line is processed individually:

https://github.com/rewicks/ersatz/blob/e5ed3ebbc64ac5993093ee42bca3a282d45e556e/ersatz/split.py#L169

This means that if we have a file that contains multiple paragraphs like:

This is a paragraph. It contains multiple sentences. We want to split these.
The second paragraph. This should also be in the same batch.

Processing this file takes two forward passes (one per line) regardless of `batch_size`, which heavily under-utilizes the GPU. If `batch_size` is large enough, both lines should instead be processed in a single forward pass.
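A rough sketch of what cross-line batching could look like, independent of the actual code in `split.py`: every name here (`predict_across_lines`, `candidates_for_line`, `predict_batch`) is hypothetical and stands in for whatever ersatz uses internally. The idea is to flatten the per-line model inputs into one stream, tag each with its line index, batch over the stream, and then scatter predictions back to their lines.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Sequence


def batched(items: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield consecutive chunks of at most batch_size items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def predict_across_lines(
    lines: Sequence[str],
    candidates_for_line: Callable[[str], Iterable[Any]],  # hypothetical: builds model inputs for one line
    predict_batch: Callable[[List[Any]], Sequence[Any]],  # hypothetical: one forward pass over a batch
    batch_size: int,
) -> List[List[Any]]:
    """Return one list of predictions per input line, batching across line boundaries."""
    # Tag every model input with the index of the line it came from.
    tagged = (
        (line_idx, cand)
        for line_idx, line in enumerate(lines)
        for cand in candidates_for_line(line)
    )
    results: List[List[Any]] = [[] for _ in lines]
    for chunk in batched(tagged, batch_size):
        # A single forward pass may now contain inputs from several lines.
        preds = predict_batch([cand for _, cand in chunk])
        # Scatter predictions back to the lines they belong to.
        for (line_idx, _), pred in zip(chunk, preds):
            results[line_idx].append(pred)
    return results
```

With the two-line file above and a toy `candidates_for_line` that yields one input per whitespace-separated token, this issues a single `predict_batch` call once `batch_size` is at least the total number of inputs, instead of one call per line. Order within each line is preserved because the inputs are consumed and scattered sequentially.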