stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Make a variant of the dataloader which limits a batch to 5000 words o… #1375

Closed AngledLuffa closed 3 months ago

AngledLuffa commented 3 months ago

Make a variant of the dataloader which limits a batch to 5000 words or fewer (by default) for the Pipeline. This should help avoid OOM errors in cases such as a few very long sentences soaking up too many resources. Related issue: https://github.com/stanfordnlp/stanza/issues/1372
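
To illustrate the idea, here is a minimal sketch of word-limited batching in plain Python. It is not the code from this PR: the function name `batch_by_word_limit` and its `max_words` parameter are hypothetical, and sentences are assumed to be already tokenized into lists of words. A sentence longer than the limit is placed in a batch by itself rather than being split or dropped.

```python
from typing import Iterable, Iterator, List, Sequence


def batch_by_word_limit(
    sentences: Iterable[Sequence[str]],
    max_words: int = 5000,
) -> Iterator[List[Sequence[str]]]:
    """Group tokenized sentences into batches of at most max_words words.

    Hypothetical helper for illustration only, not Stanza's API.
    A single sentence longer than max_words becomes its own batch.
    """
    batch: List[Sequence[str]] = []
    batch_words = 0
    for sentence in sentences:
        n_words = len(sentence)
        # Close the current batch if adding this sentence would exceed the limit.
        if batch and batch_words + n_words > max_words:
            yield batch
            batch = []
            batch_words = 0
        batch.append(sentence)
        batch_words += n_words
    # Emit whatever is left over after the last sentence.
    if batch:
        yield batch


if __name__ == "__main__":
    # Synthetic sentences with the given word counts.
    lengths = [3000, 3000, 100, 6000, 10]
    fake_sentences = [["w"] * n for n in lengths]
    for i, b in enumerate(batch_by_word_limit(fake_sentences)):
        print(i, [len(s) for s in b])
```

Capping the total word count instead of the number of sentences keeps the size of each batch roughly bounded even when sentence lengths vary widely, which is the situation described above.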