segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

No lookahead models #143

Open arinaruck opened 4 hours ago

arinaruck commented 4 hours ago

Hi! Thank you for the great work! I was wondering whether it would be possible to make the SaT models trained without limited lookahead available (through Hugging Face). As you point out, SaT models are useful for more than sentence boundary detection and can also be used for paragraph splitting (thanks to the '\n' prediction objective). Based on Table 9 in Appendix 1 of the SaT paper, it seems to me that limited lookahead nudges the models to produce more (and more local) splits than their no-lookahead counterparts, which improves performance on sentence boundary detection. However, this could also decrease the quality of paragraph splitting. I would love to test this hypothesis and share my results if you are able to release the "no-lookahead" versions.
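
For concreteness, here is a rough sketch of the comparison I have in mind. It only scores a predicted paragraph segmentation against a gold one at the level of split positions; the helper names (`boundary_offsets`, `boundary_f1`) and the toy data are my own placeholders, not part of wtpsplit, and the real predictions would come from the lookahead and no-lookahead checkpoints.

```python
def boundary_offsets(segments):
    """Character offsets at which a segmentation places a split (end of text excluded)."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_f1(predicted, gold):
    """Precision, recall and F1 of predicted split points against gold split points."""
    # Both segmentations must be splits of the same underlying text.
    assert "".join(predicted) == "".join(gold), "segmentations must cover the same text"
    pred, ref = boundary_offsets(predicted), boundary_offsets(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy example (hypothetical data): the prediction adds one extra, more local split.
gold = ["First paragraph. Still first.\n", "Second paragraph."]
over_split = ["First paragraph. ", "Still first.\n", "Second paragraph."]
print(boundary_f1(over_split, gold))
```

If the hypothesis holds, I would expect the limited-lookahead models to show lower precision than the no-lookahead ones on paragraph boundaries, as in the over-split toy case, while recall stays similar.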

Thank you!