Closed: vkaul11 closed this issue 4 months ago
We use nltk.sent_tokenize to split contexts into sentences and insert the needle statements between sentences. If we did not split the context into sentences first, a needle statement could be inserted in the middle of a sentence, which would break the sentence structure.
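A minimal sketch of that idea (not the actual RULER code; the `context` and `needle` strings here are made-up examples):

```python
import random

import nltk

nltk.download("punkt", quiet=True)  # Punkt models; newer NLTK versions may need "punkt_tab"

context = (
    "The grass is green. The sky is blue. "
    "The sun is yellow. Here we go."
)
needle = "The special magic number is 42."

# Split the context on sentence boundaries, then insert the needle
# between two sentences so no sentence gets cut in half.
sentences = nltk.sent_tokenize(context)
pos = random.randint(0, len(sentences))
sentences.insert(pos, needle)
print(" ".join(sentences))
```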
So sent_tokenize identifies sentence boundaries. Can it work with any Hugging Face tokenizer, or does it only use NLTK's internal Punkt tokenizer?
It uses NLTK's Punkt tokenizer. If your text contains special tokens, Punkt cannot separate them; it treats them as ordinary text.
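A quick way to see this (the `<|endoftext|>` marker below is just an illustrative string, not anything RULER uses):

```python
from nltk.tokenize import sent_tokenize

# Punkt works on raw text only; it never consults a Hugging Face
# tokenizer, so model-specific special tokens mean nothing to it.
text = "First sentence. <|endoftext|> Second sentence."
for sentence in sent_tokenize(text):
    print(sentence)
# "<|endoftext|>" is ordinary text to Punkt: it is not recognized as
# a boundary of its own, so it stays attached to a neighboring
# sentence rather than being separated out.
```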
I don't see it being used anywhere else, but I was curious why the nltk.sent_tokenize method is used here: https://github.com/hsiehjackson/RULER/blob/main/scripts/data/synthetic/niah.py#L143. How does it help?