hsiehjackson / RULER

This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
Apache License 2.0
321 stars 17 forks source link

what was the reason to use nltk in NIAK task here #19

Closed vkaul11 closed 1 month ago

vkaul11 commented 1 month ago

I don't see it being used anywhere else but was curious why nltk.sent_tokenize method has been used here https://github.com/hsiehjackson/RULER/blob/main/scripts/data/synthetic/niah.py#L143 ? How does it help ?

hsiehjackson commented 1 month ago

We use nltk.sent_tokenize to tokenize contexts into sentences and insert needle statements between sentences. If we don't tokenize contexts into sentences, the needle statements may be inserted in the middle of a sentence which breaking the structure.

vkaul11 commented 1 month ago

so sent_tokenize identified sentence boundaries and it can work with any hugging face token? Or it uses the internal nltk punkt tokenizer only?

hsiehjackson commented 1 month ago

It uses nltk punk tokenizer. If you have some special tokens, then it cannot be separated.