Closed: vkaul11 closed this issue 4 months ago
We use nltk.sent_tokenize to split contexts into sentences and insert the needle statements between sentences. If we did not split the context into sentences first, a needle statement could be inserted in the middle of a sentence, which would break the sentence structure.
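A minimal sketch of that idea (not the actual RULER code; the `context` and `needle` strings here are made-up examples):

```python
import random

import nltk

nltk.download("punkt", quiet=True)  # Punkt models; newer NLTK versions may need "punkt_tab"

context = (
    "The grass is green. The sky is blue. "
    "The sun is yellow. Here we go."
)
needle = "The special magic number is 42."

# Split the context on sentence boundaries, then insert the needle
# between two sentences so no sentence gets cut in half.
sentences = nltk.sent_tokenize(context)
pos = random.randint(0, len(sentences))
sentences.insert(pos, needle)
print(" ".join(sentences))
```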
So sent_tokenize identifies sentence boundaries. Can it work with any Hugging Face tokenizer, or does it only use NLTK's internal Punkt tokenizer?
It uses NLTK's Punkt tokenizer. If your text contains special tokens, Punkt cannot separate them; it treats them as ordinary text.
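A quick way to see this (the `<|endoftext|>` marker below is just an illustrative string, not anything RULER uses):

```python
from nltk.tokenize import sent_tokenize

# Punkt works on raw text only; it never consults a Hugging Face
# tokenizer, so model-specific special tokens mean nothing to it.
text = "First sentence. <|endoftext|> Second sentence."
for sentence in sent_tokenize(text):
    print(sentence)
# "<|endoftext|>" is ordinary text to Punkt: it is not recognized as
# a boundary of its own, so it stays attached to a neighboring
# sentence rather than being separated out.
```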
I don't see it being used anywhere else, but I was curious why the nltk.sent_tokenize method is used here: https://github.com/hsiehjackson/RULER/blob/main/scripts/data/synthetic/niah.py#L143. How does it help?