neulab / knn-transformers

PyTorch + HuggingFace code for RetoMaton: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022), including an implementation of kNN-LM and kNN-MT
MIT License

group_texts function: Why? #15

Open HossamAmer12 opened 1 month ago

HossamAmer12 commented 1 month ago

There is a data function called `group_texts`. I understand that it concatenates the texts and splits the result into blocks of a fixed block size. I'd like to understand why you do it this way: why not pad each example to the tokenizer's max length to get a rectangular tensor? Could you please explain why you opted for this approach?

urialon commented 1 month ago

Hi Hossam,

I don't think I implemented this function myself; I believe I copied it from Hugging Face's language modeling example.

If I remember correctly, it's simply more efficient than padding: you can pack more documents into the same batch, whereas padding tokens are basically wasted compute.
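Roughly, the packing works like this (a simplified sketch in the spirit of Hugging Face's `group_texts`; the block size and field names here are illustrative, and the code in this repo may differ in details):

```python
from itertools import chain

def group_texts(examples, block_size=1024):
    # `examples` maps each field (e.g. "input_ids", "attention_mask") to a
    # list of tokenized documents; concatenate the documents end to end.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the small tail so every block is exactly `block_size` tokens long.
    total_length = (total_length // block_size) * block_size
    # Split the long token stream into fixed-size blocks; no padding is added,
    # so every position in every batch is a real token.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM, labels are a copy of the inputs (the model shifts them).
    result["labels"] = result["input_ids"].copy()
    return result
```

With padding, every short document would carry a tail of pad tokens that the model still has to process but that contribute nothing to the loss; with packing, that compute goes to real tokens instead.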

Another thing this function helps with, if I remember correctly, is setting up the sliding-window evaluation. This means that consecutive chunks overlap, but every token is predicted only once, and it then serves as context in the following chunk.
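The general idea looks something like this (a generic sketch with a hypothetical helper `sliding_window_chunks`, not the exact code in this repo; the `max_length`/`stride` windowing and the -100 label masking follow the usual Hugging Face perplexity-evaluation recipe):

```python
import torch

def sliding_window_chunks(input_ids, max_length=1024, stride=512):
    """Yield (ids, labels) pairs over a long 1-D tensor of token ids.
    Consecutive windows overlap, but tokens already scored in an earlier
    window get label -100 (ignored by the cross-entropy loss), so every
    token is predicted exactly once."""
    assert stride <= max_length
    seq_len = input_ids.size(0)
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        ids = input_ids[begin:end]
        labels = ids.clone()
        # The first `prev_end - begin` tokens are only context for this window.
        labels[: prev_end - begin] = -100
        yield ids, labels
        prev_end = end
        if end == seq_len:
            break

if __name__ == "__main__":
    ids = torch.arange(3000)          # stand-in for a long tokenized corpus
    scored = sum((labels != -100).sum().item()
                 for _, labels in sliding_window_chunks(ids))
    assert scored == ids.size(0)      # every token is scored exactly once
```

So each token only ever counts once toward perplexity, but it still provides context for predicting the tokens that follow it in the next window.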

Best, Uri
