huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

How to Batch-Encode Paired Input Sentences with Tokenizers: Seeking Clarification #1531

Closed · insookim43 closed this issue 1 week ago

insookim43 commented 1 month ago

Hello.

I'm using a tokenizer with TemplateProcessing to encode sentence pairs in batches via encode_batch. There's a confusing part about the method requiring two lists, one for the A sentences and one for the B sentences.

According to the guide documentation: "To process a batch of sentences pairs, pass two lists to the Tokenizer.encode_batch method: the list of sentences A and the list of sentences B."

Since the guide says to pass two lists, I expected the input to look like [[A1, A2], [B1, B2]], which would encode to the pairs {A1, B1}, {A2, B2}.

However, the actual input expects the individual pairs to be batched together, not the sentences split into separate A and B lists. That is, the input should be [[A1, B1], [A2, B2]] to encode as {A1, B1}, {A2, B2}.

I've also confirmed that the length of the input list passed to encode_batch grows with the number of pairs in the batch, one element per pair.

Since the guide says to pass the list of sentences A and the list of sentences B, this is where my confusion comes from. If I've misunderstood anything, could you help clarify this point so I can understand it better?
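
Here is a minimal runnable sketch of the behavior I'm describing. The tokenizer below is just an illustrative toy trained on the example sentences (not my real setup), only so the snippet runs end to end:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

sentences_a = ["This is sentence A1.", "This is sentence A2."]
sentences_b = ["This is sentence B1.", "This is sentence B2."]

# Toy tokenizer trained on the example sentences, just so encoding works below.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train_from_iterator(sentences_a + sentences_b, trainer)

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# The form that works for me: one (A, B) pair per element of the batch.
pair_batch = list(zip(sentences_a, sentences_b))  # [(A1, B1), (A2, B2)]
encodings = tokenizer.encode_batch(pair_batch)

print(len(encodings))       # 2 -> one Encoding per (A, B) pair
print(encodings[0].tokens)  # [CLS] ...A1 tokens... [SEP] ...B1 tokens... [SEP]
```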

ArthurZucker commented 1 week ago

Hey! That seems intuitively weird to me, could you share a snippet of what you are using? To me, you should pass [A1, A2], [B1, B2] and get [(A1, B1), (A2, B2)].
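
For reference, a small sketch reusing the toy tokenizer and lists from the snippet above: zipping the two lists into per-example pairs does give exactly that result, one Encoding per (A, B) pair:

```python
# [A1, A2] and [B1, B2], zipped into per-example pairs -> [(A1, B1), (A2, B2)]
encodings = tokenizer.encode_batch(list(zip(sentences_a, sentences_b)))

for enc in encodings:
    # Each Encoding covers one (A, B) pair; type_ids mark the segment each
    # token belongs to (0 for the A sentence, 1 for the B sentence).
    print(enc.tokens)
    print(enc.type_ids)
```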