ml-explore / mlx-examples

Examples in the MLX framework
MIT License

CLIP Tokenizer unable to take text of unequal length (or token length) #787

Open kechan opened 3 months ago

kechan commented 3 months ago

This is about clip/tokenizer.py.

The example:

tokenizer(["a photo of a cat", "a photo of a dog"])

only works because the two strings happen to have exactly the same number of tokens. This one will fail:

tokenizer(["hello world", "how you doing?"])

ValueError: Initialization encountered non-uniform length.

kechan commented 3 months ago

the tokenizer is doing this:

return mx.array([self.tokenize(t, prepend_bos, append_eos) for t in text])

If text is a List[str], this builds an mx.array from the per-string token arrays, and mx.array doesn't accept rows of non-uniform length.
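
A minimal repro of the underlying constraint, independent of the tokenizer (a sketch for illustration only):

    import mlx.core as mx

    # Rows of equal length are fine:
    mx.array([[1, 2, 3], [4, 5, 6]])

    # Ragged rows should raise the error quoted above:
    # ValueError: Initialization encountered non-uniform length.
    mx.array([[1, 2, 3], [4, 5]])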

I can "hack" this to work by padding the shorter sequences here with EOT. I went over to HF and they seem to do the same "hack":

pad_token="<|endoftext|>", # hack to enable padding

So padding with EOT would just keep things consistent with HF.
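
For reference, this is roughly how the HF tokenizer behaves with padding enabled (a sketch assuming the transformers package and the openai/clip-vit-base-patch32 checkpoint):

    from transformers import CLIPTokenizer

    hf_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    batch = hf_tok(["hello world", "how you doing?"], padding=True)
    # The shorter sequence is right-padded with the <|endoftext|> id.
    print(batch["input_ids"])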

I have these changes done locally. I am not sure if this repo is meant mainly as an educational example of how to use MLX rather than a completely robust and faithful port of the Hugging Face tokenizer, but otherwise I can open a PR if this fix helps.

awni commented 3 months ago

I think it would be fine to pad in the tokenizer. Usually we try to keep the examples simple so they are useful for educational purposes / starting points for building from. In this case I don't think it will add much complexity to pad in the tokenizer.

But I guess the clip_loss would be a bit off if you don't consider the padding?
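
One possible way to account for the padding downstream would be to build a mask over the padded positions (a hedged sketch only; the helper name is hypothetical and not part of the example, and it assumes EOT is reused as the pad token as discussed above):

    import mlx.core as mx

    def non_pad_mask(token_ids: mx.array, eos_id: int) -> mx.array:
        # True for real tokens up to and including the first EOT, False for the
        # extra EOT tokens appended as padding (padding reuses the EOT id).
        is_eos = token_ids == eos_id
        eos_seen = mx.cumsum(is_eos.astype(mx.int32), axis=-1)
        first_eos = mx.logical_and(is_eos, eos_seen == 1)
        return mx.logical_or(eos_seen == 0, first_eos)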

kechan commented 3 months ago

Understood. Along the way, I took a quick look at the tokenizer and figured the goal probably wasn't to replicate HF completely. I followed what HF does closely, fixed it locally, and it worked fine. The HF tokenizer uses end-of-text to pad, so this is at least as correct as HF seems to have it.

I haven't gotten into training yet, so I am not sure whether the clip loss needs to take the padding into account.

awni commented 3 months ago

I haven't gotten into training yet, so I am not sure whether the clip loss needs to take the padding into account.

Right, but we do evaluate the loss in the current example even when not training. I'm not entirely sure why. Maybe we should remove it, otherwise it would be incorrect in the padded case.

x4080 commented 3 months ago

@kechan where do you add the pad_token="<|endoftext|>", ? Thanks

kechan commented 3 months ago

@x4080

Find "class CLIPTokenizer", and then

  1. Add a pad_token property:

         @property
         def pad_token(self):
             return self.vocab[self.eos]

  2. Change the list branch of def tokenize:

         if isinstance(text, list):
             # return mx.array([self.tokenize(t, prepend_bos, append_eos) for t in text])
             # Tokenize each text individually
             tokenized_texts = [self.tokenize(t, prepend_bos, append_eos) for t in text]

             # Find the maximum token length in the batch
             max_len = max(len(t) for t in tokenized_texts)

             # Pad every tokenized text to the maximum length with the pad (EOT) token
             padded_tokenized_texts = []
             for t in tokenized_texts:
                 padded_text = mx.concatenate(
                     [t, mx.array([self.pad_token] * (max_len - len(t)))], axis=0
                 )
                 padded_tokenized_texts.append(padded_text)

             return mx.array(padded_tokenized_texts, dtype=mx.int32)
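
With those changes, the call that failed at the top of the issue should return a rectangular array (a sketch of the expected behavior, not robustly tested):

    tokens = tokenizer(["hello world", "how you doing?"])
    # Shape should be (2, max_len); the shorter row is right-padded with <|endoftext|>.
    print(tokens.shape)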

I haven't robustly tested this padding. I think this implementation closely mirrors what Hugging Face does, and, apart from some syntax errors, it came almost entirely out of Copilot. ;-)

x4080 commented 3 months ago

@kechan Thanks a lot