kechan opened 3 months ago
the tokenizer is doing this:
return mx.array([self.tokenize(t, prepend_bos, append_eos) for t in text])
if text is a List[str], and mx.array doesn't accept a non-uniform list of numbers.
I can "hack" this to work by padding this here with EOT. I went over to HF and they seemed to also "hack" this??
pad_token="<|endoftext|>", # hack to enable padding
so padding with EOT just keeps this consistent with HF.
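For reference, this is easy to check on the transformers side (a quick sanity check, assuming the standard openai/clip-vit-base-patch32 checkpoint):

    from transformers import CLIPTokenizer

    hf_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    print(hf_tok.pad_token)  # '<|endoftext|>'
    # padding=True pads the shorter prompt with the eot id
    batch = hf_tok(["hello world", "how you doing?"], padding=True)
    print(batch["input_ids"])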
I have these changes done locally. I'm not sure whether this repo is meant as an educational example of how to use MLX rather than a completely robust and faithful port of Hugging Face; if this fix helps, I can open a PR.
I think it would be fine to pad in the tokenizer. Usually we try to keep the examples simple so they are useful for educational purposes / starting points for building from. In this case I don't think it will add much complexity to pad in the tokenizer.
But I guess the clip_loss would be a bit off if you don't consider the padding?
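If it helps, here is a minimal sketch of padding-aware pooling (hidden, tokens, and eos_id are hypothetical names, not from this repo): since the text encoder is causal and CLIP pools at the EOS position, taking the hidden state at the first eot per row means the eot padding never leaks into the pooled feature, so the loss stays correct:

    import mlx.core as mx

    def pool_at_first_eos(hidden, tokens, eos_id):
        # hidden: (B, L, D) text-encoder outputs; tokens: (B, L) padded ids
        # argmax over the boolean mask returns the first True per row
        # (NumPy-style tie-breaking)
        eos_pos = mx.argmax(tokens == eos_id, axis=-1)
        # with causal attention, this position never attends to the padding
        return hidden[mx.arange(hidden.shape[0]), eos_pos]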
Understood. Along the way I took a quick look at the tokenizer and figured a complete replication of HF probably wasn't the goal. I followed closely what HF does and fixed it locally, and it worked fine. The HF tokenizer uses end-of-text to pad, so this is at least as correct as HF seems to have it.
I wasn't into training yet, so I'm not sure whether the clip loss needs to take the padding into account.
Right, but we do evaluate the loss in the current example even when not training. I'm not entirely sure why. Maybe we should remove it; otherwise it would be incorrect in the padded case.
@kechan where do you add the pad_token="<|endoftext|>"? Thanks
@x4080
Find "class CLIPTokenizer", and then
    @property
    def pad_token(self):
        # follow HF and reuse the end-of-text token id for padding
        return self.vocab[self.eos]
and change def tokenize:
    if isinstance(text, list):
        # was: return mx.array([self.tokenize(t, prepend_bos, append_eos) for t in text])
        # tokenize each text individually
        tokenized_texts = [self.tokenize(t, prepend_bos, append_eos) for t in text]
        # find the maximum length in the batch
        max_len = max(len(t) for t in tokenized_texts)
        # pad each tokenized text to the maximum length with the pad token
        padded_tokenized_texts = []
        for t in tokenized_texts:
            if len(t) < max_len:
                padding = mx.array([self.pad_token] * (max_len - len(t)))
                t = mx.concatenate([t, padding], axis=0)
            padded_tokenized_texts.append(t)
        return mx.stack(padded_tokenized_texts).astype(mx.int32)
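With this change, the batch below should come back as one uniform array (a quick sanity check, not heavily tested):

    batch = tokenizer(["hello world", "how you doing?"])
    print(batch.shape)  # (2, max_len); the shorter row is padded with <|endoftext|>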
I haven't robustly tested this padding. I think this implementation closely mirrors what Hugging Face does, and aside from some syntax errors it came almost entirely out of Copilot ;-).
@kechan Thanks a lot
This is about clip/tokenizer.py.
The example:
tokenizer(["a photo of a cat", "a photo of a dog"])
only works because both strings happen to produce exactly the same number of tokens. This one will fail:
tokenizer(["hello world", "how you doing?"])
ValueError: Initialization encountered non-uniform length.