ml-explore / mlx-examples

Examples in the MLX framework
MIT License

A simple enhancement, in dataset creation time #811

Closed mustangs0786 closed 2 weeks ago

mustangs0786 commented 1 month ago

Hi team, I recently used the Phi-3 model to fine-tune on a binary classification dataset. Despite training it for a good amount of time, when I try to use it, it gives me output but fails to stop after giving the class name. Problem: the current code has no concept of adding the eos_token, so after giving its output the model starts repeating the prompt back.

So I made a slight change in the dataset creation function, and after training the model again it worked like a charm. I now get the one-word output I want, with no more unnecessary extra text.

```python
from pathlib import Path

from transformers import PreTrainedTokenizer


# Dataset is the existing base class that loads the examples into self._data.
class ChatDataset(Dataset):

    def __init__(self, path: Path, tokenizer: PreTrainedTokenizer):
        super().__init__(path)
        self._tokenizer = tokenizer

    def __getitem__(self, idx: int):
        messages = self._data[idx]["messages"]
        text = self._tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        # New addition: append the EOS token so the model learns where to stop.
        text = text + self._tokenizer.eos_token
        return text
```

Requesting the repo owners to kindly look into this and update the code. Thanks!

awni commented 3 weeks ago

Thanks. It looks like most tokenizers do not append the EOS token by default, so we can append it manually when fine-tuning. I'm just wondering whether we should make it an option or not. Most likely the typical case will want the EOS token appended to each example.
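
If it does become an option, a rough sketch of how it could look is below. This is only an illustration, not code from the repo: the helper name `with_eos` and the `append_eos` flag are hypothetical. It assumes the EOS token is available as a plain string (as with `tokenizer.eos_token`), and it skips the append when the text already ends with that string, since some chat templates may emit it themselves.

```python
from typing import Optional


def with_eos(text: str, eos_token: Optional[str], append_eos: bool = True) -> str:
    """Return `text` with `eos_token` appended at most once.

    Hypothetical helper: `append_eos` stands in for a possible
    fine-tuning option and defaults to the typical case (append).
    """
    # Nothing to do if the option is off or the tokenizer has no EOS token.
    if not append_eos or not eos_token:
        return text
    # Avoid a doubled EOS if the template already ended the text with one.
    if text.endswith(eos_token):
        return text
    return text + eos_token


if __name__ == "__main__":
    print(with_eos("positive", "<|endoftext|>"))  # positive<|endoftext|>
```

In `__getitem__` this would replace the manual concatenation, e.g. `text = with_eos(text, self._tokenizer.eos_token)`, with the flag passed through from wherever the training options are read.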