Open Karoljv opened 3 weeks ago
Wait, did you set pad_token == eos_token during finetuning?
I did, since Unsloth set the pad_token and printed something like this: `OPI-PG/Qra-7b does not have a padding token! Will use pad_token "unk".` I don't know which pad_token it is referring to, but this model does have a pad_token: https://huggingface.co/OPI-PG/Qra-7b/blob/main/tokenizer_config.json
OK, I let the finetuning run with this "unk" pad token and I don't have problems with endless generation now. I also let Unsloth fix the tokenizer by setting fix_tokenizer = True. I read on a forum that if the pad token and the eos token are the same, the model tends not to learn the eos token properly, which results in endless generation.
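The failure mode that forum described can be sketched in plain Python. This is only an illustration, not Unsloth's actual collator code: the token ids follow the Llama convention, and the masking rule mimics what a causal-LM data collator typically does (padding positions are set to -100 so the loss ignores them).

```python
# Toy illustration of why pad_token == eos_token can erase the EOS signal.
# Token ids follow the Llama convention: 0 = <unk>, 1 = <s>, 2 = </s>.
UNK_ID, BOS_ID, EOS_ID = 0, 1, 2
IGNORE_INDEX = -100  # loss is not computed on these positions

def build_labels(input_ids, pad_id):
    """Copy input_ids, masking every pad position (collator-style)."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# A training example: <s> ... answer ... </s>, then padding to length 7.
sequence = [BOS_ID, 37, 42, 99, EOS_ID]

# Case 1: pad_token == eos_token -> the real EOS is masked out too,
# so the model never gets a loss signal for emitting EOS.
labels_eos_pad = build_labels(sequence + [EOS_ID, EOS_ID], pad_id=EOS_ID)

# Case 2: a distinct pad token (here <unk>) -> EOS survives in the labels.
labels_unk_pad = build_labels(sequence + [UNK_ID, UNK_ID], pad_id=UNK_ID)

print(labels_eos_pad)  # [1, 37, 42, 99, -100, -100, -100]  EOS gone
print(labels_unk_pad)  # [1, 37, 42, 99, 2, -100, -100]     EOS kept
```

With pad == eos, every position holding the eos id looks like padding, so the genuine end-of-answer token is masked out of the labels along with the padding.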
Oh wait, I thought we auto-set the pad_token :) Did you manually set it?
I have a problem where, after finetuning, the model does not stop generating at inference time: it keeps producing further answers even after it has already answered the question. The model is based on Llama 2. It looks like the model has problems with the eos token somehow.
Here is my tokenizer:
```
LlamaTokenizerFast(name_or_path='OPI-PG/Qra-7b', vocab_size=32000, model_max_length=4096, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
```

I have set padding to 'right' and set tokenizer.pad_token = tokenizer.eos_token.
My formatting func looks like this:
```python
def create_conversation(sample) -> dict:
    strip_characters = "\"'"
    return {
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"{sample['instruction'].strip(strip_characters)} "
                                        f"{sample['input'].strip(strip_characters)}"},
            {"role": "assistant", "content": f"{sample['output'].strip(strip_characters)}"},
        ]
    }
```
Here is my tokenizer.chat_template (without setting it manually I got an error):
```python
tokenizer.chat_template = "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = '' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\n' + system_message + '\n<</SYS>>\n\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'system' %}{{ '<<SYS>>\n' + content.strip() + '\n<</SYS>>\n\n' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}"
```
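To sanity-check the template, the string it should produce for one system/user/assistant exchange can be reconstructed by hand. The concatenation below just replays the template's branches; the message contents are made up:

```python
# Manually replay the chat template for one system/user/assistant exchange.
# The template folds the system prompt into the first user turn inside
# <<SYS>> tags, wraps the user turn in [INST] ... [/INST], and ends every
# assistant turn with the EOS token.
bos_token, eos_token = "<s>", "</s>"

system = "You are a helpful assistant."
user = "What is the capital of France?"
assistant = "The capital of France is Paris."

rendered = (
    bos_token
    + "[INST] "
    + "<<SYS>>\n" + system + "\n<</SYS>>\n\n"
    + user
    + " [/INST]"
    + " " + assistant + " " + eos_token
)

# The crucial part for training: every assistant turn must end with </s>,
# otherwise the model never sees EOS after an answer.
print(rendered.endswith(assistant + " " + eos_token))  # True
```

So the template itself does append eos_token after each answer; if the model still never emits it, the token is presumably being lost later (e.g. masked out as padding).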
The output of generation looks like this (the same sentence repeated over and over; in English it says "The opening ceremony of the 2024 Summer Olympics in Paris was controversial due to drag queens recreating Leonardo da Vinci's painting The Last Supper"):

```
Ceremonia otwarcia Letnich Igrzysk Olimpijskich 2024 w Paryżu była kontrowersyjna ze względu na odtworzenie obrazu Leonarda da Vinci Ostatnia Wieczerza przez drag queens. \n\nCeremonia otwarcia Letnich Igrzysk Olimpijskich 2024 w Paryżu była kontrowersyjna ze względu na odtworzenie obrazu Leonarda da Vinci Ostatnia Wieczerza przez drag queens. \n\nCeremonia otwarcia Letnich Igrzysk Olimpijskich 2024 w Paryżu była kontrowersyjna ze względu na odtworzenie obrazu Leonarda da Vinci Ostatnia Wieczerza przez drag queens. \n\nCeremonia otwarcia Letnich Igrzysk Olimpijskich 2024 w Paryżu była kontrowersyjna ze względu na odtworzenie obrazu Leonarda da Vinci Ostatnia Wieczerza przez drag queens. \n\nCeremonia ot
```
It keeps repeating the same answer. Why is that?
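For what it's worth, generation only stops when `generate()` knows the right EOS id, so it is worth checking `tokenizer.eos_token_id` at inference time and passing it explicitly via `eos_token_id=` if in doubt. The helper below is a plain-Python sketch of the stop check `generate()` applies internally; the ids are illustrative:

```python
# Sketch of the per-sequence stop condition during generation:
# a sequence ends as soon as eos_token_id is produced.
EOS_ID = 2  # Llama-2 </s>; verify with tokenizer.eos_token_id

def truncate_at_eos(generated_ids, eos_id=EOS_ID):
    """Cut a generated id sequence at the first EOS (exclusive)."""
    if eos_id in generated_ids:
        return generated_ids[: generated_ids.index(eos_id)]
    return generated_ids  # no EOS emitted: the endless-generation symptom

print(truncate_at_eos([37, 42, 99, 2, 55, 66]))  # [37, 42, 99]
print(truncate_at_eos([37, 42, 99]))             # [37, 42, 99]
```

If the model genuinely never emits the eos id, the problem is on the training side (EOS masked out of the labels when pad == eos); if it does emit it but generation keeps going, double-check which id is passed to `generate()` as `eos_token_id`.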