Open LennartBuerger opened 3 days ago
Hi,
Thanks for catching this!
This definitely wasn't the case for the runs in the paper - the generation stopped at the stop_token_id. Probably something went wrong during my code refactoring before I released it, and I missed it in my tests. I'll take a look and update soon.
Hi! I noticed that for instruct models the `stop_token_id` in `generate_model_answers.py` is set to `None`. This causes instruct models (e.g. Mistral-7B-Instruct-v0.2) to keep generating text after the end-of-sequence token (e.g. `</s>`) has been produced. Most of the time the instruct model starts a conversation with itself that diverges more and more from the original question, and it only stops once the maximum number of tokens is reached. This is inefficient but does not break anything by itself. However, my worry is that when probing the tokens at indices -8, -7, -6, -5, -4, -3, -2 and -1 in `probe_all_layers_and_tokens.py`, the last tokens of that very long response are probed rather than the tokens just before the `</s>` token. Is this correct? If so, I would suggest adding these lines to `generate_model_answers.py`:
```python
if args.model == "mistralai/Mistral-7B-Instruct-v0.2":
    print("EOS Token ID set")
    stop_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("</s>")]
```
The analogous change for Llama-3-8B-Instruct would use `"<|eot_id|>"`. When `eos_token_id` in `model.generate()` is set to anything other than `None`, the transformers library expects an attention mask to be passed to the model along with the input, so the `tokenize` function also has to be modified to return the attention mask in addition to the tokenized input. Fixing this requires changing quite a few small things in the code, which is why I wanted to discuss the issue here first.
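For concreteness, here is a minimal, self-contained sketch of the kind of change described above. The standalone setup, the `tokenize` helper and the chat-template prompt are assumptions made for illustration and do not mirror the actual structure of `generate_model_answers.py`; only the `stop_token_id` lists and the `attention_mask`/`eos_token_id` arguments to `model.generate()` correspond to the suggestion.

```python
# Hedged sketch only: helper names and the surrounding setup are assumed, not taken from the repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # or "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# Choose the stop token(s) per model; keep None as the fallback for base models.
if MODEL_NAME == "mistralai/Mistral-7B-Instruct-v0.2":
    stop_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("</s>")]
elif MODEL_NAME == "meta-llama/Meta-Llama-3-8B-Instruct":
    stop_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
else:
    stop_token_id = None

def tokenize(question: str):
    """Return input ids *and* the attention mask so generate() can be given eos_token_id."""
    messages = [{"role": "user", "content": question}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # The chat template already contains the special tokens, so don't add them a second time.
    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    return enc["input_ids"], enc["attention_mask"]

input_ids, attention_mask = tokenize("What is the capital of France?")
output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=64,
    eos_token_id=stop_token_id,        # generation now stops at </s> / <|eot_id|>
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```

With stop tokens set this way the generated answers end at the model's end-of-turn marker, so the last-token positions probed in `probe_all_layers_and_tokens.py` should again fall at the end of the actual answer rather than somewhere in a self-continued conversation.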