turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Non stop generation after update to v0.1.1.0 and latest flash attention #484

Closed waterangel91 closed 3 weeks ago

waterangel91 commented 3 weeks ago

I have just updated to the latest version today, and all my models seem to generate non-stop garbage even though I have explicitly put in the EOS. I am having this issue with all my models: mistral, mixtral, llama3.

I'd appreciate it if anyone can tell me what I did wrong; the same code worked fine before the upgrade.

Selected snippet from my code below:

```python
settings = ExLlamaV2Sampler.Settings()
settings.temperature = temperature
settings.top_k = 50
settings.top_p = 0.8
settings.min_p = 0.35
settings.token_repetition_penalty = 1.1

generate_text = generator.generate(
    prompt=prompt,
    max_new_tokens=self.max_tokens - len(input_ids[0]),
    gen_settings=settings,
    stop_conditions=self.eos_token_str if self.eos_token_str else None,
    completion_only=True,
)
```

Sample of the non-stop generation:

[Screenshot from 2024-06-01 22-19-42]

turboderp commented 3 weeks ago

stop_conditions needs to be a list, so if you're passing a single string, you should pass it as stop_conditions = [eos_token_str].

Be aware that tokens like EOS are usually defined as "special tokens" by the model. So they're not decoded by default and don't become strings (and therefore couldn't trigger a stop condition) unless you specify decode_special_tokens = True. However, if you do have a stop token it's usually better to provide its ID as a stop condition instead. This eliminates any ambiguity since a string like "" could also appear naturally in the output text stream. You can get the ID from the string with eos_token_id = tokenizer.single_id(eos_token_str) and then pass it with stop_conditions = [eos_token_id].
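A minimal sketch of the token-ID approach described above, assuming the `generator`, `settings` and `prompt` objects from the snippet earlier in the thread (the token budget of 512 is just a placeholder):

```python
# Resolve the EOS string to its token ID so the stop condition matches the
# raw token rather than decoded text.
eos_token_id = tokenizer.single_id(eos_token_str)

output = generator.generate(
    prompt=prompt,
    max_new_tokens=512,               # placeholder budget
    gen_settings=settings,
    stop_conditions=[eos_token_id],   # must be a list
    completion_only=True,
)
```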

waterangel91 commented 3 weeks ago

Thank you for the advice. After setting decode_special_tokens = True as well as fixing the stop_conditions, the issue stopped.
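For reference, a sketch of that fix with a string stop condition; this assumes `decode_special_tokens` is passed as a keyword argument to `generate()`, based on the advice above, and the other names mirror the earlier snippet:

```python
generate_text = generator.generate(
    prompt=prompt,
    max_new_tokens=self.max_tokens - len(input_ids[0]),
    gen_settings=settings,
    stop_conditions=[self.eos_token_str],  # a list, not a bare string
    decode_special_tokens=True,            # decode the EOS token so the string can match
    completion_only=True,
)
```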

Since I have you here, can I ask if there is any way to access this result with the non-streaming generator? Or should I just use the streaming generator for non-streaming use? Is there a speed difference between streaming and non-streaming mode?

```
{'job': ExLlamaV2DynamicJob #2,
 'stage': 'streaming',
 'eos': True,
 'serial': 2,
 'eos_reason': 'stop_token',
 'full_completion': 'The current President of the United States is Joe Biden. He assumed office on January 20, 2021.',
 'new_tokens': 27,
 'prompt_tokens': 42,
 'time_enqueued': 7.724761962890625e-05,
 'time_prefill': 0.040158987045288086,
 'time_generate': 0.1675558090209961,
 'cached_pages': 0,
 'cached_tokens': 0}
```

turboderp commented 3 weeks ago

The generate function just uses streaming under the hood, so there's not going to be any speed difference.

You can get the last result dict returned if you add return_last_result = True to the call to generate. The return value will become a tuple:
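A sketch of that call and the resulting tuple; the keyword name is taken from the comment above, and the unpacked variable names are illustrative:

```python
completion, last_results = generator.generate(
    prompt=prompt,
    max_new_tokens=512,
    gen_settings=settings,
    stop_conditions=[eos_token_id],
    completion_only=True,
    return_last_result=True,   # as described above: also return the result dict
)

print(last_results["eos_reason"])   # e.g. "stop_token", per the dict shown earlier
print(last_results["new_tokens"])
```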

waterangel91 commented 3 weeks ago

Thank you very much. Somehow I am still encountering issues with non-stop generation here and there, so I ended up downgrading back to v0.0.21 for now. Maybe I will try again with the next version.

turboderp commented 3 weeks ago

All the code you were using in v0.0.21 is still there in v0.1.1. You don't have to use the dynamic generator.

waterangel91 commented 3 weeks ago

Thanks, I just upgraded to v0.1.3 and it seems like all my old code works without any modification. Will close this thread and re-explore the dynamic generator next time.