Running into this issue when trying to generate with more than roughly 130 tokens of context on my M40. Generation works fine for small contexts, but errors out once the prompt goes past about 130 tokens. max_length for generation defaults to 150, if that affects things. My 2080 Ti does not have this issue and will happily generate from larger prompts.
Running the cuda branch because the M40 is too old for the triton branch to run. Model: https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g
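For reference, here's roughly how I'm hitting it. This is a minimal sketch, not my exact script: `try_generate` and the filler prompt are just for illustration, and loading the 4-bit checkpoint through the cuda branch is omitted since that part works fine.

```python
import torch
from transformers import LlamaTokenizer

# Tokenizer from the model repo linked above; the model itself is loaded
# with the cuda-branch 4-bit loader (omitted here).
tokenizer = LlamaTokenizer.from_pretrained(
    "anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g"
)

def try_generate(model, prompt_tokens):
    # Build a prompt of exactly `prompt_tokens` tokens by repeating filler
    # text and truncating to the desired length.
    ids = tokenizer("hello " * prompt_tokens, return_tensors="pt").input_ids
    ids = ids[:, :prompt_tokens].cuda()
    with torch.no_grad():
        out = model.generate(ids, max_length=150, do_sample=True)
    return tokenizer.decode(out[0])

# try_generate(model, 100)  -> works on the M40
# try_generate(model, 135)  -> errors out on the M40, fine on the 2080 Ti
```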