Closed: Roerib closed this issue 1 year ago
Make sure that you have the new quantized models. Links to them can be found on the wiki: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model
Yeah, I pulled the model from here: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
This also happens with other models in 8-bit mode.
Pygmalion-6B (dev):
Traceback (most recent call last):
File "C:\Users\Robert\Desktop\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 64, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "C:\Users\Robert\Desktop\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 222, in generate_with_callback
shared.model.generate(**kwargs)
File "C:\Users\Robert\Desktop\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\Robert\Desktop\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Users\Robert\Desktop\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2560, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
Output generated in 3.33 seconds (0.00 tokens/s, 0 tokens, context 345)
Traceback (most recent call last):
File "C:\Users\Robert\Desktop\oobabooga-windows\installer_files\env\lib\site-packages\gradio\routes.py", line 393, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\Robert\Desktop\oobabooga-windows\installer_files\env\lib\site-packages\gradio\blocks.py", line 1108, in process_api
result = await self.call_function(
File "C:\Users\Robert\Desktop\oobabooga-windows\installer_files\env\lib\site-packages\gradio\blocks.py", line 929, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\Robert\Desktop\oobabooga-windows\installer_files\env\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\Robert\Desktop\oobabooga-windows\installer_files\env\lib\site-packages\anyio_backends_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "C:\Users\Robert\Desktop\oobabooga-windows\installer_files\env\lib\site-packages\anyio_backends_asyncio.py", line 867, in run
result = context.run(func, args)
File "C:\Users\Robert\Desktop\oobabooga-windows\installer_files\env\lib\site-packages\gradio\utils.py", line 490, in async_iteration
return next(iterator)
File "C:\Users\Robert\Desktop\oobabooga-windows\text-generation-webui\modules\chat.py", line 184, in regenerate_wrapper
for _history in chatbot_wrapper(last_internal[0], max_new_tokens, do_sample, temperature, top_p, typical_p, repetition_penalty, encoder_repetition_penalty, top_k, min_length, no_repeat_ngram_size, num_beams, penalty_alpha, length_penalty, early_stopping, seed, name1, name2, context, check, chat_prompt_size, chat_generation_attempts, regenerate=True):
File "C:\Users\Robert\Desktop\oobabooga-windows\text-generation-webui\modules\chat.py", line 144, in chatbot_wrapper
cumulative_reply = reply
UnboundLocalError: local variable 'reply' referenced before assignment
In the logs you posted, the real error is above that one: TypeError: vecquant4matmul(): incompatible function arguments.
The reply error is being caused by that one. Were you able to compile the CUDA kernels from GPTQ-for-LLaMa?
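For illustration, here is a rough, self-contained sketch (not the webui's actual code) of that failure chain: the generation worker hits the real error, logs it, and yields nothing, so 'reply' is never assigned and the only thing the UI surfaces is the later UnboundLocalError.

import traceback

def generate_reply_sketch(prompt):
    # Stand-in for the generation worker; the raise mimics the real failure
    # (the vecquant4matmul TypeError) happening before any token is produced.
    try:
        raise TypeError("vecquant4matmul(): incompatible function arguments")
        yield "never reached"
    except TypeError:
        traceback.print_exc()  # the real error only shows up in the console log

def chatbot_wrapper_sketch(prompt):
    for reply in generate_reply_sketch(prompt):  # zero iterations: nothing was yielded
        pass
    cumulative_reply = reply  # UnboundLocalError: 'reply' referenced before assignment
    return cumulative_reply

chatbot_wrapper_sketch("hello")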
Make sure that your copy of GPTQ-for-LLaMa has been switched to the cuda branch; the main branch is no longer compatible with Windows.
Although, if the reply error is happening with 8-bit, it may be a separate issue.
I am unable to replicate these errors. There have been a lot of updates recently. You may need to fully reinstall the webui to ensure that there are no issues. Make sure that you use the latest version of the installer here: https://github.com/oobabooga/one-click-installers
If you don't want to have to download all the conda packages again, then you can do this:
python -m pip uninstall quant_cuda
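After uninstalling, the quant_cuda kernel has to be rebuilt before 4-bit models will load again; either re-run the one-click installer or build it by hand. A rough sketch of the manual route, assuming the qwopqwop200/GPTQ-for-LLaMa repository and its setup_cuda.py build script (adjust to whichever repo the wiki currently points at):

git clone -b cuda https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
python setup_cuda.py install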
I reinstalled and now it just works.
Hey @Roerib! You have a GTX 1070 and can run TextGen with it, right? Can you chat and use all of its functions? If so, can you help me please? I have the same card, but something is wrong: after loading the model I get no answer on the chat tab... Thanks!
@No565 I remember I was able to get it working, but I couldn't use anything larger than a 13B model. I don't have the card in my system anymore, so I can't help you further. Maybe your model is too big; use a smaller model or use quantization.
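For reference, a minimal sketch of what 8-bit loading looks like with Hugging Face transformers and bitsandbytes (both assumed installed, with a CUDA GPU available); the model id is only an example, and inside the webui the equivalent is the --load-in-8bit flag.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PygmalionAI/pygmalion-6b"  # example id; substitute your own model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # let accelerate place layers on GPU/CPU as needed
    load_in_8bit=True,   # roughly halves VRAM compared to fp16
)

inputs = tokenizer("Hello,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))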
Describe the bug
The model loads but won't output anything.
Reproduction
Loaded LLaMA 7B in 4-bit mode.