stratus-ss closed this issue 11 months ago.
Same issue on Windows: no responses are received. I deleted the h2oGPT environment from conda and reinstalled everything from scratch, and still have the same issue.
I prompted llama-2 three times with "Hi" and got no responses, even though you can see the load spike on the CPU and GPU.
However, I usually keep a full copy of the environment folder from G:\Users\mazen\miniconda3\envs.
I returned to the old environment and it works fine (with a few issues), but responses do come through.
Update: when I went back to https://github.com/h2oai/h2ogpt/tree/a1b4716f62e512af678c2e6ea863e2131ffbaa4a, everything worked fine.
@MSZ-MGS Can you provide your run line that does this?
Hi sir, nice to see you again. I'm not sure if this is what you mean by "run line":
python generate.py --base_model='llama' --model_path_llama=AddedModels\llama-2-7b-chat\llama-2-7b-chat.Q5_K_M.gguf --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path --llamacpp_dict="{'n_gpu_layers':35,'n_batch':128}" --do_sample=True --top_k_docs=-1 --temperature=0.7 --top_p=0.9 --max_seq_len=4095 --share=False
@MSZ-MGS Thanks, pushed a fix for that. I changed things recently so the LLM code only pushes new tokens, not the prompt, and a few LLM types were not updated. I pushed a fix for llama.cpp models, and I know the openai/vllm/tgi/torch/exllama/replicate ones already work, so just the gpt4all ones are left to check.
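For context, here is a minimal sketch of what "only push new tokens and not prompt" means for a streaming backend. This is not h2oGPT's actual code; the function name and the shape of the iterator are assumptions for illustration only:

def stream_new_tokens_only(prompt, cumulative_outputs):
    # cumulative_outputs is assumed to be an iterator of the model's full
    # text so far; some backends (e.g. llama.cpp) echo the prompt in it.
    emitted = 0
    for full_text in cumulative_outputs:
        if full_text.startswith(prompt):
            # Strip the echoed prompt so the UI never sees it.
            full_text = full_text[len(prompt):]
        new_text = full_text[emitted:]
        emitted = len(full_text)
        if new_text:
            yield new_text  # only the newly generated text reaches the UI

If a backend that echoes the prompt is not handled this way, the UI can briefly show text and then discard it, which would be consistent with the "response appears then gets deleted" symptom described in this thread.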
Should I expect to be able to pull the docker image and have the fix?
@stratus-ss Your case, using h2oGPT docker with a torch model, shouldn't be having these issues. I'll try your exact command.
With this:
sudo rm -rf ~/save/
mkdir -p ~/save
docker run --gpus '"device=0,1"' --runtime=nvidia --shm-size=2g -p 7860:7860 --rm --init --network host -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -u `id -u`:`id -g` -v "${HOME}"/.cache:/workspace/.cache -v "${HOME}"/save:/workspace/save gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py --base_model=h2oai/h2ogpt-16k-codellama-13b-python --use_safetensors=True --prompt_type=llama2 --save_dir='/workspace/save/' --use_gpu_id=False --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024
I see the same thing: the response appears briefly but is rapidly and completely deleted. Probably related to the same issues discussed above for GGML.
However, if I run this on main outside docker, it is fine:
python generate.py --base_model=h2oai/h2ogpt-4096-llama2-7b-chat --use_safetensors=True --prompt_type=llama2 --save_dir='/home/jon/save/' --use_gpu_id=False --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024
Still looking...
Hmm, this also loses streamed output; same setup, just a different model (code llama):
python generate.py --base_model=h2oai/h2ogpt-16k-codellama-13b-python --use_safetensors=True --prompt_type=llama2 --save_dir='/home/jon/save/' --use_gpu_id=False --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024
Let me know what I can do to help debug.
@stratus-ss Oh, duh, you are using a foundational model, not an instruct or chat model. What is your intention with the model?
If you use h2oai/h2ogpt-16k-codellama-13b-python, you will need to stick to --prompt_type=plain and use it as a model that continues whatever you give it; h2oGPT doesn't do infilling etc. If you want an instruct model, you should use https://huggingface.co/h2oai/h2ogpt-16k-codellama-13b-instruct instead.
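For example, a run line along these lines should work; this is my assumption based on the other run lines in this thread, not a command verified by the maintainer (in particular the prompt_type for the instruct model is assumed):
python generate.py --base_model=h2oai/h2ogpt-16k-codellama-13b-instruct --use_safetensors=True --prompt_type=llama2 --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024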
Closing since solved. Feel free to ask more questions; I will still respond.
If you note in the original description, this is present in the chat model as well: models--h2oai--h2ogpt-4096-llama2-7b-chat
@pseudotensor I will change the model again and see what happens, but I had this on the chat model as well.
Resolution is confirmed. Thank you!
Browsers: Firefox (latest) & Brave (latest)
I'm not exactly sure how to explain this, but I have stood up h2oGPT in docker with the following command:
This appears to function as it should: you can interact with the UI, download new models, etc. However, I have been experiencing two very odd symptoms which I suspect are related, though I can't say for sure.
Often a prompt produces no answer at all, even though I can see activity in nvtop. There are no logs in docker beyond the standard query-received messages. On a whim I tried inputting the same question multiple times: as in, cut and paste, wait for the failure, and then try again. On the 3rd try, consistently, the UI will then start providing an answer. However, it always stops short.
I am specifically trying to write a unit test to test the following decorator I created
How would I use mock to 'stand up' redis and elastic search just to write this unit test
The model starts to respond as expected but then stops part way through the response.
In addition, the syntax highlighting often breaks, as in the following example:
I have reproduced this on both models--h2oai--h2ogpt-16k-codellama-13b-python and models--h2oai--h2ogpt-4096-llama2-7b-chat.
System specs are below:
The model is being run off an NFS share.
As an aside, I have been successfully running LLamaGPT from the Umbrel folks on this same hardware without these issues, which casts some doubt on it being a hardware problem.
What can/should I be looking into? How do I get more debugging info?