h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

BUG? UI Stops Writing Output #943

Closed stratus-ss closed 11 months ago

stratus-ss commented 11 months ago

Browsers: Firefox (latest) & Brave (latest)

I'm not exactly sure how to explain this, but I have stood up h2oGPT in Docker with the following command:

CUDA_VISIBLE_DEVICES="0,1"
docker run        --gpus '"device=0,1"' \
        --runtime=nvidia \
        --shm-size=10g \
        -p 7860:7860 \
        --rm --init  \
        --network host \
        -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
        -v /etc/passwd:/etc/passwd:ro \
        -v /etc/group:/etc/group:ro \
        -u `id -u`:`id -g` \
        -v /root/llama-gpt/models/h20:/workspace/.cache \
        -v /root/llama-gpt/models/h20/save:/workspace/save \
        gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
           --base_model=h2oai/h2ogpt-16k-codellama-13b-python \
           --use_safetensors=True \
           --prompt_type=llama2 \
           --save_dir='/workspace/save/' \
           --use_gpu_id=False \
           --score_model=None \
           --max_max_new_tokens=2048 \
           --max_new_tokens=1024

This appears to function as it should: you can interact with the UI, download new models, etc. However, I have been experiencing two very odd symptoms which I suspect are related, though I can't say for sure.

  1. On somewhat longer questions, you enter the question and the ellipsis shows up as if it's thinking (I can see activity in nvtop), and the new element where text normally appears shows up, but nothing happens. There is no output and no further activity is seen via nvtop. There are no logs in Docker beyond the standard query-received messages.

On a whim, I tried inputting the same question multiple times: cut and paste, wait for the failure, and then try again. On the third try, consistently, the UI will then start providing an answer. However, it always stops short.

  2. On shorter queries such as:

I am specifically trying to write a unit test to test the following decorator I created

def check_admin(user):
   """if user has access to restricted views"""
   return user.is_staff or user.groups.filter(name="admin").exists()

@method_decorator(user_passes_test(check_admin), name='dispatch')

How would I use mock to 'stand up' redis and elastic search just to write this unit test

The model starts to respond as expected but then stops part way through the response.

In addition, the syntax highlighting often breaks, as in the following example:

\begin{code}

<some response code>

\end{code}

I have reproduced this with both models--h2oai--h2ogpt-16k-codellama-13b-python and models--h2oai--h2ogpt-4096-llama2-7b-chat.

System specs are below:

System:
  Host: gpt-gpu.stratus.lab Kernel: 5.14.0-284.30.1.el9_2.x86_64 arch: x86_64
    bits: 64 Console: pty pts/0 Distro: Red Hat Enterprise Linux release 9.2
    (Plow)
Machine:
  Type: Desktop System: ASUS product: All Series v: N/A serial: N/A
  Mobo: ASUSTeK model: Z97-WS v: Rev 1.xx serial: 140525546300294
    UEFI: American Megatrends v: 2704 date: 01/14/2016
Memory:
  System RAM: total: 32 GiB available: 31.03 GiB used: 4.04 GiB (13.0%)
  Array-1: capacity: 32 GiB slots: 4 modules: 4 EC: None
  Device-1: DIMM_A1 type: DDR3 size: 8 GiB speed: 1333 MT/s
  Device-2: DIMM_A2 type: DDR3 size: 8 GiB speed: spec: 1600 MT/s
    actual: 1333 MT/s
  Device-3: DIMM_B1 type: DDR3 size: 8 GiB speed: 1333 MT/s
  Device-4: DIMM_B2 type: DDR3 size: 8 GiB speed: spec: 1600 MT/s
    actual: 1333 MT/s
CPU:
  Info: quad core Intel Core i5-4440 [MCP] speed (MHz): avg: 3099
    min/max: 800/3300
Graphics:
  Device-1: NVIDIA GA106 [GeForce RTX 3060 Lite Hash Rate] driver: nvidia
    v: 535.104.05
  Device-2: NVIDIA GP102GL [Quadro P6000] driver: nvidia v: 535.104.05
  Display: server: X.org v: 1.20.11 driver: N/A tty: 80x56
  API: OpenGL Message: GL data unavailable in console, glxinfo missing.
Network:
  Device-1: Intel Ethernet I218-LM driver: e1000e
  Device-2: Intel I210 Gigabit Network driver: igb

The model is being run off an NFS share.

As an aside, I have been successfully running LLamaGPT from the Umbrel folks on this same hardware without these issues, which casts some doubt on this being a hardware problem.

What can/should I be looking into? How do I get more debugging info?

MSZ-MGS commented 11 months ago

Same issue on Windows: no responses are received. I deleted the h2oGPT environment from conda and re-installed everything from scratch, but the issue remains. (screenshot attached)

I prompted llama-2 three times saying "Hi" with no responses; you can see the load spikes on the CPU and GPU.

However, I usually keep a full copy of this environment folder: G:\Users\mazen\miniconda3\envs

I returned to the old copy and it works fine (with a few issues), but responses are working.
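
For anyone who wants the same kind of fallback, here is a rough sketch of taking that snapshot with conda instead of copying the envs folder by hand (the environment name h2ogpt is just a placeholder):

REM Sketch only: snapshot the working environment before upgrading h2oGPT.
REM The env name "h2ogpt" is a placeholder; use the actual name from "conda env list".
conda create --name h2ogpt_backup --clone h2ogpt

REM If an upgrade breaks responses, switch back to the snapshot:
conda activate h2ogpt_backup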

MSZ-MGS commented 11 months ago

Update: when I returned to https://github.com/h2oai/h2ogpt/tree/a1b4716f62e512af678c2e6ea863e2131ffbaa4a, everything worked fine.

pseudotensor commented 11 months ago

@MSZ-MGS Can you provide your run line that does this?

MSZ-MGS commented 11 months ago

Hi, sir, nice to see you again. I don't know if this is what you mean by "run line":

python generate.py --base_model='llama' --model_path_llama=AddedModels\llama-2-7b-chat\llama-2-7b-chat.Q5_K_M.gguf --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path --llamacpp_dict="{'n_gpu_layers':35,'n_batch':128}" --do_sample=True --top_k_docs=-1 --temperature=0.7 --top_p=0.9 --max_seq_len=4095 --share=False

pseudotensor commented 11 months ago

@MSZ-MGS Thanks, pushed a fix for that. I changed things recently to have the LLM code only push new tokens and not the prompt, and a few LLM types were not updated. I pushed a fix for llama.cpp models, and I know the openai/vllm/tgi/torch/exllama/replicate ones already work, so just the gpt4all ones are left to check.
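
Roughly, the intended streaming behavior is like this (illustrative sketch only, not the exact h2oGPT code; names are made up):

def stream_new_tokens(chunks, prompt):
    """Yield only newly generated text, never the prompt or text already sent to the UI."""
    sent = ""
    for full_text in chunks:  # each chunk is the model's cumulative output so far
        text = full_text
        if text.startswith(prompt):  # some backends echo the prompt; strip it before diffing
            text = text[len(prompt):]
        delta = text[len(sent):]  # only what the UI has not seen yet
        if delta:
            sent += delta
            yield delta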

stratus-ss commented 11 months ago

Should I expect to be able to pull the Docker image and have the fix?

pseudotensor commented 11 months ago

@stratus-ss Your case, using h2oGPT Docker with a torch model, shouldn't be having issues. I'll try your exact command.

pseudotensor commented 11 months ago

With this:

sudo rm -rf ~/save/
mkdir -p ~/save
docker run --gpus '"device=0,1"' \
        --runtime=nvidia \
        --shm-size=2g \
        -p 7860:7860 \
        --rm --init \
        --network host \
        -v /etc/passwd:/etc/passwd:ro \
        -v /etc/group:/etc/group:ro \
        -u `id -u`:`id -g` \
        -v "${HOME}"/.cache:/workspace/.cache \
        -v "${HOME}"/save:/workspace/save \
        gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
           --base_model=h2oai/h2ogpt-16k-codellama-13b-python \
           --use_safetensors=True \
           --prompt_type=llama2 \
           --save_dir='/workspace/save/' \
           --use_gpu_id=False \
           --score_model=None \
           --max_max_new_tokens=2048 \
           --max_new_tokens=1024

I see the same thing: the response appears briefly but is then rapidly and completely deleted. Probably related to the same issues discussed above for GGML.

However, if I run this on main outside Docker, it is fine:

python generate.py --base_model=h2oai/h2ogpt-4096-llama2-7b-chat --use_safetensors=True --prompt_type=llama2 --save_dir='/home/jon/save/' --use_gpu_id=False --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024

Still looking...

pseudotensor commented 11 months ago

Hmm, this also loses the streamed output, just with a different model, using Code Llama:

python generate.py --base_model=h2oai/h2ogpt-16k-codellama-13b-python --use_safetensors=True --prompt_type=llama2 --save_dir='/home/jon/save/' --use_gpu_id=False --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024

stratus-ss commented 11 months ago

Let me know what I can do to help debug.

pseudotensor commented 11 months ago

@stratus-ss Oh, duh, you are using a foundational model, not an instruct or chat model. What is your intention with the model?

If you use h2oai/h2ogpt-16k-codellama-13b-python, you will need to stick to --prompt_type=plain and use it as a model that continues whatever you give it. h2oGPT doesn't do infilling etc.

If you want an instruct model, you should use https://huggingface.co/h2oai/h2ogpt-16k-codellama-13b-instruct.
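
Roughly, the two options look like this (sketch only; other flags as in your original run line, and the prompt_type for the instruct model is my assumption):

# Option 1: keep the Python foundation model and use it purely for continuation
python generate.py --base_model=h2oai/h2ogpt-16k-codellama-13b-python --prompt_type=plain --score_model=None

# Option 2: switch to the instruct-tuned variant for chat-style use
python generate.py --base_model=h2oai/h2ogpt-16k-codellama-13b-instruct --prompt_type=llama2 --score_model=None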

pseudotensor commented 11 months ago

Closing since solved. Feel free to ask more questions, I will still respond.

stratus-ss commented 11 months ago

As noted in the original description, this is present in the chat model as well: models--h2oai--h2ogpt-4096-llama2-7b-chat.

stratus-ss commented 11 months ago

@pseudotensor I will change the model again and see what happens, but I had this on the chat model as well.

MSZ-MGS commented 11 months ago

> @MSZ-MGS Thanks, pushed a fix for that. I changed things recently to have the LLM code only push new tokens and not the prompt, and a few LLM types were not updated. I pushed a fix for llama.cpp models, and I know the openai/vllm/tgi/torch/exllama/replicate ones already work, so just the gpt4all ones are left to check.

Resolution is confirmed. Thank you!