h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

add signed cert for windows #938

Open andreasluemkemann opened 11 months ago

andreasluemkemann commented 11 months ago

Hello,

before I start asking questions here, I would like to say thanks first. Among all the AI projects in this space that I have tried, h2oGPT is always the fastest to support new technological developments. Also, the well-thought-out user interface and the convenient choice of numerous models, with a download function, are unparalleled. Fortunately, thanks to the Windows installer, the barrier to entry is low, allowing even non-experts to get up and running quickly. From my point of view, it is a rare symbiosis of expertise and user-friendliness. Thank you!

I run the software under Linux in Docker and under Windows using the installer. As soon as I noticed you had added Mistral, I was curious to try it out, but whether on Docker or Windows, I was not able to get it running. On Windows I downloaded the installer from October 6th (my birthday :) and tried again. I also tried exchanging the files you mentioned in your pull request, without success.

Then I tried to run inference on Hugging Face, following the instructions in "README_InferenceServers.md" as far as I could. Unfortunately, my skills are obviously not sufficient.

I have two questions:

1) How do I get the Mistral 7B model installed on Windows via the GUI?

2) More of a request than a question: could anyone walk me through running inference on Hugging Face instead of running the models locally? Is it possible to use the free Inference API? If not, I also have a paid HF account: what is the best way to set up the inference interface at HF, and how do I make it run?

Many thanks in advance, Andy

pseudotensor commented 11 months ago

Thanks for the kind words!

The Windows installer doesn't have the Mistral prompting in it; that change came after the installer was built, so I'll need to rebuild it. I'll do that sometime in the next few hours; we don't have automatic builds for Windows/Mac.

The Docker image should have that change as of this morning. What problem do you encounter? Ensure you update the requirements, i.e.

pip install -r requirements.txt

since transformers needs to be newer. Some others have had issues when the flash_attn package was installed, so I recommend uninstalling it:

pip uninstall flash-attn
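
To verify the environment afterwards, a quick check like this should do (a minimal sketch of mine, assuming a standard pip setup; not part of the original instructions):

# Confirm transformers imports and print its version, then confirm
# flash-attn is no longer installed (pip show exits nonzero if absent).
python -c "import transformers; print(transformers.__version__)"
pip show flash-attn || echo "flash-attn not installed (good)"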

As of the new Docker image from this morning, this works for me:

docker pull gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0

# Host directories for the model cache and saved outputs, mounted
# into the container below; restrict the run to GPU 0.
mkdir -p ~/.cache
mkdir -p ~/save
export CUDA_VISIBLE_DEVICES=0
docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       -p 7860:7860 \
       --rm --init \
       --network host \
       -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=mistralai/Mistral-7B-Instruct-v0.1 \
          --save_dir='/workspace/save/' \
          --use_gpu_id=False \
          --score_model=None \
          --max_max_new_tokens=2048 \
          --max_new_tokens=1024

One does not have to add --prompt_type=mistral, since the prompt type is built into h2oGPT for that particular base model name.

Also, don't pass --use_safetensors=True, as it's not applicable here.

This gives:

[screenshot: h2oGPT UI running Mistral-7B-Instruct-v0.1]

FYI, the docker command for vLLM would look like:

docker run -d \
    --runtime=nvidia \
    --gpus '"device=0"' \
    --shm-size=10.24gb \
    -p 5000:5000 \
    --entrypoint /h2ogpt_conda/vllm_env/bin/python3.10 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    --network host \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 -m vllm.entrypoints.openai.api_server \
        --port=5000 \
        --host=0.0.0.0 \
        --model=mistralai/Mistral-7B-Instruct-v0.1 \
        --tensor-parallel-size=1 \
        --seed 1234 \
        --trust-remote-code \
        --max-num-batched-tokens 65536 \
        --download-dir=/workspace/.cache/huggingface/hub &>> logs.vllm_server.mistral.txt

This is so the model has full access to its context, 8k tokens or more.
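
Once the server is up, a quick sanity check against vLLM's OpenAI-compatible endpoint should work (a minimal sketch; localhost:5000 matches the command above):

# Ask the server for a short completion; the model name must match
# the --model passed to the vLLM container.
curl http://localhost:5000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.1", "prompt": "Hello, how are you?", "max_tokens": 32}'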

You can connect to it from h2oGPT if you pass both --base_model and --inference_server.
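
For example, something like the following (a sketch; the exact vllm:host:port form for --inference_server is covered in README_InferenceServers.md, so double-check the syntax there):

# Run the h2oGPT UI against the external vLLM server rather than
# loading the model locally; host/port match the command above.
python generate.py \
    --base_model=mistralai/Mistral-7B-Instruct-v0.1 \
    --inference_server=vllm:localhost:5000 \
    --score_model=None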

andreasluemkemann commented 10 months ago

Thank you for this quick and excellent response!

I must admit: besides excellent usability and the latest technology, you also round out your ambitious project with excellent support. Did you ever think about doing this for a living? ;)

With your help I was able to get the Docker container running with Mistral on my local graphics card, thanks a lot!

When I ran the Windows installer and those BS warnings from the browser and M$ popped up, I got an idea for how I could support your great project.

I thought of donating a code-signing certificate with a smartcard reader for your project. I know this is actually snake oil, but for the group of (perhaps fearful) Windows users, a signed installer might break down inhibitions against free software.

Another thing I noticed while running inference with different AI models: (NVMe) storage is wasted by binary-identical models in different folders and volumes. I had success with data deduplication; see the screenshot below.
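
On Linux, for example, something like rdfind can collapse byte-identical files into hardlinks (a minimal sketch, assuming the rdfind tool is installed; not necessarily the method shown in the screenshot):

# Dry-run first to see which duplicates would be collapsed, then
# replace byte-identical files under the HF cache with hardlinks.
rdfind -dryrun true -makehardlinks true ~/.cache/huggingface/hub
rdfind -makehardlinks true ~/.cache/huggingface/hub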

If you think it helps, I'd be happy to do a tutorial on it.

Best regards, Andy

[screenshot: deduplicated model storage]

pseudotensor commented 10 months ago

Thanks. I'll check on the deduplication, and good idea about a signed certificate for Windows!