intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0
6.45k stars 1.24k forks

Patching failed when building the base image in trusted-bigdl-llm, and TypeError when running FastChat in PPML #8961

Open Jansper-x opened 12 months ago

Jansper-x commented 12 months ago

I am trying to use trusted-bigdl-llm to run FastChat. When I build the base image, I get the following patching error:

patching file /usr/local/lib/python3.9/dist-packages/bigdl/llm/utils/utils.py
Hunk #1 FAILED at 22.
1 out of 1 hunk FAILED -- saving rejects to file /usr/local/lib/python3.9/dist-packages/bigdl/llm/utils/utils.py.rej
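(For debugging, the rejected hunk can be inspected to see how the installed utils.py differs from what utils.diff expects; the paths are taken from the error above, and this is only a sketch, not part of the original report:)

cat /usr/local/lib/python3.9/dist-packages/bigdl/llm/utils/utils.py.rej
# compare with the file the patch was meant to modify
head -n 40 /usr/local/lib/python3.9/dist-packages/bigdl/llm/utils/utils.py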


My Dockerfile:

ARG BIGDL_VERSION=2.4.0-SNAPSHOT
ARG BASE_IMAGE_NAME=intelanalytics/bigdl-ppml-gramine-base
ARG BASE_IMAGE_TAG=2.4.0-SNAPSHOT

FROM $BASE_IMAGE_NAME:$BASE_IMAGE_TAG 
ARG http_proxy
ARG https_proxy
ARG no_proxy


ADD utils.diff /opt/utils.diff
ADD llm_cli.diff /opt/llm_cli.diff

RUN pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple  --pre --upgrade bigdl-llm[all]==2.4.0b20230806 && \
    # Remove all dependencies from nvidia as we are supposed to run in SGX
    pip3 list | grep nvidia | awk '{print $1}' | xargs pip3 uninstall -y && \
    # Replace bigdl-llm's bundled libs with our own dependencies
    cd /usr/local/lib/python3.9/dist-packages/bigdl/llm/libs && \
    rm *avx512* && \
    rm quantize* && \
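    # Download replacement binaries from the analytics-zoo SourceForge ppml-llm feed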
    curl "https://sourceforge.net/projects/analytics-zoo/rss?path=/ppml-llm" | grep "<link>.*</link>" | sed 's|<link>||;s|</link>||' | while read url; do url=`echo $url | sed 's|/download$||'`; wget $url ; done && \
    rm index.html && \
    chmod +x * && \
    cd /ppml && \
    # Patch subprocess call in bigdl-llm
    patch -R /usr/local/lib/python3.9/dist-packages/bigdl/llm/utils/utils.py /opt/utils.diff && \
    patch -R /usr/local/bin/llm-cli /opt/llm_cli.diff && \
    # Gramine commands
    gramine-argv-serializer bash -c 'export TF_MKL_ALLOC_MAX_BYTES=10737418240 && $sgx_command' > /ppml/secured_argvs && \
    cd /ppml/ && \
    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    # Installing FastChat from source requires PEP 660 support, so upgrade pip first
    python3 get-pip.py && \
    rm get-pip.py && \
    # Install FastChat
    git clone https://github.com/analytics-zoo/FastChat.git && \
    cd /ppml/FastChat && \
    git checkout dev-2023-08-01 && \
    pip install -e . && \
    # Pin the gradio version because of this error: https://github.com/lm-sys/FastChat/issues/1925
    pip install --pre --upgrade gradio==3.36.1 && \
    pip install --pre --upgrade bigdl-nano && \
    apt-get install -y libunwind8-dev && \
    mkdir /ppml/data

WORKDIR /ppml
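(For completeness: a build command for this Dockerfile would look roughly like the one below. The image name/tag and the proxy values are placeholders, not taken from the original report.)

docker build \
    --build-arg http_proxy=$http_proxy \
    --build-arg https_proxy=$https_proxy \
    --build-arg no_proxy=$no_proxy \
    -t my-trusted-bigdl-llm-base:2.4.0-SNAPSHOT \
    .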

Then I use the public reference image provided by BigDL PPML, intelanalytics/bigdl-ppml-trusted-bigdl-llm-gramine-ref:2.4.0-SNAPSHOT, to run FastChat:

sudo docker run -itd --net=host --cpus=16 --oom-kill-disable --device=/dev/sgx/enclave --device=/dev/sgx/provision --name=bigdl-ppml-client-local -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket -v /root/LLM-TEE/chatglm.cpp/chatglm2-6b-int4:/ppml/chatglm2-6b-int4  -v /root/nfs_share/LLM_DEV/ChatGLM2-6B/chatglm-6b:/ppml/chatglm-6b  -v /root/BigDL23/BigDL/ppml/keys:/ppml/keys  -e RUNTIME_DRIVER_PORT=54321 -e RUNTIME_DRIVER_MEMORY=32G  -e LOCAL_IP=11.50.52.28 intelanalytics/bigdl-ppml-trusted-bigdl-llm-gramine-ref:2.4.0-SNAPSHOT bash
docker exec -it bigdl-ppml-client-local python3 -m fastchat.serve.cli --model-path /ppml/chatglm-6b --device cpu

But I got this error:

(base) root@tee-cluster-sgx1:~/BigDL23/BigDL/ppml/trusted-bigdl-llm/ref# docker exec -it bigdl-ppml-client-local python3 -m fastchat.serve.cli --model-path /ppml/chatglm-6b --device cpu
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:13<00:00,  1.97s/it]
问: 你好
答: Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ppml/FastChat/fastchat/serve/cli.py", line 239, in <module>
    main(args)
  File "/ppml/FastChat/fastchat/serve/cli.py", line 176, in main
    chat_loop(
  File "/ppml/FastChat/fastchat/serve/inference.py", line 368, in chat_loop
    outputs = chatio.stream_output(output_stream)
  File "/ppml/FastChat/fastchat/serve/cli.py", line 40, in stream_output
    for outputs in output_stream:
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/ppml/FastChat/fastchat/model/model_chatglm.py", line 71, in generate_stream_chatglm
    for total_ids in model.stream_generate(**inputs, **gen_kwargs):
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1145, in stream_generate
    outputs = self(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 934, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 830, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: chatglm_attention_forward() got an unexpected keyword argument 'kv_cache'
gc-fu commented 11 months ago

Sorry, I cannot reproduce the second issue :disappointed_relieved:

This is my environment:


#!/bin/bash

export DOCKER_IMAGE=intelanalytics/bigdl-ppml-trusted-bigdl-llm-gramine-ref:2.4.0-SNAPSHOT
export MODEL_PATH="/home/llm/models/"

sudo docker run -itd \
        --net=host \
        --cpuset-cpus="5-20" \
        --cpuset-mems="0" \
        --memory="32G" \
        --name=gc-debug \
        -v $MODEL_PATH:/ppml/models \
        --shm-size="16g" \
        $DOCKER_IMAGE

Then, docker exec -it gc-debug bash

I tried both chatglm-6b and chatglm2-6b models, and they both worked fine in my environment.


Do you use our latest intelanalytics/bigdl-ppml-trusted-bigdl-llm-gramine-ref:2.4.0-SNAPSHOT image? :eyes:

You can also add a print statement at /ppml/FastChat/fastchat/model/model_adapter.py L545 to ensure that the patched ChatGLM model adapter is used.
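(A quick way to do a similar check without editing the file, assuming this FastChat fork exposes the usual get_model_adapter helper; an illustrative sketch only:)

# assumes fastchat.model.model_adapter.get_model_adapter exists in this fork
docker exec -it bigdl-ppml-client-local python3 -c \
  "from fastchat.model.model_adapter import get_model_adapter; print(get_model_adapter('/ppml/chatglm-6b'))"

If the patched ChatGLM adapter is in use, its class name should be printed here.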

For the first issue, @hzjane can you help to fix this issue?

hzjane commented 11 months ago

The first issue may be caused by changes in the BigDL-LLM version. To address it, you can build with the latest Dockerfile.

Jansper-x commented 11 months ago

Thanks, I have solved this problem. The reason is that I had mistakenly named the chatglm2-6b model folder chatglm-6b, which caused the error during model loading. After changing the folder name back to chatglm2-6b, the model loads and runs normally, and I can now run inference in Docker.


I noticed that BigDL accelerates inference computation by using AVX-512. But when I run FastChat on my local host without bigdl and in the Docker image with bigdl, the inference times I get are not much different; sometimes it even takes longer in Docker. For example, this question takes 36 seconds on my local platform but 62 seconds in the Docker image.


Does this require additional configuration? I have checked that my CPU supports AVX-512.

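(For reference, the AVX-512 feature flags can be listed with a generic check like this; not part of the original post:)

# list the AVX-512 flags reported by the CPU
grep -o 'avx512[a-z_0-9]*' /proc/cpuinfo | sort -u
# or, via lscpu
lscpu | tr ' ' '\n' | grep -i '^avx512' | sort -u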
gc-fu commented 11 months ago

Do you use the same core/memory configuration when you run in the container and on the local machine?

Besides, please try the --cpuset-cpus option instead of --cpus.

We recommend using source bigdl-nano-init -t to set up the environment; try the following:

source bigdl-nano-init -t
# bigdl-nano-init may set wrong number of cores in container
export OMP_NUM_THREADS="YOUR_CORE_NUM"
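
For example, inside the container the whole flow might look like this (the core count and model path below are only placeholders for illustration):

source bigdl-nano-init -t
export OMP_NUM_THREADS=16    # match the cores actually given to the container (e.g. your --cpuset-cpus range)
python3 -m fastchat.serve.cli --model-path /ppml/chatglm2-6b --device cpu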


Jansper-x commented 11 months ago

I increased the number of cores and the memory, and inference is now faster inside the container than outside. I would like to learn more about how bigdl-llm uses AVX for acceleration. How is this acceleration implemented? Where is this part of the code? Is there any change or patching of PyTorch?

jason-dai commented 11 months ago

bigdl-llm automatically applies Intel-specific optimizations (including AMX, VNNI, AVX, etc.) to the LLM; no change or patching to PyTorch is needed.
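
As an illustration from the user side (a sketch based on the bigdl-llm transformers-style API of that release; the model path is a placeholder):

# optimizations are applied when the model is loaded through bigdl.llm.transformers,
# on top of stock PyTorch; the model path below is only an example
python3 - <<'EOF'
from bigdl.llm.transformers import AutoModel

model = AutoModel.from_pretrained(
    "/ppml/chatglm2-6b",   # placeholder path to a local ChatGLM2 checkpoint
    load_in_4bit=True,     # low-bit optimization applied at load time
    trust_remote_code=True,
)
print(type(model))
EOF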