intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0
6.45k stars 1.24k forks

Patching failed when building the base image in trusted-bigdl-llm, and TypeError when running FastChat in PPML #8961

Open Jansper-x opened 12 months ago

Jansper-x commented 12 months ago

I am trying to use trusted-bigdl-llm to run FastChat. When I build the base image, I get the following patching error:

patching file /usr/local/lib/python3.9/dist-packages/bigdl/llm/utils/utils.py
Hunk #1 FAILED at 22.
1 out of 1 hunk FAILED -- saving rejects to file /usr/local/lib/python3.9/dist-packages/bigdl/llm/utils/utils.py.rej
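(For debugging, the rejected hunk can be inspected to see how the installed utils.py differs from what utils.diff expects; the paths are taken from the error above, and this is only a sketch, not part of the original report:)

cat /usr/local/lib/python3.9/dist-packages/bigdl/llm/utils/utils.py.rej
# compare with the file the patch was meant to modify
head -n 40 /usr/local/lib/python3.9/dist-packages/bigdl/llm/utils/utils.py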


My Dockerfile:

ARG BIGDL_VERSION=2.4.0-SNAPSHOT
ARG BASE_IMAGE_NAME=intelanalytics/bigdl-ppml-gramine-base
ARG BASE_IMAGE_TAG=2.4.0-SNAPSHOT

FROM $BASE_IMAGE_NAME:$BASE_IMAGE_TAG 
ARG http_proxy
ARG https_proxy
ARG no_proxy


ADD utils.diff /opt/utils.diff
ADD llm_cli.diff /opt/llm_cli.diff

RUN pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple  --pre --upgrade bigdl-llm[all]==2.4.0b20230806 && \
    # Remove all dependencies from nvidia as we are supposed to run in SGX
    pip3 list | grep nvidia | awk '{print $1}' | xargs pip3 uninstall -y && \
    # Replace bigdl-llm's bundled libs with our own dependencies
    cd /usr/local/lib/python3.9/dist-packages/bigdl/llm/libs && \
    rm *avx512* && \
    rm quantize* && \
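    # Download replacement binaries from the analytics-zoo SourceForge ppml-llm feed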
    curl "https://sourceforge.net/projects/analytics-zoo/rss?path=/ppml-llm" | grep "<link>.*</link>" | sed 's|<link>||;s|</link>||' | while read url; do url=`echo $url | sed 's|/download$||'`; wget $url ; done && \
    rm index.html && \
    chmod +x * && \
    cd /ppml && \
    # Patch subprocess call in bigdl-llm
    patch -R /usr/local/lib/python3.9/dist-packages/bigdl/llm/utils/utils.py /opt/utils.diff && \
    patch -R /usr/local/bin/llm-cli /opt/llm_cli.diff && \
    # Gramine commands
    gramine-argv-serializer bash -c 'export TF_MKL_ALLOC_MAX_BYTES=10737418240 && $sgx_command' > /ppml/secured_argvs && \
    cd /ppml/ && \
    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    # Installing FastChat from source requires PEP 660 support, so upgrade pip first
    python3 get-pip.py && \
    rm get-pip.py && \
    # Install FastChat
    git clone https://github.com/analytics-zoo/FastChat.git && \
    cd /ppml/FastChat && \
    git checkout dev-2023-08-01 && \
    pip install -e . && \
    # Pin the gradio version because of this error: https://github.com/lm-sys/FastChat/issues/1925
    pip install --pre --upgrade gradio==3.36.1 && \
    pip install --pre --upgrade bigdl-nano && \
    apt-get install -y libunwind8-dev && \
    mkdir /ppml/data

WORKDIR /ppml
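(For completeness: a build command for this Dockerfile would look roughly like the one below. The image name/tag and the proxy values are placeholders, not taken from the original report.)

docker build \
    --build-arg http_proxy=$http_proxy \
    --build-arg https_proxy=$https_proxy \
    --build-arg no_proxy=$no_proxy \
    -t my-trusted-bigdl-llm-base:2.4.0-SNAPSHOT \
    .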

Then I use the public reference image provided by BigDL PPML, intelanalytics/bigdl-ppml-trusted-bigdl-llm-gramine-ref:2.4.0-SNAPSHOT, to run FastChat:

sudo docker run -itd --net=host --cpus=16 --oom-kill-disable --device=/dev/sgx/enclave --device=/dev/sgx/provision --name=bigdl-ppml-client-local -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket -v /root/LLM-TEE/chatglm.cpp/chatglm2-6b-int4:/ppml/chatglm2-6b-int4  -v /root/nfs_share/LLM_DEV/ChatGLM2-6B/chatglm-6b:/ppml/chatglm-6b  -v /root/BigDL23/BigDL/ppml/keys:/ppml/keys  -e RUNTIME_DRIVER_PORT=54321 -e RUNTIME_DRIVER_MEMORY=32G  -e LOCAL_IP=11.50.52.28 intelanalytics/bigdl-ppml-trusted-bigdl-llm-gramine-ref:2.4.0-SNAPSHOT bash
docker exec -it bigdl-ppml-client-local python3 -m fastchat.serve.cli --model-path /ppml/chatglm-6b --device cpu

But I got this error:

(base) root@tee-cluster-sgx1:~/BigDL23/BigDL/ppml/trusted-bigdl-llm/ref# docker exec -it bigdl-ppml-client-local python3 -m fastchat.serve.cli --model-path /ppml/chatglm-6b --device cpu
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:13<00:00,  1.97s/it]
问: 你好
答: Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ppml/FastChat/fastchat/serve/cli.py", line 239, in <module>
    main(args)
  File "/ppml/FastChat/fastchat/serve/cli.py", line 176, in main
    chat_loop(
  File "/ppml/FastChat/fastchat/serve/inference.py", line 368, in chat_loop
    outputs = chatio.stream_output(output_stream)
  File "/ppml/FastChat/fastchat/serve/cli.py", line 40, in stream_output
    for outputs in output_stream:
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/ppml/FastChat/fastchat/model/model_chatglm.py", line 71, in generate_stream_chatglm
    for total_ids in model.stream_generate(**inputs, **gen_kwargs):
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1145, in stream_generate
    outputs = self(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 934, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 830, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: chatglm_attention_forward() got an unexpected keyword argument 'kv_cache'
gc-fu commented 11 months ago

Sorry, I cannot reproduce the second issue :disappointed_relieved:

This is my environment:


#!/bin/bash

export DOCKER_IMAGE=intelanalytics/bigdl-ppml-trusted-bigdl-llm-gramine-ref:2.4.0-SNAPSHOT
export MODEL_PATH="/home/llm/models/"

sudo docker run -itd \
        --net=host \
        --cpuset-cpus="5-20" \
        --cpuset-mems="0" \
        --memory="32G" \
        --name=gc-debug \
        -v $MODEL_PATH:/ppml/models \
        --shm-size="16g" \
        $DOCKER_IMAGE

Then, docker exec -it gc-debug bash

I tried both chatglm-6b and chatglm2-6b models, and they both worked fine in my environment.


Do you use our latest intelanalytics/bigdl-ppml-trusted-bigdl-llm-gramine-ref:2.4.0-SNAPSHOT image? :eyes:

You can also add a print statement at /ppml/FastChat/fastchat/model/model_adapter.py L545 to ensure that the patched ChatGLM model adapter is used.
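(A quick way to do a similar check without editing the file, assuming this FastChat fork exposes the usual get_model_adapter helper; an illustrative sketch only:)

# assumes fastchat.model.model_adapter.get_model_adapter exists in this fork
docker exec -it bigdl-ppml-client-local python3 -c \
  "from fastchat.model.model_adapter import get_model_adapter; print(get_model_adapter('/ppml/chatglm-6b'))"

If the patched ChatGLM adapter is in use, its class name should be printed here.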

For the first issue, @hzjane can you help to fix this issue?

hzjane commented 11 months ago

The first issue may be caused by changes in the BigDL-LLM version. To address it, you can build with the latest Dockerfile.

Jansper-x commented 11 months ago

Thanks, I have solved this problem. The reason is that I had mistakenly named the chatglm2-6b model folder chatglm-6b, which caused the error during model loading. After changing the folder name back to chatglm2-6b, the model loads and runs normally, and I can now run inference in Docker.


I noticed that BigDL accelerates inference computation by using AVX-512. But when I run FastChat on my local host without bigdl and in the Docker image with bigdl, the inference times I get are not much different; sometimes it even takes longer in Docker. For example, this question takes 36 seconds on my local platform but 62 seconds in the Docker image.


Does this require additional configuration? I have checked that my CPU supports AVX-512.

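(For reference, the AVX-512 feature flags can be listed with a generic check like this; not part of the original post:)

# list the AVX-512 flags reported by the CPU
grep -o 'avx512[a-z_0-9]*' /proc/cpuinfo | sort -u
# or, via lscpu
lscpu | tr ' ' '\n' | grep -i '^avx512' | sort -u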
gc-fu commented 11 months ago

Do you use the same core/memory configuration when you run in the container and on the local machine?

Besides, please try the --cpuset-cpus option instead of --cpus.

We recommend using source bigdl-nano-init -t to set up the environment; try the following:

source bigdl-nano-init -t
# bigdl-nano-init may set wrong number of cores in container
export OMP_NUM_THREADS="YOUR_CORE_NUM"
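
For example, inside the container the whole flow might look like this (the core count and model path below are only placeholders for illustration):

source bigdl-nano-init -t
export OMP_NUM_THREADS=16    # match the cores actually given to the container (e.g. your --cpuset-cpus range)
python3 -m fastchat.serve.cli --model-path /ppml/chatglm2-6b --device cpu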


Jansper-x commented 11 months ago

I increased the number of cores and the memory, and inference is now faster inside the container than outside. I would like to learn more about how bigdl-llm uses AVX for acceleration. How is this acceleration implemented? Where is this part of the code? Is there any change or patching of PyTorch?

jason-dai commented 11 months ago

bigdl-llm automatically applies Intel-specific optimizations (including AMX, VNNI, AVX, etc.) to the LLM; no change or patching to PyTorch is needed.
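
As an illustration from the user side (a sketch based on the bigdl-llm transformers-style API of that release; the model path is a placeholder):

# optimizations are applied when the model is loaded through bigdl.llm.transformers,
# on top of stock PyTorch; the model path below is only an example
python3 - <<'EOF'
from bigdl.llm.transformers import AutoModel

model = AutoModel.from_pretrained(
    "/ppml/chatglm2-6b",   # placeholder path to a local ChatGLM2 checkpoint
    load_in_4bit=True,     # low-bit optimization applied at load time
    trust_remote_code=True,
)
print(type(model))
EOF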