intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

IPEX-LLM with Langchain-chatchat runs into httpcore.RemoteProtocolError in MTL with iGPU #11259

Open zcwang opened 5 months ago

zcwang commented 5 months ago

Hello, I am running the chatglm3-6b LLM with Langchain-Chatchat on the iGPU of my MTL Core Ultra 7 155H, and it keeps hitting the following error.

2024-06-07 13:50:11,037 - utils.py[line:38] - ERROR: peer closed connection without sending complete message body (incomplete chunked read)
Traceback (most recent call last):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpx/_transports/default.py", line 67, in map_httpcore_exceptions
    yield
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpx/_transports/default.py", line 252, in __aiter__
    async for part in self._httpcore_stream:
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 367, in __aiter__
    raise exc from None
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 363, in __aiter__
    async for part in self._stream:
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 349, in __aiter__
    raise exc
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 341, in __aiter__
    async for chunk in self._connection._receive_response_body(**kwargs):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 210, in _receive_response_body
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 220, in _receive_event
    with map_exceptions({h11.RemoteProtocolError: RemoteProtocolError}):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)

The generated logs are provided above.

Test Environment:

(mytest) intel@mydevice:~/work/Langchain-Chatchat/logs$ python -c "from openvino import Core; print(Core().available_devices);"
['CPU', 'GPU', 'NPU']
(mytest) intel@mydevice:~/work/Langchain-Chatchat/logs$ python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
OPENMP DISPLAY ENVIRONMENT END
2.1.0.post0+cxx11.abi
2.1.20+git0e2bee2
[0]: _DeviceProperties(name='Intel(R) Arc(TM) Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=59466MB, max_compute_units=128, gpu_eu_count=128)
(mytest) intel@mydevice:~/work/Langchain-Chatchat/logs$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 155H OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO  [24.13.29138.7]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.29138]

BTW, warmup.py itself runs fine on the iGPU (a rough sketch of what such a warm-up does follows the log below)...

(mytest) intel@mydevice:~/work/Langchain-Chatchat$ python warmup.py
2024-06-07 14:05:32,097 - INFO - intel_extension_for_pytorch auto imported
>> NOTE: The one-time warmup may take several minutes. Please be patient until it finishes warm-up...
---------------  Start warming-up LLM chatglm3-6b on MTL iGPU  ---------------
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 11.85it/s]
2024-06-07 14:05:32,770 - INFO - Converting the current model to sym_int4 format......
2024-06-07 14:05:38,036 - WARNING - Setting eos_token is not supported, use the default one.
2024-06-07 14:05:38,036 - WARNING - Setting pad_token is not supported, use the default one.
2024-06-07 14:05:38,036 - WARNING - Setting unk_token is not supported, use the default one.
---------------  Warming-up of LLM chatglm3-6b on MTL iGPU is completed (1/4)  ---------------
---------------  Start warming-up embedding model bge-large-zh-v1.5 on MTL iGPU  ---------------
/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
2024-06-07 14:05:44,494 - INFO - Converting the current model to fp16 format......
/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
---------------  Warming-up of embedding model bge-large-zh-v1.5 on MTL iGPU is completed (3/4)  ---------------

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 155H]
Registry and code: 13 MB
Command: python warmup.py
Uptime: 14.982780 s
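
For context, a minimal sketch of what such a warm-up amounts to with ipex-llm. This is an illustration, not the actual Langchain-Chatchat warmup.py, and the model path is a placeholder: load the model in 4-bit, move it to the XPU, and run one short generation so the SYCL kernels get compiled and cached.

import torch
from ipex_llm.transformers import AutoModel      # ChatGLM models use AutoModel in ipex-llm examples
from transformers import AutoTokenizer

model_path = "THUDM/chatglm3-6b"                 # placeholder: local path or HF model id
model = AutoModel.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
model = model.to("xpu")                          # move the 4-bit model to the MTL iGPU
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    inputs = tokenizer("hello", return_tensors="pt").to("xpu")
    # the first generate call triggers SYCL kernel compilation and caching
    model.generate(**inputs, max_new_tokens=8)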
Oscilloscope98 commented 5 months ago

Hi @zcwang, we are reproducing this issue and will let you know when there are any updates :)

Oscilloscope98 commented 5 months ago

Hi @zcwang,

This is because the one-time warm-up for the SYCL cache on MTL iGPU does not actually take effect on Linux.

We have just updated the Langchain-Chatchat Setup Guide for Linux with Intel Core Ultra integrated GPU. Please follow this guide and try again with our latest ipex-llm (>=2.1.0b20240612).

Please note that, unlike the one-time warm-up on MTL iGPU for Windows, on Linux the warm-up of the LLM model is performed when you start the first conversation, and the warm-up of the embedding model happens either when you create a knowledge base or when you start the first Knowledge Base QA/File Chat conversation. So please expect a warm-up of several minutes during your first conversation with an LLM model, or when you create a new knowledge base with an embedding model.

Please let us know if you run into any further problems :)
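
For reference, one quick way to confirm that the active environment actually picked up a new enough ipex-llm (assuming a pip-managed install in the current conda env):

pip show ipex-llm | grep -i version    # expect a version >= 2.1.0b20240612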

zcwang commented 4 months ago

@Oscilloscope98, I skipped the warm-up phase by directly running "python startup.py -a" with the following environment settings, but it still fails.

...
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
export BIGDL_IMPORT_IPEX=0
export no_proxy=localhost,127.0.0.1
export FASTCHAT_WORKER_API_TIMEOUT=600
...

Here is the error log.

...
2024-06-20 14:32:51 | INFO | model_worker | Loading the model ['chatglm3-6b'] on worker bcb2cd49, worker type: BigDLLLM worker...
2024-06-20 14:32:51 | INFO | model_worker | Using low bit format: sym_int4, device: xpu
2024-06-20 14:32:51 | WARNING | transformers_modules.chatglm3-6b.tokenization_chatglm | Setting eos_token is not supported, use the default one.
2024-06-20 14:32:51 | WARNING | transformers_modules.chatglm3-6b.tokenization_chatglm | Setting pad_token is not supported, use the default one.
2024-06-20 14:32:51 | WARNING | transformers_modules.chatglm3-6b.tokenization_chatglm | Setting unk_token is not supported, use the default one.
Loading checkpoint shards:   0%|                                                                                                                                       | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:  14%|██████████████████▏                                                                                                            | 1/7 [00:00<00:00,  6.03it/s]
Loading checkpoint shards:  29%|████████████████████████████████████▎                                                                                          | 2/7 [00:00<00:00,  6.10it/s]
Loading checkpoint shards:  43%|██████████████████████████████████████████████████████▍                                                                        | 3/7 [00:00<00:00,  6.22it/s]
Loading checkpoint shards:  57%|████████████████████████████████████████████████████████████████████████▌                                                      | 4/7 [00:00<00:00,  6.24it/s]
Loading checkpoint shards:  71%|██████████████████████████████████████████████████████████████████████████████████████████▋                                    | 5/7 [00:00<00:00,  6.22it/s]
Loading checkpoint shards:  86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                  | 6/7 [00:00<00:00,  6.24it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.36it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.27it/s]
2024-06-20 14:32:53 | ERROR | stderr |
2024-06-20 14:32:53 | INFO | ipex_llm.transformers.utils | Converting the current model to sym_int4 format......
2024-06-20 14:33:30 | INFO | stdout | Convert model to half precision...
2024-06-20 14:33:31 | ERROR | stderr | /home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
2024-06-20 14:33:31 | ERROR | stderr |   warnings.warn(
2024-06-20 14:33:32 | INFO | stdout | <class 'transformers_modules.chatglm3-6b.modeling_chatglm.ChatGLMForConditionalGeneration'>
2024-06-20 14:33:32 | INFO | model_worker | enable benchmark successfully
2024-06-20 14:33:32 | INFO | model_worker | Register to controller
...
2024-06-20 14:37:07,604 - _client.py[line:1027] - INFO: HTTP Request: POST http://127.0.0.1:7861/chat/knowledge_base_chat "HTTP/1.1 200 OK"
/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.chat_models.openai.ChatOpenAI` was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`.
  warn_deprecated(
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.74it/s]
2024-06-20 14:37:08 | INFO | stdout | INFO:     127.0.0.1:50504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-06-20 14:37:08,423 - _client.py[line:1758] - INFO: HTTP Request: POST http://127.0.0.1:20000/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-20 14:37:08 | INFO | httpx | HTTP Request: POST http://127.0.0.1:20002/worker_generate_stream "HTTP/1.1 200 OK"
/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:407: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `1` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.
  warnings.warn(
LLVM ERROR: Diag: aborted
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 155H]
Registry and code: 13 MB
Command: /home/intel/miniconda3/envs/mytest/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=112, pipe_handle=119) --multiprocessing-fork
Uptime: 269.896746 s
2024-06-20 14:37:20 | ERROR | stderr | ERROR:    Exception in ASGI application
2024-06-20 14:37:20 | ERROR | stderr | Traceback (most recent call last):
2024-06-20 14:37:20 | ERROR | stderr |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/responses.py", line 261, in __call__
2024-06-20 14:37:20 | ERROR | stderr |     await wrap(partial(self.listen_for_disconnect, receive))
2024-06-20 14:37:20 | ERROR | stderr |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/responses.py", line 257, in wrap
2024-06-20 14:37:20 | ERROR | stderr |     await func()
2024-06-20 14:37:20 | ERROR | stderr |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/responses.py", line 234, in listen_for_disconnect
2024-06-20 14:37:20 | ERROR | stderr |     message = await receive()
2024-06-20 14:37:20 | ERROR | stderr |               ^^^^^^^^^^^^^^^
2024-06-20 14:37:20 | ERROR | stderr |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 535, in receive
2024-06-20 14:37:20 | ERROR | stderr |     await self.message_event.wait()
2024-06-20 14:37:20 | ERROR | stderr |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/asyncio/locks.py", line 213, in wait
2024-06-20 14:37:20 | ERROR | stderr |     await fut
2024-06-20 14:37:20 | ERROR | stderr | asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fa121b02810
2024-06-20 14:37:20 | ERROR | stderr |
2024-06-20 14:37:20 | ERROR | stderr | During handling of the above exception, another exception occurred:
2024-06-20 14:37:20 | ERROR | stderr |
2024-06-20 14:37:20 | ERROR | stderr |   + Exception Group Traceback (most recent call last):
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 407, in run_asgi
2024-06-20 14:37:20 | ERROR | stderr |   |     result = await app(  # type: ignore[func-returns-value]
2024-06-20 14:37:20 | ERROR | stderr |   |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     return await self.app(scope, receive, send)
2024-06-20 14:37:20 | ERROR | stderr |   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     await super().__call__(scope, receive, send)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/applications.py", line 119, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     await self.middleware_stack(scope, receive, send)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     raise exc
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     await self.app(scope, receive, _send)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/middleware/cors.py", line 83, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     await self.app(scope, receive, send)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
2024-06-20 14:37:20 | ERROR | stderr |   |     raise exc
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
2024-06-20 14:37:20 | ERROR | stderr |   |     await app(scope, receive, sender)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/routing.py", line 762, in __call__
...
2024-06-20 14:37:20 | ERROR | stderr |     +------------------------------------
2024-06-20 14:37:20,032 - utils.py[line:38] - ERROR: peer closed connection without sending complete message body (incomplete chunked read)
Traceback (most recent call last):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpx/_transports/default.py", line 67, in map_httpcore_exceptions
    yield
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpx/_transports/default.py", line 252, in __aiter__
    async for part in self._httpcore_stream:
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 367, in __aiter__
    raise exc from None
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 363, in __aiter__
    async for part in self._stream:
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 349, in __aiter__
    raise exc
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 341, in __aiter__
    async for chunk in self._connection._receive_response_body(**kwargs):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 210, in _receive_response_body
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 220, in _receive_event
    with map_exceptions({h11.RemoteProtocolError: RemoteProtocolError}):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
...
Oscilloscope98 commented 4 months ago

Hi @zcwang,

Please make sure you have created a new conda environment with the latest ipex-llm (>=2.1.0b20240612), and that you are using the latest Langchain-Chatchat repo :)
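
For anyone hitting the same error, a minimal sketch of that reset, assuming a pip-managed install; the environment name is arbitrary, and the extra index URL should be checked against the current ipex-llm XPU install guide:

conda create -n chatchat-test python=3.11
conda activate chatchat-test
pip install --pre --upgrade "ipex-llm[xpu]" --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/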