intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

IPEX-LLM with Langchain-chatchat runs into httpcore.RemoteProtocolError in MTL with iGPU #11259

Open zcwang opened 5 months ago

zcwang commented 5 months ago

Hello, I am running the chatglm3-6b LLM with Langchain-Chatchat on the iGPU of my MTL Core Ultra 7 155H, and it keeps hitting the following error.

2024-06-07 13:50:11,037 - utils.py[line:38] - ERROR: peer closed connection without sending complete message body (incomplete chunked read)
Traceback (most recent call last):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpx/_transports/default.py", line 67, in map_httpcore_exceptions
    yield
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpx/_transports/default.py", line 252, in __aiter__
    async for part in self._httpcore_stream:
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 367, in __aiter__
    raise exc from None
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 363, in __aiter__
    async for part in self._stream:
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 349, in __aiter__
    raise exc
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 341, in __aiter__
    async for chunk in self._connection._receive_response_body(**kwargs):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 210, in _receive_response_body
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 220, in _receive_event
    with map_exceptions({h11.RemoteProtocolError: RemoteProtocolError}):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)

The generated logs are provided above.

Test Environment:

(mytest) intel@mydevice:~/work/Langchain-Chatchat/logs$ python -c "from openvino import Core; print(Core().available_devices);"
['CPU', 'GPU', 'NPU']
(mytest) intel@mydevice:~/work/Langchain-Chatchat/logs$ python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
OPENMP DISPLAY ENVIRONMENT END
2.1.0.post0+cxx11.abi
2.1.20+git0e2bee2
[0]: _DeviceProperties(name='Intel(R) Arc(TM) Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=59466MB, max_compute_units=128, gpu_eu_count=128)
(mytest) intel@mydevice:~/work/Langchain-Chatchat/logs$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 155H OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO  [24.13.29138.7]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.29138]

BTW, warmup.py itself runs fine on the iGPU (a rough sketch of what such a warm-up does follows the log below)...

(mytest) intel@mydevice:~/work/Langchain-Chatchat$ python warmup.py
2024-06-07 14:05:32,097 - INFO - intel_extension_for_pytorch auto imported
>> NOTE: The one-time warmup may take several minutes. Please be patient until it finishes warm-up...
---------------  Start warming-up LLM chatglm3-6b on MTL iGPU  ---------------
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 11.85it/s]
2024-06-07 14:05:32,770 - INFO - Converting the current model to sym_int4 format......
2024-06-07 14:05:38,036 - WARNING - Setting eos_token is not supported, use the default one.
2024-06-07 14:05:38,036 - WARNING - Setting pad_token is not supported, use the default one.
2024-06-07 14:05:38,036 - WARNING - Setting unk_token is not supported, use the default one.
---------------  Warming-up of LLM chatglm3-6b on MTL iGPU is completed (1/4)  ---------------
---------------  Start warming-up embedding model bge-large-zh-v1.5 on MTL iGPU  ---------------
/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
2024-06-07 14:05:44,494 - INFO - Converting the current model to fp16 format......
/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
---------------  Warming-up of embedding model bge-large-zh-v1.5 on MTL iGPU is completed (3/4)  ---------------

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 155H]
Registry and code: 13 MB
Command: python warmup.py
Uptime: 14.982780 s
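
For context, a minimal sketch of what such a warm-up amounts to with ipex-llm. This is an illustration, not the actual Langchain-Chatchat warmup.py, and the model path is a placeholder: load the model in 4-bit, move it to the XPU, and run one short generation so the SYCL kernels get compiled and cached.

import torch
from ipex_llm.transformers import AutoModel      # ChatGLM models use AutoModel in ipex-llm examples
from transformers import AutoTokenizer

model_path = "THUDM/chatglm3-6b"                 # placeholder: local path or HF model id
model = AutoModel.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
model = model.to("xpu")                          # move the 4-bit model to the MTL iGPU
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    inputs = tokenizer("hello", return_tensors="pt").to("xpu")
    # the first generate call triggers SYCL kernel compilation and caching
    model.generate(**inputs, max_new_tokens=8)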
Oscilloscope98 commented 5 months ago

Hi @zcwang, we are reproducing this issue and will let you know when there are any updates :)

Oscilloscope98 commented 5 months ago

Hi @zcwang,

This is because the one-time warm-up for the SYCL cache on MTL iGPU does not actually take effect on Linux.

We have just updated the Langchain-Chatchat Setup Guide for Linux with Intel Core Ultra integrated GPU. Please follow this guide and try again with our latest ipex-llm (>=2.1.0b20240612).

Please note that, unlike the one-time warm-up on MTL iGPU for Windows, on Linux the warm-up of the LLM model is performed when you start the first conversation, and the warm-up of the embedding model happens either when you create a knowledge base or when you start the first Knowledge Base QA/File Chat conversation. So please expect a warm-up of several minutes during your first conversation with an LLM model, or when you create a new knowledge base with an embedding model.

Please let us know if you run into any further problems :)
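
For reference, one quick way to confirm that the active environment actually picked up a new enough ipex-llm (assuming a pip-managed install in the current conda env):

pip show ipex-llm | grep -i version    # expect a version >= 2.1.0b20240612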

zcwang commented 4 months ago

@Oscilloscope98, I skipped the warm-up phase by directly running "python startup.py -a" with the following environment settings, but it still fails.

...
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
export BIGDL_IMPORT_IPEX=0
export no_proxy=localhost,127.0.0.1
export FASTCHAT_WORKER_API_TIMEOUT=600
...

Here is the error log.

...
2024-06-20 14:32:51 | INFO | model_worker | Loading the model ['chatglm3-6b'] on worker bcb2cd49, worker type: BigDLLLM worker...
2024-06-20 14:32:51 | INFO | model_worker | Using low bit format: sym_int4, device: xpu
2024-06-20 14:32:51 | WARNING | transformers_modules.chatglm3-6b.tokenization_chatglm | Setting eos_token is not supported, use the default one.
2024-06-20 14:32:51 | WARNING | transformers_modules.chatglm3-6b.tokenization_chatglm | Setting pad_token is not supported, use the default one.
2024-06-20 14:32:51 | WARNING | transformers_modules.chatglm3-6b.tokenization_chatglm | Setting unk_token is not supported, use the default one.
Loading checkpoint shards:   0%|                                                                                                                                       | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:  14%|██████████████████▏                                                                                                            | 1/7 [00:00<00:00,  6.03it/s]
Loading checkpoint shards:  29%|████████████████████████████████████▎                                                                                          | 2/7 [00:00<00:00,  6.10it/s]
Loading checkpoint shards:  43%|██████████████████████████████████████████████████████▍                                                                        | 3/7 [00:00<00:00,  6.22it/s]
Loading checkpoint shards:  57%|████████████████████████████████████████████████████████████████████████▌                                                      | 4/7 [00:00<00:00,  6.24it/s]
Loading checkpoint shards:  71%|██████████████████████████████████████████████████████████████████████████████████████████▋                                    | 5/7 [00:00<00:00,  6.22it/s]
Loading checkpoint shards:  86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                  | 6/7 [00:00<00:00,  6.24it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.36it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.27it/s]
2024-06-20 14:32:53 | ERROR | stderr |
2024-06-20 14:32:53 | INFO | ipex_llm.transformers.utils | Converting the current model to sym_int4 format......
2024-06-20 14:33:30 | INFO | stdout | Convert model to half precision...
2024-06-20 14:33:31 | ERROR | stderr | /home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
2024-06-20 14:33:31 | ERROR | stderr |   warnings.warn(
2024-06-20 14:33:32 | INFO | stdout | <class 'transformers_modules.chatglm3-6b.modeling_chatglm.ChatGLMForConditionalGeneration'>
2024-06-20 14:33:32 | INFO | model_worker | enable benchmark successfully
2024-06-20 14:33:32 | INFO | model_worker | Register to controller
...
2024-06-20 14:37:07,604 - _client.py[line:1027] - INFO: HTTP Request: POST http://127.0.0.1:7861/chat/knowledge_base_chat "HTTP/1.1 200 OK"
/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.chat_models.openai.ChatOpenAI` was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`.
  warn_deprecated(
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.74it/s]
2024-06-20 14:37:08 | INFO | stdout | INFO:     127.0.0.1:50504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-06-20 14:37:08,423 - _client.py[line:1758] - INFO: HTTP Request: POST http://127.0.0.1:20000/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-20 14:37:08 | INFO | httpx | HTTP Request: POST http://127.0.0.1:20002/worker_generate_stream "HTTP/1.1 200 OK"
/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:407: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `1` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.
  warnings.warn(
LLVM ERROR: Diag: aborted
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 155H]
Registry and code: 13 MB
Command: /home/intel/miniconda3/envs/mytest/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=112, pipe_handle=119) --multiprocessing-fork
Uptime: 269.896746 s
2024-06-20 14:37:20 | ERROR | stderr | ERROR:    Exception in ASGI application
2024-06-20 14:37:20 | ERROR | stderr | Traceback (most recent call last):
2024-06-20 14:37:20 | ERROR | stderr |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/responses.py", line 261, in __call__
2024-06-20 14:37:20 | ERROR | stderr |     await wrap(partial(self.listen_for_disconnect, receive))
2024-06-20 14:37:20 | ERROR | stderr |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/responses.py", line 257, in wrap
2024-06-20 14:37:20 | ERROR | stderr |     await func()
2024-06-20 14:37:20 | ERROR | stderr |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/responses.py", line 234, in listen_for_disconnect
2024-06-20 14:37:20 | ERROR | stderr |     message = await receive()
2024-06-20 14:37:20 | ERROR | stderr |               ^^^^^^^^^^^^^^^
2024-06-20 14:37:20 | ERROR | stderr |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 535, in receive
2024-06-20 14:37:20 | ERROR | stderr |     await self.message_event.wait()
2024-06-20 14:37:20 | ERROR | stderr |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/asyncio/locks.py", line 213, in wait
2024-06-20 14:37:20 | ERROR | stderr |     await fut
2024-06-20 14:37:20 | ERROR | stderr | asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fa121b02810
2024-06-20 14:37:20 | ERROR | stderr |
2024-06-20 14:37:20 | ERROR | stderr | During handling of the above exception, another exception occurred:
2024-06-20 14:37:20 | ERROR | stderr |
2024-06-20 14:37:20 | ERROR | stderr |   + Exception Group Traceback (most recent call last):
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 407, in run_asgi
2024-06-20 14:37:20 | ERROR | stderr |   |     result = await app(  # type: ignore[func-returns-value]
2024-06-20 14:37:20 | ERROR | stderr |   |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     return await self.app(scope, receive, send)
2024-06-20 14:37:20 | ERROR | stderr |   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     await super().__call__(scope, receive, send)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/applications.py", line 119, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     await self.middleware_stack(scope, receive, send)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     raise exc
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     await self.app(scope, receive, _send)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/middleware/cors.py", line 83, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     await self.app(scope, receive, send)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
2024-06-20 14:37:20 | ERROR | stderr |   |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
2024-06-20 14:37:20 | ERROR | stderr |   |     raise exc
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
2024-06-20 14:37:20 | ERROR | stderr |   |     await app(scope, receive, sender)
2024-06-20 14:37:20 | ERROR | stderr |   |   File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/starlette/routing.py", line 762, in __call__
...
2024-06-20 14:37:20 | ERROR | stderr |     +------------------------------------
2024-06-20 14:37:20,032 - utils.py[line:38] - ERROR: peer closed connection without sending complete message body (incomplete chunked read)
Traceback (most recent call last):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpx/_transports/default.py", line 67, in map_httpcore_exceptions
    yield
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpx/_transports/default.py", line 252, in __aiter__
    async for part in self._httpcore_stream:
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 367, in __aiter__
    raise exc from None
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/connection_pool.py", line 363, in __aiter__
    async for part in self._stream:
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 349, in __aiter__
    raise exc
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 341, in __aiter__
    async for chunk in self._connection._receive_response_body(**kwargs):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 210, in _receive_response_body
    event = await self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_async/http11.py", line 220, in _receive_event
    with map_exceptions({h11.RemoteProtocolError: RemoteProtocolError}):
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/intel/miniconda3/envs/mytest/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
...
Oscilloscope98 commented 4 months ago

Hi @zcwang,

Please make sure you have created a new conda environment with the latest ipex-llm (>=2.1.0b20240612), and that you are using the latest Langchain-Chatchat repo :)
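
For anyone hitting the same error, a minimal sketch of that reset, assuming a pip-managed install; the environment name is arbitrary, and the extra index URL should be checked against the current ipex-llm XPU install guide:

conda create -n chatchat-test python=3.11
conda activate chatchat-test
pip install --pre --upgrade "ipex-llm[xpu]" --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/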