opea-project / GenAIComps

GenAI components at micro-service level; GenAI service composer to create mega-service
Apache License 2.0

Unable to build vLLM-OpenVINO component #214

Closed: avinashkarani closed this issue 2 months ago

avinashkarani commented 2 months ago

[screenshot: build failure output]

The vLLM-OpenVINO component is failing to build.

avinashkarani commented 2 months ago

Bypassed the above issue by manually adding setuptools via pip, but OpenVINO NNCF is failing to run:

bash launch_model_server.sh -m "OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov"

WARNING 06-20 17:35:32 ray_utils.py:62] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install Ray with pip install ray.
INFO 06-20 17:35:36 api_server.py:241] vLLM API server version 0.3.3
INFO 06-20 17:35:36 api_server.py:242] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, seed=0, swap_space=50, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=True, max_log_len=None)
INFO 06-20 17:35:36 llm_engine.py:67] Initializing an LLM engine (v0.3.3) with config: model='OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov', tokenizer='OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, seed=0)
OpenVINO Tokenizer version is not compatible with OpenVINO version. Installed OpenVINO version: 2024.3.0, OpenVINO Tokenizers requires . OpenVINO Tokenizers models will not be added during export.
INFO:nncf:NNCF initialized successfully.
Supported frameworks detected: torch, onnx, openvino

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 250, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 344, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 309, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 415, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 101, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/openvino_executor.py", line 481, in __init__
    self._init_worker()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/openvino_executor.py", line 501, in _init_worker
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/openvino_executor.py", line 261, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 89, in load_model
    self.model = get_model(self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/utils.py", line 53, in get_model
    return get_model_fn(model_config, device_config, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/openvino_model_loader.py", line 612, in get_model
    patch_stateful_model(pt_model.model, kv_cache_dtype, device_config.device.type == "cpu")
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/openvino_model_loader.py", line 365, in patch_stateful_model
    m.run_passes(model)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/openvino_model_loader.py", line 227, in callback
    paged_attention = ov.runtime.op._PagedAttentionExtension(arguments_as_outputs([
RuntimeError: Check 'get_input_partial_shape(0).rank().is_dynamic() || get_input_partial_shape(0).rank().get_length() == 2' failed at src/core/src/op/paged_attention.cpp:25:
While validating node 'extension::PagedAttentionExtension PagedAttentionExtension_33455 (opset1::Reshape Reshape_33433[0]:f32[?,?,2048], opset1::Reshape Reshape_33437[0]:f32[?,?,256], opset1::Reshape Reshape_33441[0]:f32[?,?,256], opset1::Parameter Parameter_33428[0]:bf16[?,?,?,?], opset1::Parameter Parameter_33429[0]:bf16[?,?,?,?], opset1::Parameter is_prompt[0]:boolean[], opset1::Parameter slot_mapping[0]:i64[?,?], opset1::Parameter max_context_len[0]:i64[], opset1::Parameter context_lens[0]:i64[?], opset1::Parameter block_tables[0]:i32[?,?], opset1::Divide Divide_33452[0]:f32[], opset1::Constant Constant_33454[0]:f32[0], opset1::Constant Constant_33157[0]:i32[]) -> (dynamic[...])' with friendly_name 'PagedAttentionExtension_33455':
Rank of query input should be 2, but it is 3.

[ INFO ] OpenVINO IR is avaialble for provided model id OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov. This IR will be used for inference as-is, all possible options that may affect model conversion are ignored.
TRANSFORMING OPTIMUM-INTEL MODEL TO vLLM COMPATIBLE FORM
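For anyone hitting the same build failure, a minimal sketch of the setuptools workaround described above, assuming the failure is a missing-setuptools error during the component's pip dependency install; the exact Dockerfile line and requirements file are assumptions, not taken from this repo:

```bash
# Hypothetical workaround sketch: make setuptools/wheel available before the
# component's Python dependencies are built from source.
# If the component is built from a Dockerfile, the equivalent is adding a line like
#   RUN pip install --no-cache-dir --upgrade pip setuptools wheel
# before the existing "pip install -r requirements.txt" step (path assumed).
python3 -m pip install --upgrade pip setuptools wheel

# Then retry launching the model server from the component directory.
bash launch_model_server.sh -m "OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov"
```

Note that this only works around the build-time failure; the PagedAttentionExtension runtime error above appears to be a separate problem in the forked OpenVINO vLLM branch (see the next comment).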

zahidulhaque commented 2 months ago

This forked repo is no longer maintained by the OpenVINO (OV) team. We had a sync-up with the OV team, and they have already created a PR against the vLLM mainline. Once that PR is merged, I will update GenAIComps with the new changes.