Bypassed the above issue by manually adding setuptools via pip, but OpenVINO NNCF is failing to run.
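The setuptools workaround was, in effect:

pip install setuptools

With that in place, the server launch still fails during engine initialization: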
bash launch_model_server.sh -m "OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov"
WARNING 06-20 17:35:32 ray_utils.py:62] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install Ray with `pip install ray`.
INFO 06-20 17:35:36 api_server.py:241] vLLM API server version 0.3.3
INFO 06-20 17:35:36 api_server.py:242] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, seed=0, swap_space=50, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=True, max_log_len=None)
INFO 06-20 17:35:36 llm_engine.py:67] Initializing an LLM engine (v0.3.3) with config: model='OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov', tokenizer='OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, seed=0)
OpenVINO Tokenizer version is not compatible with OpenVINO version. Installed OpenVINO version: 2024.3.0, OpenVINO Tokenizers requires . OpenVINO Tokenizers models will not be added during export.
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 250, in query
input should be 2, but it is 3.
[ INFO ] OpenVINO IR is available for provided model id OpenVINO/TinyLlama-1.1B-Chat-v1.0-int8-ov. This IR will be used for inference as-is; all possible options that may affect model conversion are ignored. TRANSFORMING OPTIMUM-INTEL MODEL TO vLLM COMPATIBLE FORM
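One observation on the tokenizer warning above: openvino-tokenizers releases are versioned in lockstep with openvino, so a plausible fix (my assumption, not verified here) is to pin both packages to the same release:

pip install "openvino==2024.3.0" "openvino-tokenizers==2024.3.*"

That should clear the compatibility warning, though it may not be the cause of the traceback itself.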
This forked repo is no longer maintained by the OpenVINO (OV) team. We had a sync-up with the OV team, and they have already created a PR against the vLLM mainline. Once the PR is merged, I will update genAIComps with the new changes.
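Once the PR is merged, building mainline vLLM with the OpenVINO backend should look roughly like the sketch below; the requirements file name and the VLLM_TARGET_DEVICE variable follow vLLM's usual from-source install flow and are assumptions until the PR actually lands.

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install --upgrade pip
# Assumed requirements file for the OpenVINO backend
pip install -r requirements-openvino.txt
# Select the OpenVINO build target instead of the default CUDA one
VLLM_TARGET_DEVICE=openvino python -m pip install -v .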
vLLM-OpenVINO is failing to build