Open thalapandi opened 2 months ago
Are you using a custom quantized model? I don't see it on HuggingFace.
I am only using the Phi-3.5-vision-instruct model and want to run it in vLLM with 4-bit quantization. One more doubt I have: can I use the following engine configuration for the Phi-3.5 model?
like this """ Saves each worker's model state dict directly to a checkpoint, which enables a fast load path for large tensor-parallel models where each worker only needs to read its own shard rather than the entire checkpoint.
Example usage:
python save_sharded_state.py \ --model /path/to/load \ --quantization deepspeedfp \ --tensor-parallel-size 8 \ --output /path/to/save
Then, the model can be loaded with
llm = LLM( model="/path/to/save", load_format="sharded_state", quantization="deepspeedfp", tensor_parallel_size=8, ) """ import dataclasses import os import shutil from pathlib import Path
from vllm import LLM, EngineArgs from vllm.utils import FlexibleArgumentParser
parser = FlexibleArgumentParser() EngineArgs.add_cli_args(parser) parser.add_argument("--output", "-o", required=True, type=str, help="path to output checkpoint") parser.add_argument("--file-pattern", type=str, help="string pattern of saved filenames") parser.add_argument("--max-file-size", type=str, default=5 * 1024**3, help="max size (in bytes) of each safetensors file")
def main(args): engine_args = EngineArgs.from_cli_args(args) if engine_args.enable_lora: raise ValueError("Saving with enable_lora=True is not supported!") model_path = engine_args.model if not Path(model_path).is_dir(): raise ValueError("model path must be a local directory")
llm = LLM(**dataclasses.asdict(engine_args))
# Prepare output directory
Path(args.output).mkdir(exist_ok=True)
# Dump worker states to output directory
model_executor = llm.llm_engine.model_executor
model_executor.save_sharded_state(path=args.output,
pattern=args.file_pattern,
max_size=args.max_file_size)
# Copy metadata files to output directory
for file in os.listdir(model_path):
if os.path.splitext(file)[1] not in (".bin", ".pt", ".safetensors"):
if os.path.isdir(os.path.join(model_path, file)):
shutil.copytree(os.path.join(model_path, file),
os.path.join(args.output, file))
else:
shutil.copy(os.path.join(model_path, file), args.output)
if name == "main": args = parser.parse_args() main(args)
@Isotr0py are you familiar with this?
I'm not sure what "int4 quantization" exactly means here, because it seems there is no BNB 4-bit quantized Phi3-V model released on HF. (The code given above uses deepspeedfp quantization, which is fp6/fp8 quantization.)
If "int4 quantization" just means 4-Bit quantization, Phi-3.5-vision-instruct-AWQ with awq quantization should work on VLLM.
How many GPUs are needed to run the AWQ-quantized model? Also, is it possible to run TensorRT in vLLM? If so, is there any documentation for Phi-3.5-vision-instruct?
It takes about 4 GB of VRAM to run the 4-bit AWQ-quantized Phi-3.5-vision-instruct.
BTW, the AWQ model I uploaded is calibrated with the default dataset in autoawq, because I only used it to check code consistency. You should calibrate from the source model with your own dataset to get better quality.
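A rough sketch of what custom-dataset calibration with AutoAWQ could look like, assuming your AutoAWQ version supports this architecture; the output path and calibration texts are placeholders:

```python
# Sketch only: quantize Phi-3.5-vision-instruct to 4-bit AWQ with a custom calibration set.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "microsoft/Phi-3.5-vision-instruct"
quant_path = "Phi-3.5-vision-instruct-awq"   # hypothetical output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Replace the default calibration data with text drawn from your own domain.
calib_texts = ["a sample passage from your target domain ...",
               "another representative passage ..."]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The resulting directory can then be loaded in vLLM as shown above.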
I think vLLM can't run TensorRT currently. (FYI, https://github.com/vllm-project/vllm/issues/5134#issuecomment-2139618073)
ok
Does this work for you?
The model to consider.
The closest model vllm already supports.
The Phi-3.5-vision-instruct model; I need a reference for this.
What's your difficulty of supporting the model you want?
The documentation does not contain any information about quantization for the Phi-3.5-vision-instruct model.