@movchan74 thoughts?
Also, the async engine works really well since they separated out the server and API engine. For example, it is much easier to implement streaming behaviour, e.g.:
```python
import uuid

from transformers import Qwen2VLProcessor
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine_args = AsyncEngineArgs(
    model="Qwen/Qwen2-VL-2B-Instruct",
    limit_mm_per_prompt={"image": 3, "video": 3},
    gpu_memory_utilization=0.9,
)
vllm_vl_engine = AsyncLLMEngine.from_engine_args(engine_args)

min_pixels = 224 * 224
max_pixels = 1024 * 1024
# Use the same checkpoint as the engine (the original snippet pointed
# at the 7B processor while the engine loads the 2B model).
vl_model_processor = Qwen2VLProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

async def stream_generate(text, mm_data):
    outputs_generator = vllm_vl_engine.generate(
        prompt={"prompt": text, "multi_modal_data": mm_data},
        sampling_params=SamplingParams(max_tokens=1024, temperature=0.0),
        request_id=str(uuid.uuid4()),
    )
    already_generated = 0
    async for output in outputs_generator:
        generated_so_far = already_generated
        already_generated = len(output.outputs[0].text)
        # Yield only the newly generated suffix on each iteration.
        yield output.outputs[0].text[generated_so_far:]
```
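For illustration, here is a minimal sketch of how the stream above could be consumed end to end. The message content and the `example.jpg` path are assumptions for the example; the chat-template call uses the processor defined above.

```python
import asyncio

from PIL import Image

async def main():
    # Build the prompt with the processor's chat template; the image
    # placeholder in the messages is expanded into Qwen2-VL vision tokens.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }]
    text = vl_model_processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # vLLM accepts a PIL image directly as multi-modal data.
    mm_data = {"image": Image.open("example.jpg")}  # hypothetical local file

    # Print each newly generated chunk as it streams in.
    async for chunk in stream_generate(text, mm_data):
        print(chunk, end="", flush=True)

asyncio.run(main())
```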
We are already on 0.6: the main branch requires vllm>=0.6.1.post2, and the poetry lock pins 0.6.2. I can update the poetry lock to 0.6.3, but we will not see any significant performance boost since we are already on 0.6.
Also, we have been using the async API since the beginning.
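(For a quick sanity check that an environment actually satisfies the constraint above, something like this sketch works; the assertion is purely illustrative and not part of the SDK.)

```python
# Illustrative check that the installed vLLM meets the main-branch constraint.
import vllm
from packaging.version import Version

installed = Version(vllm.__version__)
assert installed >= Version("0.6.1.post2"), (
    f"aana_sdk main requires vllm>=0.6.1.post2, found {installed}"
)
print(f"vllm {installed} OK")
```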
Ok, then let us keep it as it is. I will close the ticket.
I am also assuming the current release is still below 0.6 and will be upgraded in the next release.
vLLM 0.6.0 claims a major uplift in speed and performance (https://blog.vllm.ai/2024/09/05/perf-update.html). This is consistent with my observations from runs on A100 instances on Vast.ai, and it supports multimodal models better (i.e., no particular version of transformers is required for models like Qwen2-VL). So I suggest we upgrade to vLLM 0.6.3 (the latest PyPI release at the time of writing) or newer.
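To make the multimodal point concrete, here is a minimal sketch of offline Qwen2-VL inference on vLLM 0.6.x with a stock transformers release; the raw Qwen2-VL chat-format prompt and `example.jpg` are illustrative assumptions.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Runs with a regular transformers release on vLLM >= 0.6.x; no pinned
# dev build of transformers is needed for Qwen2-VL.
llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 1})

# Raw Qwen2-VL chat format with a single image placeholder.
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)
image = Image.open("example.jpg")  # hypothetical local file

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=256, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```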