Open gyin94 opened 2 months ago
cc @youkaichao
cc @ywang96 @DarkLight1337
can we download this code when we download the model?
In https://docs.vllm.ai/en/latest/getting_started/debugging.html, I recommend that users use huggingface-cli
to download models first. If possible, we can also recommend that users download these scripts too.
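Something like the following could work as the pre-download step (a minimal sketch, not taken from the linked docs page; `snapshot_download` pulls the whole repo, which should include the remote-code `*.py` files):

```python
# Minimal sketch (assumption, not from the vLLM docs): pre-download the full
# model repo, including the remote processing code, before starting vLLM.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="microsoft/Phi-3.5-vision-instruct")
```

Running the equivalent `huggingface-cli download microsoft/Phi-3.5-vision-instruct` beforehand should have the same effect on recent huggingface_hub versions.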
Downloading and loading code at runtime is quite complicated, and it can easily break distributed inference with multi-GPU or multi-node setups.
Would it be ok if we download this at the model loading stage? Currently, it's done at the profiling stage after the model runners are initialized, which may be causing the problem.
@gyin94 does it occur in just multi-gpu inference, or in multi-node inference?
It happened for single-node multi-GPU as well. I used what you suggested and preloaded the processor before running vllm serve, and the error is gone. Though it would be better to solve it the same way tokenizer and model loading are handled in vllm serve.
> preloaded the processor before running vllm serve
can you give more details on how to do this? we can add it to the doc.
My solution needs a separate Python script that runs before vllm serve:
```python
from transformers import AutoConfig, AutoProcessor

model_dir = "microsoft/Phi-3.5-vision-instruct"
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)

if config.model_type == "phi3_v":
    # A temporary fix for phi-3-vision: load the processor once so it is
    # downloaded and cached before vllm serve starts.
    AutoProcessor.from_pretrained(model_dir, trust_remote_code=True, num_crops=4)
```
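Running this once before `vllm serve` (and, presumably, on every node in a multi-node setup) warms the local cache, so the workers no longer try to download the processor code at profiling time.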
Your current environment
vllm == 0.5.5
🐛 Describe the bug
When we deploy microsoft/Phi-3.5-vision-instruct, it will randomly hit this issue.
The problem might be caused by this line: https://github.com/vllm-project/vllm/blob/80162c44b1d1e59a2c10f65b6adb9b0407439b1f/vllm/multimodal/image.py#L16, in a multi-GPU environment where the head process hasn't yet finished downloading. Is it better to put it where AutoTokenizer is run?
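A hypothetical sketch of that suggestion (this is not vLLM's actual code, and the function name is made up): pre-fetch the processor at the same point the tokenizer is loaded, on the driver, so the GPU workers only ever hit a warm cache at profiling time.

```python
# Hypothetical sketch, not vLLM internals: warm the processor cache where
# the tokenizer is first loaded, before the model runners start profiling.
from transformers import AutoProcessor, AutoTokenizer

def load_tokenizer_and_warm_processor(model: str, trust_remote_code: bool = True):
    tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=trust_remote_code)
    # Fetching here avoids several workers racing to download the remote
    # processor code later, at the profiling stage.
    AutoProcessor.from_pretrained(model, trust_remote_code=trust_remote_code)
    return tokenizer
```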