mit-han-lab / llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MIT License

`RuntimeError: probability tensor contains either `inf`, `nan` or element < 0` when running LLaVA demo #147

Open isaac-vidas opened 8 months ago

isaac-vidas commented 8 months ago

When trying to run the llava_demo.ipynb example, I get the following error:

/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
real weight quantization...(init only): 100%|██████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 78.04it/s]
Loading checkpoint: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.87s/it]
USER: What is unusual about this image?
ASSISTANT: Traceback (most recent call last):
  File "/home/gcpuser/sky_workdir/llava_example.py", line 178, in <module>
    outputs = stream_output(output_stream)
  File "/home/gcpuser/sky_workdir/llava_example.py", line 113, in stream_output
    for outputs in output_stream:
  File "/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/home/gcpuser/sky_workdir/llm-awq/tinychat/stream_generators/llava_stream_gen.py", line 204, in LlavaStreamGenerator
    token = int(torch.multinomial(probs, num_samples=1))
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
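The crash comes from `torch.multinomial` being handed a probability vector that contains `inf`/`nan` entries (here produced by corrupted logits further up the stack). A minimal sketch of a defensive sampling helper is below; `sample_token` is a hypothetical name, not part of llm-awq, and this only avoids the crash rather than fixing the underlying weight problem:

```python
import torch

def sample_token(logits: torch.Tensor) -> int:
    """Sample one token id from a 1-D logits vector, guarding against the
    inf/nan values that trigger:
    'probability tensor contains either inf, nan or element < 0'."""
    if not torch.isfinite(logits).all():
        # Fall back to greedy decoding, excluding every non-finite logit.
        masked = torch.where(
            torch.isfinite(logits),
            logits,
            torch.full_like(logits, float("-inf")),
        )
        return int(torch.argmax(masked))
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```

If this guard fires on every step, the logits themselves are broken (e.g. by mis-packed weights) and the checkpoint should be re-packed or re-quantized.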

The behavior is slightly different depending on whether I also generate the quantized checkpoints as part of the notebook. In some combinations, I also get the following warning in the output:

[Warning] The awq quantized checkpoint seems to be in v1 format.
If the model cannot be loaded successfully, please use the latest awq library to re-quantized the model, or repack the current checkpoint with tinychat/offline-weight-repacker.py

I've tried with both a newer version of transformers and an older one (4.32.0). Any idea how to get the demo running?

Also, I was able to run the TheBloke/llava-v1.5-13B-AWQ checkpoint on both vllm and sglang. Is that version considered v1 format?

Thanks in advance!

isaac-vidas commented 8 months ago

Update: I was able to run the AWQ version with the vlm_demo.py script. I had to update llm-awq/tinychat/utils/prompt_templates.py in order to support llava, but aside from that it works:

python vlm_demo.py \
    --model_type llava  \
    --model-path ~/llava-v1.5-7b  \
    --quant-path ~/quant_cache/llava-v1.5-7b-w4-g128-awq-v2.pt  \
    --image-file https://llava.hliu.cc/file=/nobackup/haotian/tmp/gradio/ca10383cc943e99941ecffdc4d34c51afb2da472/extreme_ironing.jpg

/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
real weight quantization...(init only): 100%|██████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 83.69it/s]
Loading checkpoint: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.57s/it]
==================================================
USER: what is in the picture?
--------------------------------------------------
ASSISTANT: The image features a man standing on the back of a yellow truck, holding a clothes iron. The truck is driving down a busy city street, with other vehicles such as a taxi and a car visible in the scene. The man appears to be ironing clothes while riding in the back of the truck.
==================================================
USER: where has this picture been taken?
--------------------------------------------------
ASSISTANT: This picture has been taken in a busy city street, with various vehicles and pedestrians present.
kentang-mit commented 8 months ago

Hi @isaac-vidas,

We've changed the weight packing format in our latest PR, which significantly improves TinyChat's context-stage and decoding latency. As a result, weights generated with commits prior to that PR need to be re-packed; Shang has implemented a script for this (tinychat/offline-weight-repacker.py). So I believe the first error you saw is caused by the weight packing format. Cc @ys-2020.

Best, Haotian