mit-han-lab / llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MIT License

Fix memory issue when running `run_awq` #145

Closed · isaac-vidas closed 8 months ago

isaac-vidas commented 8 months ago

Following up on #144.

@casper-hansen's suggestion worked, so I added `use_cache=False` when the model is created. I added this to the `entry.py` code that loads the LLaVA model, which avoids the memory issue described in #144.
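For context, here is a minimal sketch of the idea behind the fix, using generic Hugging Face APIs for illustration; the model path and the `AutoModelForCausalLM` loader are stand-ins, and the actual change lives in the LLaVA loading path of `awq/entry.py`:

from transformers import AutoConfig, AutoModelForCausalLM

model_path = "/path/to/llava-v1.5-7b"  # hypothetical path

# Disable the KV cache so past key/value tensors are not retained across
# the AWQ calibration forward passes; accumulating them is what ran the
# GPU out of memory during the scale search.
config = AutoConfig.from_pretrained(model_path)
config.use_cache = False

model = AutoModelForCausalLM.from_pretrained(model_path, config=config)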

After making this change in my environment and rerunning the command, it completed without any issues:

$ python -m awq.entry \
    --model_path /home/gcpuser/sky_workdir/llava-v1.5-7b \
    --w_bit 4 \
    --q_group_size 128 \
    --run_awq \
    --dump_awq /home/gcpuser/sky_workdir/awq_cache/llava-v1.5-7b-w4-g128.pt

Quantization config: {'zero_point': True, 'q_group_size': 128}
* Building model /home/gcpuser/sky_workdir/llava-v1.5-7b
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  5.61it/s]
/opt/conda/envs/quantize_llava/lib/python3.10/site-packages/huggingface_hub/repocard.py:105: UserWarning: Repo card metadata block was not found. Setting CardData to empty.
  warnings.warn("Repo card metadata block was not found. Setting CardData to empty.")
Token indices sequence length is longer than the specified maximum sequence length for this model (8322 > 2048). Running this sequence through the model will result in indexing errors
 * Split into 65 blocks
Running AWQ...: 100%|██████████| 32/32 [08:33<00:00, 16.04s/it]
AWQ results saved at /home/gcpuser/sky_workdir/awq_cache/llava-v1.5-7b-w4-g128.pt
ys-2020 commented 8 months ago

This PR fixes a potential OOM issue when searching for AWQ quantization scales for LLaVA-family models (#144). Many thanks to @isaac-vidas and @casper-hansen!

Could you please review this PR and merge it into the main branch? Thanks @kentang-mit, @tonylins, @Sakits.

isaac-vidas commented 8 months ago

I don't have permission to merge this PR. @ys-2020, I think you can probably do it 😄