huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Mixtral quantization hard-freezes Python #1655

Closed · rosario-purple closed this issue 7 months ago

rosario-purple commented 8 months ago

System Info

- `transformers` version: 4.36.2
- `optimum` version: 1.16.2
- Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.0
- Accelerate version: 0.26.1
- Accelerate config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - fsdp_config: {'fsdp_auto_wrap_policy': 'SIZE_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_min_num_params': 100000000, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_use_orig_params': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
- Jax version: 0.4.21
- JaxLib version: 0.4.21
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes

Who can help?

@philschmid

Reproduction (minimal, reproducible, runnable)

Running the following Python code to quantize Mixtral hard-freezes Python: it never completes, does not respond to Ctrl-C, and the only way to stop it is kill -9:

from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer, load_quantized_model
import torch

model_name = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Quantize to 4 bits with GPTQ, using wikitext2 as the calibration dataset
quantizer = GPTQQuantizer(bits=4, dataset="wikitext2", use_cuda_fp16=True)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized weights to disk
quantizer.save(model, "/scratch/brr/mixtral-gptq")

Expected behavior

The quantization should run to completion and save the model without freezing Python.

fxmarty commented 7 months ago

Thank you @rosario-purple, I'll look into it. Could you share which GPU you are using and how much RAM you have? cc @SunMarc in case you have an idea.

rosario-purple commented 7 months ago

@fxmarty This server has 8x A100 80 GB GPUs and 1 TB of main RAM.

SunMarc commented 7 months ago

Hi @rosario-purple, I've opened a PR that should solve this issue. The hard freeze comes from dataset preparation taking too long: the Mixtral tokenizer is too slow to tokenize the whole wikitext2 calibration dataset in a reasonable time.
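
In the meantime, one way to sidestep the slow tokenization is to pass a small list of calibration texts to GPTQQuantizer instead of the "wikitext2" shortcut. This is a minimal sketch, assuming the dataset argument accepts a list of raw strings (as the GPTQQuantizer docstring describes); the sample count of 256 is an arbitrary choice:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch

model_name = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Pre-select a small number of calibration samples so only a limited amount of
# text has to pass through the (slow) Mixtral tokenizer.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_texts = [t for t in raw["text"] if t.strip()][:256]  # 256 samples is an arbitrary example

# Passing a list of strings avoids tokenizing the full wikitext2 split inside the quantizer
quantizer = GPTQQuantizer(bits=4, dataset=calibration_texts, use_cuda_fp16=True)
quantized_model = quantizer.quantize_model(model, tokenizer)

This only tokenizes the selected samples during calibration, so it should avoid the apparent hang while keeping the rest of the workflow unchanged.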

rosario-purple commented 7 months ago

@SunMarc Thank you!