intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that can easily obtain performance on Intel platform
Apache License 2.0

RuntimeError: XPU out of memory. #672

Open rohitpreddy07 opened 2 weeks ago

rohitpreddy07 commented 2 weeks ago

Describe the issue

I am trying to replicate the following: https://intel.github.io/intel-extension-for-pytorch/llm/llama3/xpu/ . While running the run_generation_gpu_woq_for_llama.py script with the --benchmark, --accuracy, --trust_remote_code, and --profile_token_latency flags, I am running into an error.
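For reference, the invocation looks roughly like this (flag spellings assembled from my description above; the model argument is whatever the guide specifies):

    python run_generation_gpu_woq_for_llama.py \
        --model <model_path_or_id> \
        --benchmark \
        --accuracy \
        --trust_remote_code \
        --profile_token_latency

The error: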

2024-07-06 15:47:55,979 - datasets - INFO - PyTorch version 2.1.0a0+git04048c2 available.
---- Prompt size: 31
Loading checkpoint shards:  88%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                  | 7/8 [00:57<00:08,  8.23s/it]
Traceback (most recent call last):
  File "C:\Users\rohit\intel-extension-for-pytorch\examples\gpu\inference\python\llm\run_generation_gpu_woq_for_llama.py", line 218, in <module>
    user_model = AutoModelForCausalLM.from_pretrained(
  File "C:\Users\rohit\miniconda3\envs\llm\lib\site-packages\intel_extension_for_transformers\transformers\modeling\modeling_auto.py", line 983, in from_pretrained
    model = cls.ORIG_MODEL.from_pretrained(
  File "C:\Users\rohit\miniconda3\envs\llm\lib\site-packages\transformers\models\auto\auto_factory.py", line 559, in from_pretrained
    return model_class.from_pretrained(
  File "C:\Users\rohit\miniconda3\envs\llm\lib\site-packages\transformers\modeling_utils.py", line 3838, in from_pretrained
    ) = cls._load_pretrained_model(
  File "C:\Users\rohit\miniconda3\envs\llm\lib\site-packages\transformers\modeling_utils.py", line 4298, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "C:\Users\rohit\miniconda3\envs\llm\lib\site-packages\transformers\modeling_utils.py", line 895, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "C:\Users\rohit\miniconda3\envs\llm\lib\site-packages\accelerate\utils\modeling.py", line 404, in set_module_tensor_to_device
    new_value = value.to(device)
RuntimeError: XPU out of memory. Tried to allocate 1.16 GiB (GPU 0; 14.42 GiB total capacity; 13.14 GiB already allocated; 13.14 GiB reserved in total by PyTorch)

Is there a way I can perhaps offload the rest of the workload to the CPU? Or are there any other solutions?
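One thing I could try, if the loading path accepts the standard transformers/accelerate arguments, is capping the XPU allocation and spilling the remaining layers to CPU RAM via device_map/max_memory. A rough sketch of what I mean (the memory limits are placeholders, and whether accelerate maps device 0 to the iGPU depends on the installed accelerate/IPEX versions):

    import torch
    from transformers import AutoModelForCausalLM

    # Hedged sketch: split the checkpoint between XPU device 0 and CPU RAM.
    # The "10GiB"/"24GiB" limits are placeholders, not measured values.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",
        torch_dtype=torch.float16,               # half-precision weights instead of fp32
        device_map="auto",                       # let accelerate place layers automatically
        max_memory={0: "10GiB", "cpu": "24GiB"}, # cap device 0, offload the rest to CPU
    )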

feng-intel commented 1 week ago

Hi, what's the platform?

rohitpreddy07 commented 1 week ago

I'm using a Lenovo Yoga Slim 7 with an Intel Core Ultra 7 and 32 GB of RAM.

feng-intel commented 1 week ago

Do you want to run llama3 on the iGPU? What's the memory size of the iGPU?

rohitpreddy07 commented 1 week ago

I tried running the Qwen/Qwen-7B-Chat model on the iGPU, which has 15.8 GB of shared memory, but I run out of space while loading the safetensors files. I do eventually want to run llama3 on it, though.
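One thing I might also try is loading in fp16 instead of fp32: for a 7B model the fp32 weights are roughly 28 GB, while fp16 is about 14 GB, which at least has a chance of fitting in 15.8 GB of shared memory. A rough sketch, assuming the script ends up in a plain AutoModelForCausalLM.from_pretrained call:

    import torch
    import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers the "xpu" device
    from transformers import AutoModelForCausalLM

    # Hedged sketch: load Qwen-7B-Chat in fp16 so the weights (~14 GB) may fit in
    # the 15.8 GB of shared iGPU memory; it will still be tight once the KV cache
    # is allocated.
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-7B-Chat",
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,   # avoid materializing a second full copy while loading
        trust_remote_code=True,   # Qwen checkpoints ship custom modeling code
    ).to("xpu")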

feng-intel commented 1 week ago

Did you try int4? https://intel.github.io/intel-extension-for-pytorch/llm/llama3/xpu/#int4-woq-model
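Very roughly, the INT4 WOQ loading side in that guide goes through intel_extension_for_transformers rather than plain transformers, so only the quantized weights (a few GB for an 8B model) need to fit in iGPU memory. A sketch of what that looks like (the kwarg names load_in_4bit / device_map are assumptions based on the intel_extension_for_transformers examples and may differ between releases, so please follow the linked guide for the exact invocation):

    import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers the "xpu" device
    from intel_extension_for_transformers.transformers import AutoModelForCausalLM

    # Hedged sketch: 4-bit weight-only load straight onto the XPU.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",
        load_in_4bit=True,      # assumed kwarg; weight-only 4-bit quantization on load
        device_map="xpu",
        trust_remote_code=True,
    )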

rohitpreddy07 commented 1 week ago

@feng-intel I am able to download the model successfully; however, I run into the following error while tuning: [ERROR] Unexpected exception AttributeError("'DataLoader' object has no attribute 'split'") happened during tuning.

Here is the call log:

2024-07-08 16:30:52 [INFO] Quantize model without tuning!
2024-07-08 16:30:52 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an
        eval_dataloader an eval_metric.
2024-07-08 16:30:52 [INFO] Adaptor has 5 recipes.
2024-07-08 16:30:52 [INFO] 0 recipes specified by user.
2024-07-08 16:30:52 [INFO] 3 recipes require future tuning.
2024-07-08 16:30:53 [INFO] *** Initialize auto tuning
Exception in thread Thread-5:
Traceback (most recent call last):
  File "C:\Users\rohit\miniconda3\envs\llm\lib\threading.py", line 980, in _bootstrap_inner
    self.run()
  File "C:\Users\rohit\miniconda3\envs\llm\lib\threading.py", line 1304, in run
    self.finished.wait(self.interval)
  File "C:\Users\rohit\miniconda3\envs\llm\lib\threading.py", line 581, in wait
    signaled = self._cond.wait(timeout)
  File "C:\Users\rohit\miniconda3\envs\llm\lib\threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
OverflowError: timeout value is too large
2024-07-08 16:30:53 [INFO] {
2024-07-08 16:30:53 [INFO]     'PostTrainingQuantConfig': {
2024-07-08 16:30:53 [INFO]         'AccuracyCriterion': {
2024-07-08 16:30:53 [INFO]             'criterion': 'relative',
2024-07-08 16:30:53 [INFO]             'higher_is_better': True,
2024-07-08 16:30:53 [INFO]             'tolerable_loss': 0.01,
2024-07-08 16:30:53 [INFO]             'absolute': None,
2024-07-08 16:30:53 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x0000025376DAB8B0>>,
2024-07-08 16:30:53 [INFO]             'relative': 0.01
2024-07-08 16:30:53 [INFO]         },
2024-07-08 16:30:53 [INFO]         'approach': 'post_training_weight_only',
2024-07-08 16:30:53 [INFO]         'backend': 'default',
2024-07-08 16:30:53 [INFO]         'calibration_sampling_size': [
2024-07-08 16:30:53 [INFO]             100
2024-07-08 16:30:53 [INFO]         ],
2024-07-08 16:30:53 [INFO]         'device': 'cpu',
2024-07-08 16:30:53 [INFO]         'diagnosis': False,
2024-07-08 16:30:53 [INFO]         'domain': 'auto',
2024-07-08 16:30:53 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-07-08 16:30:53 [INFO]         'excluded_precisions': [
2024-07-08 16:30:53 [INFO]         ],
2024-07-08 16:30:53 [INFO]         'framework': 'pytorch_fx',
2024-07-08 16:30:53 [INFO]         'inputs': [
2024-07-08 16:30:53 [INFO]         ],
2024-07-08 16:30:53 [INFO]         'model_name': '',
2024-07-08 16:30:53 [INFO]         'ni_workload_name': 'quantization',
2024-07-08 16:30:53 [INFO]         'op_name_dict': None,
2024-07-08 16:30:53 [INFO]         'op_type_dict': {
2024-07-08 16:30:53 [INFO]             '.*': {
2024-07-08 16:30:53 [INFO]                 'weight': {
2024-07-08 16:30:53 [INFO]                     'bits': [
2024-07-08 16:30:53 [INFO]                         4
2024-07-08 16:30:53 [INFO]                     ],
2024-07-08 16:30:53 [INFO]                     'dtype': [
2024-07-08 16:30:53 [INFO]                         'int4'
2024-07-08 16:30:53 [INFO]                     ],
2024-07-08 16:30:53 [INFO]                     'group_size': [
2024-07-08 16:30:53 [INFO]                         32
2024-07-08 16:30:53 [INFO]                     ],
2024-07-08 16:30:53 [INFO]                     'scheme': [
2024-07-08 16:30:53 [INFO]                         'sym'
2024-07-08 16:30:53 [INFO]                     ],
2024-07-08 16:30:53 [INFO]                     'algorithm': [
2024-07-08 16:30:53 [INFO]                         'AUTOROUND'
2024-07-08 16:30:53 [INFO]                     ]
2024-07-08 16:30:53 [INFO]                 }
2024-07-08 16:30:53 [INFO]             }
2024-07-08 16:30:53 [INFO]         },
2024-07-08 16:30:53 [INFO]         'outputs': [
2024-07-08 16:30:53 [INFO]         ],
2024-07-08 16:30:53 [INFO]         'quant_format': 'default',
2024-07-08 16:30:53 [INFO]         'quant_level': 'auto',
2024-07-08 16:30:53 [INFO]         'recipes': {
2024-07-08 16:30:53 [INFO]             'smooth_quant': False,
2024-07-08 16:30:53 [INFO]             'smooth_quant_args': {
2024-07-08 16:30:53 [INFO]             },
2024-07-08 16:30:53 [INFO]             'layer_wise_quant': False,
2024-07-08 16:30:53 [INFO]             'layer_wise_quant_args': {
2024-07-08 16:30:53 [INFO]             },
2024-07-08 16:30:53 [INFO]             'fast_bias_correction': False,
2024-07-08 16:30:53 [INFO]             'weight_correction': False,
2024-07-08 16:30:53 [INFO]             'gemm_to_matmul': True,
2024-07-08 16:30:53 [INFO]             'graph_optimization_level': None,
2024-07-08 16:30:53 [INFO]             'first_conv_or_matmul_quantization': True,
2024-07-08 16:30:53 [INFO]             'last_conv_or_matmul_quantization': True,
2024-07-08 16:30:53 [INFO]             'pre_post_process_quantization': True,
2024-07-08 16:30:53 [INFO]             'add_qdq_pair_to_weight': False,
2024-07-08 16:30:53 [INFO]             'optypes_to_exclude_output_quant': [
2024-07-08 16:30:53 [INFO]             ],
2024-07-08 16:30:53 [INFO]             'dedicated_qdq_pair': False,
2024-07-08 16:30:53 [INFO]             'rtn_args': {
2024-07-08 16:30:53 [INFO]             },
2024-07-08 16:30:53 [INFO]             'awq_args': {
2024-07-08 16:30:53 [INFO]             },
2024-07-08 16:30:53 [INFO]             'gptq_args': {
2024-07-08 16:30:53 [INFO]             },
2024-07-08 16:30:53 [INFO]             'teq_args': {
2024-07-08 16:30:53 [INFO]             },
2024-07-08 16:30:53 [INFO]             'autoround_args': {
2024-07-08 16:30:53 [INFO]                 'n_samples': 512,
2024-07-08 16:30:53 [INFO]                 'seq_len': 2048,
2024-07-08 16:30:53 [INFO]                 'iters': 200,
2024-07-08 16:30:53 [INFO]                 'scale_dtype': 'fp16',
2024-07-08 16:30:53 [INFO]                 'use_quant_input': True,
2024-07-08 16:30:53 [INFO]                 'lr': 0.005,
2024-07-08 16:30:53 [INFO]                 'minmax_lr': 0.01
2024-07-08 16:30:53 [INFO]             }
2024-07-08 16:30:53 [INFO]         },
2024-07-08 16:30:53 [INFO]         'reduce_range': None,
2024-07-08 16:30:53 [INFO]         'TuningCriterion': {
2024-07-08 16:30:53 [INFO]             'max_trials': 100,
2024-07-08 16:30:53 [INFO]             'objective': [
2024-07-08 16:30:53 [INFO]                 'performance'
2024-07-08 16:30:53 [INFO]             ],
2024-07-08 16:30:53 [INFO]             'strategy': 'basic',
2024-07-08 16:30:53 [INFO]             'strategy_kwargs': None,
2024-07-08 16:30:53 [INFO]             'timeout': 0
2024-07-08 16:30:53 [INFO]         },
2024-07-08 16:30:53 [INFO]         'use_bf16': True
2024-07-08 16:30:53 [INFO]     }
2024-07-08 16:30:53 [INFO] }
2024-07-08 16:30:53 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-07-08 16:30:53 [INFO] Pass query framework capability elapsed time: 19.0 ms
2024-07-08 16:30:53 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-07-08 16:30:53 [INFO] Quantize the model with default config.
2024-07-08 16:30:53 [INFO] All algorithms to do: {'AUTOROUND'}
2024-07-08 16:30:53 [INFO] quantizing with the AutoRound algorithm
2024-07-08 16:30:53 INFO utils.py L538: Using CPU device
2024-07-08 16:30:56 WARNING autoround.py L480: amp is set to FALSE as the currentdevice does not support the 'bf16' data type.
2024-07-08 16:30:56 INFO autoround.py L485: using torch.float32 for quantization tuning
2024-07-08 16:30:56 [ERROR] Unexpected exception AttributeError("'DataLoader' object has no attribute 'split'") happened during tuning.

feng-intel commented 1 week ago

What's your full command line? What's the dtype of the model, fp32 or fp16?

rohitpreddy07 commented 1 week ago

The command line, taken from the aforementioned site, is as follows.

    --model $model_path \
    --woq --woq_algo AutoRound  \
    --use_quant_input \
    --calib_iters 200  \
    --lr 5e-3 \
    --minmax_lr 1e-2 \
    --output_dir llama3_all_int4  \
    --nsamples 512

The dtype of the llama3 model is, I think, fp32, and the command line quantizes it to output an int4 version of llama3 that can be run locally. The model path is "meta-llama/Meta-Llama-3-8B".
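As a side note, the dtype declared in the checkpoint can be read from its config without downloading or loading the weights; a small check along these lines (the model id is the one quoted above, and the gated Llama-3 repo requires accepting the license on the Hub first):

    from transformers import AutoConfig

    # Reads config.json only; prints the dtype the checkpoint was saved in
    # (e.g. torch.float16 / torch.bfloat16), or None if it defaults to fp32.
    cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
    print(cfg.torch_dtype)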