Open rohitpreddy07 opened 2 weeks ago
Hi, what's the platform?
I'm using a Lenovo Yoga Slim 7 with Intel Core Ultra 7 and 32GB RAM
Do you want to run llama3 on the iGPU? What's the memory size of the iGPU?
I tried running the Qwen/Qwen-7B-Chat model on the iGPU, which has 15.8 GB of shared memory, but I run out of memory while loading the safetensors files. Eventually, though, I want to run llama3 on it.
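A rough back-of-the-envelope check may help explain why loading runs out of memory. The sketch below estimates the size of the weights alone for a few data types. The parameter counts (7e9 and 8e9) are nominal assumptions taken from the model names, not exact figures, and the estimate ignores activations, the KV cache, and any temporary copies made while loading.

```python
# Rough weight-memory estimate. Parameter counts are nominal assumptions
# (taken from the "7B"/"8B" in the model names), not exact figures.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for name, params in [("7B (nominal)", 7e9), ("8B (nominal)", 8e9)]:
        for dtype in ("fp32", "fp16", "int4"):
            size = weight_memory_gib(params, dtype)
            print(f"{name:>14} {dtype:>5}: {size:6.1f} GiB")
```

Under these assumptions, a nominal 7B model needs about 13 GiB in fp16 and about 26 GiB in fp32, so a 15.8 GB shared pool leaves little or no headroom before quantization, while an int4 copy of the weights would be only a few GiB.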
@feng-intel I am able to download the model successfully; however, I am running into the following error while tuning.
[ERROR] Unexpected exception AttributeError("'DataLoader' object has no attribute 'split'") happened during tuning.
Here is the call log:
2024-07-08 16:30:52 [INFO] Quantize model without tuning!
2024-07-08 16:30:52 [INFO] Quantize the model with default configuration without evaluating the model. To perform the tuning process, please either provide an eval_func or provide an eval_dataloader an eval_metric.
2024-07-08 16:30:52 [INFO] Adaptor has 5 recipes.
2024-07-08 16:30:52 [INFO] 0 recipes specified by user.
2024-07-08 16:30:52 [INFO] 3 recipes require future tuning.
2024-07-08 16:30:53 [INFO] *** Initialize auto tuning
Exception in thread Thread-5:
Traceback (most recent call last):
2024-07-08 16:30:53 [INFO] {
File "C:\Users\rohit\miniconda3\envs\llm\lib\threading.py", line 980, in _bootstrap_inner
2024-07-08 16:30:53 [INFO] 'PostTrainingQuantConfig': {
2024-07-08 16:30:53 [INFO] 'AccuracyCriterion': {
2024-07-08 16:30:53 [INFO] 'criterion': 'relative',
2024-07-08 16:30:53 [INFO] 'higher_is_better': True,
2024-07-08 16:30:53 [INFO] 'tolerable_loss': 0.01,
2024-07-08 16:30:53 [INFO] 'absolute': None,
2024-07-08 16:30:53 [INFO] 'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x0000025376DAB8B0>>,
2024-07-08 16:30:53 [INFO] 'relative': 0.01
2024-07-08 16:30:53 [INFO] },
2024-07-08 16:30:53 [INFO] 'approach': 'post_training_weight_only',
2024-07-08 16:30:53 [INFO] 'backend': 'default',
2024-07-08 16:30:53 [INFO] 'calibration_sampling_size': [
2024-07-08 16:30:53 [INFO] 100
self.run()
File "C:\Users\rohit\miniconda3\envs\llm\lib\threading.py", line 1304, in run
2024-07-08 16:30:53 [INFO] ],
2024-07-08 16:30:53 [INFO] 'device': 'cpu',
2024-07-08 16:30:53 [INFO] 'diagnosis': False,
self.finished.wait(self.interval)
File "C:\Users\rohit\miniconda3\envs\llm\lib\threading.py", line 581, in wait
2024-07-08 16:30:53 [INFO] 'domain': 'auto',
2024-07-08 16:30:53 [INFO] 'example_inputs': 'Not printed here due to large size tensors...',
2024-07-08 16:30:53 [INFO] 'excluded_precisions': [
2024-07-08 16:30:53 [INFO] ],
2024-07-08 16:30:53 [INFO] 'framework': 'pytorch_fx',
2024-07-08 16:30:53 [INFO] 'inputs': [
signaled = self._cond.wait(timeout)
2024-07-08 16:30:53 [INFO] ],
File "C:\Users\rohit\miniconda3\envs\llm\lib\threading.py", line 316, in wait
2024-07-08 16:30:53 [INFO] 'model_name': '',
2024-07-08 16:30:53 [INFO] 'ni_workload_name': 'quantization',
2024-07-08 16:30:53 [INFO] 'op_name_dict': None,
2024-07-08 16:30:53 [INFO] 'op_type_dict': {
gotit = waiter.acquire(True, timeout)
2024-07-08 16:30:53 [INFO] '.*': {
OverflowError: timeout value is too large
2024-07-08 16:30:53 [INFO] 'weight': {
2024-07-08 16:30:53 [INFO] 'bits': [
2024-07-08 16:30:53 [INFO] 4
2024-07-08 16:30:53 [INFO] ],
2024-07-08 16:30:53 [INFO] 'dtype': [
2024-07-08 16:30:53 [INFO] 'int4'
2024-07-08 16:30:53 [INFO] ],
2024-07-08 16:30:53 [INFO] 'group_size': [
2024-07-08 16:30:53 [INFO] 32
2024-07-08 16:30:53 [INFO] ],
2024-07-08 16:30:53 [INFO] 'scheme': [
2024-07-08 16:30:53 [INFO] 'sym'
2024-07-08 16:30:53 [INFO] ],
2024-07-08 16:30:53 [INFO] 'algorithm': [
2024-07-08 16:30:53 [INFO] 'AUTOROUND'
2024-07-08 16:30:53 [INFO] ]
2024-07-08 16:30:53 [INFO] }
2024-07-08 16:30:53 [INFO] }
2024-07-08 16:30:53 [INFO] },
2024-07-08 16:30:53 [INFO] 'outputs': [
2024-07-08 16:30:53 [INFO] ],
2024-07-08 16:30:53 [INFO] 'quant_format': 'default',
2024-07-08 16:30:53 [INFO] 'quant_level': 'auto',
2024-07-08 16:30:53 [INFO] 'recipes': {
2024-07-08 16:30:53 [INFO] 'smooth_quant': False,
2024-07-08 16:30:53 [INFO] 'smooth_quant_args': {
2024-07-08 16:30:53 [INFO] },
2024-07-08 16:30:53 [INFO] 'layer_wise_quant': False,
2024-07-08 16:30:53 [INFO] 'layer_wise_quant_args': {
2024-07-08 16:30:53 [INFO] },
2024-07-08 16:30:53 [INFO] 'fast_bias_correction': False,
2024-07-08 16:30:53 [INFO] 'weight_correction': False,
2024-07-08 16:30:53 [INFO] 'gemm_to_matmul': True,
2024-07-08 16:30:53 [INFO] 'graph_optimization_level': None,
2024-07-08 16:30:53 [INFO] 'first_conv_or_matmul_quantization': True,
2024-07-08 16:30:53 [INFO] 'last_conv_or_matmul_quantization': True,
2024-07-08 16:30:53 [INFO] 'pre_post_process_quantization': True,
2024-07-08 16:30:53 [INFO] 'add_qdq_pair_to_weight': False,
2024-07-08 16:30:53 [INFO] 'optypes_to_exclude_output_quant': [
2024-07-08 16:30:53 [INFO] ],
2024-07-08 16:30:53 [INFO] 'dedicated_qdq_pair': False,
2024-07-08 16:30:53 [INFO] 'rtn_args': {
2024-07-08 16:30:53 [INFO] },
2024-07-08 16:30:53 [INFO] 'awq_args': {
2024-07-08 16:30:53 [INFO] },
2024-07-08 16:30:53 [INFO] 'gptq_args': {
2024-07-08 16:30:53 [INFO] },
2024-07-08 16:30:53 [INFO] 'teq_args': {
2024-07-08 16:30:53 [INFO] },
2024-07-08 16:30:53 [INFO] 'autoround_args': {
2024-07-08 16:30:53 [INFO] 'n_samples': 512,
2024-07-08 16:30:53 [INFO] 'seq_len': 2048,
2024-07-08 16:30:53 [INFO] 'iters': 200,
2024-07-08 16:30:53 [INFO] 'scale_dtype': 'fp16',
2024-07-08 16:30:53 [INFO] 'use_quant_input': True,
2024-07-08 16:30:53 [INFO] 'lr': 0.005,
2024-07-08 16:30:53 [INFO] 'minmax_lr': 0.01
2024-07-08 16:30:53 [INFO] }
2024-07-08 16:30:53 [INFO] },
2024-07-08 16:30:53 [INFO] 'reduce_range': None,
2024-07-08 16:30:53 [INFO] 'TuningCriterion': {
2024-07-08 16:30:53 [INFO] 'max_trials': 100,
2024-07-08 16:30:53 [INFO] 'objective': [
2024-07-08 16:30:53 [INFO] 'performance'
2024-07-08 16:30:53 [INFO] ],
2024-07-08 16:30:53 [INFO] 'strategy': 'basic',
2024-07-08 16:30:53 [INFO] 'strategy_kwargs': None,
2024-07-08 16:30:53 [INFO] 'timeout': 0
2024-07-08 16:30:53 [INFO] },
2024-07-08 16:30:53 [INFO] 'use_bf16': True
2024-07-08 16:30:53 [INFO] }
2024-07-08 16:30:53 [INFO] }
2024-07-08 16:30:53 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-07-08 16:30:53 [INFO] Pass query framework capability elapsed time: 19.0 ms
2024-07-08 16:30:53 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-07-08 16:30:53 [INFO] Quantize the model with default config.
2024-07-08 16:30:53 [INFO] All algorithms to do: {'AUTOROUND'}
2024-07-08 16:30:53 [INFO] quantizing with the AutoRound algorithm
2024-07-08 16:30:53 INFO utils.py L538: Using CPU device
2024-07-08 16:30:56 WARNING autoround.py L480: amp is set to FALSE as the currentdevice does not support the 'bf16' data type.
2024-07-08 16:30:56 INFO autoround.py L485: using torch.float32 for quantization tuning
2024-07-08 16:30:56 [ERROR] Unexpected exception AttributeError("'DataLoader' object has no attribute 'split'") happened during tuning.
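One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: some code on the AutoRound path may expect its dataset argument to be a string (for example, a comma-separated list of dataset names) and call `.split` on it, while a `DataLoader` object is being passed instead. The minimal sketch below reproduces the same message. `load_calibration_data` is a hypothetical helper, and the `DataLoader` class is a stand-in so the example runs without PyTorch.

```python
class DataLoader:
    """Stand-in for torch.utils.data.DataLoader so the sketch runs without PyTorch."""


def load_calibration_data(dataset):
    # Hypothetical helper: assumes `dataset` is a comma-separated string of names.
    return [name.strip() for name in dataset.split(",")]


def describe_failure(dataset) -> str:
    """Return 'ok' if the helper accepts the argument, else the error text."""
    try:
        load_calibration_data(dataset)
    except AttributeError as exc:
        return str(exc)
    return "ok"


if __name__ == "__main__":
    print(describe_failure("dataset-a, dataset-b"))  # a string is accepted
    print(describe_failure(DataLoader()))            # an object without .split fails
```

If this is indeed the cause, the fix would be on the calling side (passing what the quantizer expects, or using matching versions of the libraries involved), but that would need to be confirmed against the installed versions.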
What's your full command line? What's the dtype of the model, fp32 or fp16?
The command line, taken from the aforementioned site, is as follows:
--model $model_path \
--woq --woq_algo AutoRound \
--use_quant_input \
--calib_iters 200 \
--lr 5e-3 \
--minmax_lr 1e-2 \
--output_dir llama3_all_int4 \
--nsamples 512
I think the dtype of the llama3 model is fp32, and the command line tries to quantize it into an int4 version of llama3 that can be run locally. The model path is "meta-llama/Meta-Llama-3-8B".
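For context on what int4 weight-only quantization with the logged settings (4 bits, group size 32, symmetric scheme) does to the weights, here is a minimal sketch. It uses plain round-to-nearest, not the AutoRound algorithm named in the log, which tunes the rounding over many iterations. The function names are hypothetical and the input is random illustrative data.

```python
import random


def quantize_group_sym_int4(weights):
    """Quantize one group of floats to signed 4-bit integers with a shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Symmetric scheme: map the largest magnitude onto 7; zero stays exactly zero.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate float weights from the integers and their scale."""
    return [v * scale for v in q]


def quantize_rows(weights, group_size=32):
    """Split a flat list of weights into groups and quantize each independently."""
    return [
        quantize_group_sym_int4(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


if __name__ == "__main__":
    random.seed(0)
    w = [random.uniform(-1, 1) for _ in range(64)]
    groups = quantize_rows(w, group_size=32)
    restored = [x for q, s in groups for x in dequantize_group(q, s)]
    max_err = max(abs(a - b) for a, b in zip(w, restored))
    print(f"groups: {len(groups)}, max abs error: {max_err:.4f}")
```

Each group of 32 weights stores one floating-point scale plus 32 four-bit integers, which is where the memory saving over fp16 or fp32 comes from. A smaller group size lowers the rounding error at the cost of storing more scales.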
Describe the issue
I am trying to replicate the following: https://intel.github.io/intel-extension-for-pytorch/llm/llama3/xpu/ . While running the python run_generation_gpu_woq_for_llama.py script with the --benchmark, accuracy, --trust_remote_code and profile_token_latency flags, I am running into the following error. Is there a way I can perhaps offload the rest of the workload to the CPU? Or are there any other solutions?