AssertionError: BF16 weight prepack needs the cpu support avx512bw, avx512vl and avx512dq

iodone commented 2 months ago

CPU Info：

Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

OS Info:

OS: CentOS Linux release 7.3.1611 (Core)
Kernel: 3.10.0-1160.1.0.el7.x86_64

Finetune Command

python llm_on_ray/finetune/finetune.py --config_file llm_on_ray/finetune/models/opt-125m.yaml

Complete exception stack：


RayTrainWorker pid=164529, ip=10.142.64.53) Warning: Cannot load xpu CCL. CCL doesn't work for XPU device
(RayTrainWorker pid=29201) Warning: Cannot load xpu CCL. CCL doesn't work for XPU device
(RayTrainWorker pid=164529, ip=10.142.64.53) Setting up process group for: env:// [rank=0, world_size=2]
(RayTrainWorker pid=29201) /home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
(RayTrainWorker pid=29201)   warnings.warn(
(RayTrainWorker pid=164529, ip=10.142.64.53) /home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
(RayTrainWorker pid=164529, ip=10.142.64.53)   warnings.warn(
Prompt: 100%|██████████| 28/28 [00:00<00:00, 2227.46 examples/s]
Prompt: 100%|██████████| 2/2 [00:00<00:00, 358.93 examples/s]
Prompt: 100%|██████████| 2/2 [00:00<00:00, 359.55 examples/s]
Tokenize dataset:   0%|          | 0/28 [00:00<?, ? examples/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Tokenize dataset: 100%|██████████| 28/28 [00:00<00:00, 469.86 examples/s]
Tokenize dataset: 100%|██████████| 2/2 [00:00<00:00, 277.29 examples/s]
Tokenize dataset: 100%|██████████| 2/2 [00:00<00:00, 285.47 examples/s]
Grouping texts in chunks of 512: 100%|██████████| 28/28 [00:00<00:00, 1494.80 examples/s]
Grouping texts in chunks of 512:   0%|          | 0/2 [00:00<?, ? examples/s]
Grouping texts in chunks of 512: 100%|██████████| 2/2 [00:00<00:00, 317.53 examples/s]
Grouping texts in chunks of 512: 100%|██████████| 2/2 [00:00<00:00, 322.68 examples/s]
(RayTrainWorker pid=164529, ip=10.142.64.53) /home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/transformers/training_args.py:1376: FutureWarning: using `no_cuda` is deprecated and will be removed in version 5.0 of 🤗 Transformers. Use `use_cpu` instead
(RayTrainWorker pid=164529, ip=10.142.64.53)   warnings.warn(
(RayTrainWorker pid=164529, ip=10.142.64.53) /home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/accelerate/accelerator.py:444: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
(RayTrainWorker pid=164529, ip=10.142.64.53) dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
(RayTrainWorker pid=164529, ip=10.142.64.53)   warnings.warn(
(RayTrainWorker pid=164529, ip=10.142.64.53) 2024-06-19 21:15:32,367 - accelerate.utils.other - WARNING - Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
(RayTrainWorker pid=164529, ip=10.142.64.53) 2024-06-19 21:15:32,373 - common - INFO - train start
(RayTrainWorker pid=164529, ip=10.142.64.53) /home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/intel_extension_for_pytorch/frontend.py:507: UserWarning: IPEX CPU does not support fused/fused split update for <class 'torch.optim.adamw.AdamW'> will use non-fused master weight update for bf16 training on CPU.
(RayTrainWorker pid=164529, ip=10.142.64.53)   warnings.warn(
(RayTrainWorker pid=164529, ip=10.142.64.53) /home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/intel_extension_for_pytorch/nn/utils/_parameter_wrapper.py:332: UserWarning: WARNING: Can't convert model's parameters dtype from torch.bfloat16 to torch.bfloat16
(RayTrainWorker pid=164529, ip=10.142.64.53)   warnings.warn(
ray.exceptions.RayTaskError(AssertionError): ray::TrainTrainable.train() (pid=164395, ip=10.142.64.53, actor_id=3fe2737abc506c2e57b7920d0c000000, repr=TorchTrainer)
File "/home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 53, in check_for_failure
ray.get(object_ref)
ray.exceptions.RayTaskError(AssertionError): ray::_RayTrainWorker__execute.get_next() (pid=164529, ip=10.142.64.53, actor_id=da5a5d80e1c9366dbf676c560c000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f826c247f40>)
File "/home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
raise skipped from exception_cause(skipped)
File "/home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 169, in discard_return_wrapper
train_func(*args, **kwargs)
File "/home/work/app/llm-on-ray/llm_on_ray/finetune/finetune.py", line 371, in train_func
trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/transformers/trainer.py", line 1757, in _inner_training_loop
model = self._wrap_model(self.model_wrapped)
File "/home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/transformers/trainer.py", line 1376, in _wrap_model
model = self.ipex_optimize_model(model, training, dtype=dtype)
File "/home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/transformers/trainer.py", line 1367, in ipex_optimize_model
model, self.optimizer = ipex.optimize(
File "/home/work/app/miniconda/envs/llm-on-ray/lib/python3.9/site-packages/intel_extension_for_pytorch/frontend.py", line 550, in optimize
assert core.onednn_has_bf16_support(), (
AssertionError: BF16 weight prepack needs the cpu support avx512bw, avx512vl and avx512dq, but the desired instruction sets are not available. Please set dtype to torch.float or set weights_prepack to False.

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/work/app/llm-on-ray/llm_on_ray/finetune/finetune.py", line 502, in main() File "/home/work/app/llm-on-ray/llm_on_ray/finetune/finetune.py", line 496, in main results = trainer.fit()

xwu99 commented 2 months ago

@iodone Your CPU doesn't support some AVX512 instructions, could you check with lscpu | grep avx512. FP16 is not supported if AVX512 is not available, in that case you need to set dtype to float (much slower).

iodone commented 2 months ago

@iodone Your CPU doesn't support some AVX512 instructions, could you check with lscpu | grep avx512. FP16 is not supported if AVX512 is not available, in that case you need to set dtype to float (much slower).

Thank you for the reply. I checked that the CPU does not support avx512. Where in the code should set dtype to float be changed?

xwu99 commented 2 months ago

@iodone Your CPU doesn't support some AVX512 instructions, could you check with lscpu | grep avx512. FP16 is not supported if AVX512 is not available, in that case you need to set dtype to float (much slower).

Thank you for the reply. I checked that the CPU does not support avx512. Where in the code should set dtype to float be changed?

you can set that in yaml file

rain7996 commented 2 months ago

@iodone Your CPU doesn't support some AVX512 instructions, could you check with lscpu | grep avx512. FP16 is not supported if AVX512 is not available, in that case you need to set dtype to float (much slower).

Thank you for the reply. I checked that the CPU does not support avx512. Where in the code should set dtype to float be changed?

you can set that in yaml file

how do I set them? In what format? Just dtype: torch.float in the yaml file?

xwu99 commented 2 months ago

@iodone Your CPU doesn't support some AVX512 instructions, could you check with lscpu | grep avx512. FP16 is not supported if AVX512 is not available, in that case you need to set dtype to float (much slower).

Thank you for the reply. I checked that the CPU does not support avx512. Where in the code should set dtype to float be changed?

you can set that in yaml file

how do I set them? In what format? Just dtype: torch.float in the yaml file?

Hi, my mistake. float is only supported in inference. only fp16 and bf16 are supported for finetuning.

https://github.com/intel/llm-on-ray/blob/main/docs/finetune_parameters.md

mixed_precision | no | Whether or not to use mixed precision training. Choose from "no", "fp16", "bf16". Default is "no" if not set.

intel / llm-on-ray

AssertionError: BF16 weight prepack needs the cpu support avx512bw, avx512vl and avx512dq #255