raymondbernard opened this issue 1 month ago
I don't believe PyTorch supports Intel GPUs natively. You might need to install a third-party package to enable this. See https://pytorch.org/tutorials/recipes/intel_extension_for_pytorch.html for an example; this is an official Intel package that provides better support for Intel CPUs and GPUs. That said, I can't guarantee that all torchtune features will work with this extension. Let me know how it goes!
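If you try it, a quick sanity check that the extension can actually see your GPUs might look something like this (a sketch; it assumes an XPU-enabled build of IPEX):

import torch
# Importing intel_extension_for_pytorch registers the "xpu" device with PyTorch
import intel_extension_for_pytorch as ipex

print("torch:", torch.__version__, "| ipex:", ipex.__version__)
print("XPU available:", torch.xpu.is_available())
print("XPU device count:", torch.xpu.device_count())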
@RdoubleA I will give it a shot and let you know. I will start by adjusting a single-GPU recipe and config file to see if I can get it to work.
I am relatively new to torchtune. From what I understand, it is designed to facilitate LLM training on consumer-grade Nvidia GPUs, but it involves some deep abstractions, so this is more complex than it initially seems. Here are the steps needed to run on Intel hardware:

1. Import intel_extension_for_pytorch as ipex.
2. Call the ipex.optimize function for additional performance enhancements; it applies optimizations to both the model and the optimizer.

Here's an example implementation:
import torch
import intel_extension_for_pytorch as ipex

# Initialize the model, criterion, and optimizer
model = Model()
criterion = ...
optimizer = ...
model.train()

# Move the model and loss criterion to XPU before calling ipex.optimize()
model = model.to("xpu")
criterion = criterion.to("xpu")

# Optimize the model and optimizer; pick ONE of the two calls below.
# For Float32:
model, optimizer = ipex.optimize(model, optimizer=optimizer)
# For BFloat16:
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

# Prepare the dataloader
dataloader = ...

for input, target in dataloader:
    input = input.to("xpu")
    target = target.to("xpu")
    optimizer.zero_grad()

    # For Float32:
    output = model(input)
    # For BFloat16, wrap the forward pass in autocast instead:
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
        output = model(input)

    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
It would be great if the maintainers could point us in the right direction. I will take this up again this week.
@RdoubleA Intel GPUs are supported in PyTorch 2.3.0: https://github.com/pytorch/pytorch?tab=readme-ov-file#intel-gpu-support
We should be able to support Intel GPUs! We are using the Intel Developer Cloud. Please advise.
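A quick way to verify what a given PyTorch build actually exposes (guarding for builds without the XPU backend, since this is not in every release):

import torch

# Probe the installed PyTorch build for a native XPU backend.
if hasattr(torch, "xpu") and torch.xpu.is_available():
    print(f"PyTorch {torch.__version__}: native XPU backend is available")
else:
    print(f"PyTorch {torch.__version__}: this build has no usable XPU backend")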
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

Python 3.9.18 (tags/v3.9.18-26-g6b320c3b2f6-dirty:6b320c3b2f6, Sep 28 2023, 00:35:27) [GCC 13.2.0] :: Intel Corporation on linux
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
Notebook commands:

!echo "List of Intel GPUs available on the system:"
!xpu-smi discovery 2> /dev/null
!echo "Intel Xeon CPU used by this notebook:"
!lscpu | grep "Model name"
List of Intel GPUs available on the system:
+-----------+------------------------------------------------+
| Device ID | Device Information                             |
+-----------+------------------------------------------------+
| 0         | Device Name: Intel(R) Data Center GPU Max 1100 |
|           | Vendor Name: Intel(R) Corporation              |
|           | SOC UUID: 00000000-0000-0029-0000-002f0bda8086 |
|           | PCI BDF Address: 0000:29:00.0                  |
|           | DRM Device: /dev/dri/card0                     |
|           | Function Type: physical                        |
+-----------+------------------------------------------------+
| 1         | Device Name: Intel(R) Data Center GPU Max 1100 |
|           | Vendor Name: Intel(R) Corporation              |
|           | SOC UUID: 00000000-0000-003a-0000-002f0bda8086 |
|           | PCI BDF Address: 0000:3a:00.0                  |
|           | DRM Device: /dev/dri/card2                     |
|           | Function Type: physical                        |
+-----------+------------------------------------------------+
| 2         | Device Name: Intel(R) Data Center GPU Max 1100 |
|           | Vendor Name: Intel(R) Corporation              |
|           | SOC UUID: 00000000-0000-009a-0000-002f0bda8086 |
|           | PCI BDF Address: 0000:9a:00.0                  |
|           | DRM Device: /dev/dri/card3                     |
|           | Function Type: physical                        |
+-----------+------------------------------------------------+
| 3         | Device Name: Intel(R) Data Center GPU Max 1100 |
|           | Vendor Name: Intel(R) Corporation              |
|           | SOC UUID: 00000000-0000-00ca-0000-002f0bda8086 |
|           | PCI BDF Address: 0000:ca:00.0                  |
|           | DRM Device: /dev/dri/card4                     |
|           | Function Type: physical                        |
+-----------+------------------------------------------------+

Intel Xeon CPU used by this notebook:
Model name: Intel(R) Xeon(R) Platinum 8480+
I discovered that Intel GPUs don't seem to be supported: when I originally tried to run my training job across the 4 GPUs, I got the following:
$ tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/recipes/configs/llama3/8B_qlora_single_device.yaml
Running with torchrun...
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757]
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] *****************************************
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] *****************************************
Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py", line 652, in <module>
    sys.exit(recipe_main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py", line 641, in recipe_main
    init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl")
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1312, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1533, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(
ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
[identical tracebacks from the other three ranks omitted]
E0517 11:41:43.068940 23389872468672 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 400561) of binary: /opt/intel/oneapi/intelpython/bin/python3.9
Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 177, in _run_cmd
    self._run_distributed(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 88, in _run_distributed
    run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py FAILED
Failures:
[1]:
  time      : 2024-05-17_11:41:43
  host      : idc-beta-batch-pvc-node-18
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 400562)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-17_11:41:43
  host      : idc-beta-batch-pvc-node-18
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 400563)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-05-17_11:41:43
  host      : idc-beta-batch-pvc-node-18
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 400565)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2024-05-17_11:41:43
  host      : idc-beta-batch-pvc-node-18
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 400561)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
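For what it's worth, the ValueError above traces back to the hardcoded backend choice in the recipe, init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl"). One possible generalization, assuming Intel's oneCCL bindings for PyTorch (oneccl_bindings_for_pytorch) are installed and register a "ccl" backend, would be a sketch like:

# Hypothetical backend selection for the recipe's recipe_main
# (a sketch, not torchtune's actual code).
from torch.distributed import init_process_group

def _pick_backend(device: str) -> str:
    if device == "cpu":
        return "gloo"
    if device == "xpu":
        # Importing the bindings is what registers the "ccl" backend
        # with torch.distributed.
        import oneccl_bindings_for_pytorch  # noqa: F401
        return "ccl"
    return "nccl"

# Inside recipe_main:
# init_process_group(backend=_pick_backend(cfg.device))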
u2b3e96b2fc320ef8c781f51df67225d@idc-beta-batch-pvc-node-18:~$ tune run lora_finetune_single_device --config /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/recipes/configs/llama3/8B_qlora_single_device.yaml
INFO:torchtune.utils.logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:
batch_size: 2
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/original/
  checkpoint_files:
Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 179, in _run_cmd
    self._run_single_device(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 93, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 550, in <module>
    sys.exit(recipe_main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 543, in recipe_main
    recipe = LoRAFinetuneRecipeSingleDevice(cfg=cfg)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 100, in __init__
    self._device = utils.get_device(device=cfg.device)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/utils/_device.py", line 117, in get_device
    device = _setup_cuda_device(device)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/utils/_device.py", line 44, in _setup_cuda_device
    raise RuntimeError(
RuntimeError: The local rank is larger than the number of available GPUs.
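Looking at the single-device traceback, utils.get_device routes any non-CPU device through _setup_cuda_device, which is why the run dies with the error above on a machine with no CUDA GPUs. Purely as a sketch of the direction a fix could take: the helper names below are hypothetical, simply mirroring the CUDA helpers in the traceback, and the code assumes torch.xpu is provided by IPEX or an XPU-enabled PyTorch build.

import os
from typing import Optional

import torch

def _setup_xpu_device(device: torch.device) -> torch.device:
    # Mirror the CUDA path: bind the process to the XPU matching its local rank.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if device.index is None:
        device = torch.device("xpu", local_rank)
    if device.index >= torch.xpu.device_count():
        raise RuntimeError("The local rank is larger than the number of available XPUs.")
    torch.xpu.set_device(device)
    return device

def get_device(device: Optional[str] = None) -> torch.device:
    # Default to XPU when one is available, otherwise fall back to CPU.
    if device is None:
        device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
    resolved = torch.device(device)
    if resolved.type == "xpu":
        resolved = _setup_xpu_device(resolved)
    return resolved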