raymondbernard opened this issue 1 month ago
I don't believe PyTorch supports Intel GPUs natively. You might need to install a third-party package to enable this. See https://pytorch.org/tutorials/recipes/intel_extension_for_pytorch.html for an example; this is an official Intel package that provides better support for Intel CPUs and GPUs. That said, I can't guarantee that all torchtune features will work with this extension. Let me know how it goes!
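If you try it, a quick sanity check that the extension can actually see your GPUs might look something like this (a sketch; it assumes an XPU-enabled build of IPEX):

import torch
# Importing intel_extension_for_pytorch registers the "xpu" device with PyTorch
import intel_extension_for_pytorch as ipex

print("torch:", torch.__version__, "| ipex:", ipex.__version__)
print("XPU available:", torch.xpu.is_available())
print("XPU device count:", torch.xpu.device_count())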
@RdoubleA I will give it a shot and let you know. I will start by adjusting a single-GPU recipe and config file to see if I can get it to work.
I am relatively new to torchtune. From what I understand, it is designed to facilitate LLM training on consumer-grade Nvidia GPUs, but it involves some deep abstractions, so this is more complex than it initially seems. Here are the steps needed to run on Intel hardware:

1. Import intel_extension_for_pytorch as ipex.
2. Call the ipex.optimize function for additional performance enhancements; it applies optimizations to both the model and the optimizer.

Here's an example implementation:
import torch
import intel_extension_for_pytorch as ipex

# Initialize the model, criterion, and optimizer
model = Model()
criterion = ...
optimizer = ...
model.train()

# Move the model and loss criterion to XPU before calling ipex.optimize()
model = model.to("xpu")
criterion = criterion.to("xpu")

# Optimize the model and optimizer; pick ONE of the two calls below.
# For Float32:
model, optimizer = ipex.optimize(model, optimizer=optimizer)
# For BFloat16:
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

# Prepare the dataloader
dataloader = ...

for input, target in dataloader:
    input = input.to("xpu")
    target = target.to("xpu")
    optimizer.zero_grad()

    # For Float32:
    output = model(input)
    # For BFloat16, wrap the forward pass in autocast instead:
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
        output = model(input)

    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
It would be great if the maintainers could point us in the right direction. I will take this up again this week.
@RdoubleA Intel GPUs are supported in PyTorch 2.3.0: https://github.com/pytorch/pytorch?tab=readme-ov-file#intel-gpu-support
We should be able to support Intel GPUs! We are using the Intel Developer Cloud. Please advise.
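A quick way to verify what a given PyTorch build actually exposes (guarding for builds without the XPU backend, since this is not in every release):

import torch

# Probe the installed PyTorch build for a native XPU backend.
if hasattr(torch, "xpu") and torch.xpu.is_available():
    print(f"PyTorch {torch.__version__}: native XPU backend is available")
else:
    print(f"PyTorch {torch.__version__}: this build has no usable XPU backend")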
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

Python 3.9.18 (tags/v3.9.18-26-g6b320c3b2f6-dirty:6b320c3b2f6, Sep 28 2023, 00:35:27) [GCC 13.2.0] :: Intel Corporation on linux
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
Notebook commands:

!echo "List of Intel GPUs available on the system:"
!xpu-smi discovery 2> /dev/null
!echo "Intel Xeon CPU used by this notebook:"
!lscpu | grep "Model name"
List of Intel GPUs available on the system:
+-----------+------------------------------------------------+
| Device ID | Device Information                             |
+-----------+------------------------------------------------+
| 0         | Device Name: Intel(R) Data Center GPU Max 1100 |
|           | Vendor Name: Intel(R) Corporation              |
|           | SOC UUID: 00000000-0000-0029-0000-002f0bda8086 |
|           | PCI BDF Address: 0000:29:00.0                  |
|           | DRM Device: /dev/dri/card0                     |
|           | Function Type: physical                        |
+-----------+------------------------------------------------+
| 1         | Device Name: Intel(R) Data Center GPU Max 1100 |
|           | Vendor Name: Intel(R) Corporation              |
|           | SOC UUID: 00000000-0000-003a-0000-002f0bda8086 |
|           | PCI BDF Address: 0000:3a:00.0                  |
|           | DRM Device: /dev/dri/card2                     |
|           | Function Type: physical                        |
+-----------+------------------------------------------------+
| 2         | Device Name: Intel(R) Data Center GPU Max 1100 |
|           | Vendor Name: Intel(R) Corporation              |
|           | SOC UUID: 00000000-0000-009a-0000-002f0bda8086 |
|           | PCI BDF Address: 0000:9a:00.0                  |
|           | DRM Device: /dev/dri/card3                     |
|           | Function Type: physical                        |
+-----------+------------------------------------------------+
| 3         | Device Name: Intel(R) Data Center GPU Max 1100 |
|           | Vendor Name: Intel(R) Corporation              |
|           | SOC UUID: 00000000-0000-00ca-0000-002f0bda8086 |
|           | PCI BDF Address: 0000:ca:00.0                  |
|           | DRM Device: /dev/dri/card4                     |
|           | Function Type: physical                        |
+-----------+------------------------------------------------+

Intel Xeon CPU used by this notebook:
Model name: Intel(R) Xeon(R) Platinum 8480+
I discovered that Intel GPUs don't seem to be supported: when I originally tried to run my training job across the 4 GPUs, I got the following:
$ tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/recipes/configs/llama3/8B_qlora_single_device.yaml
Running with torchrun...
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757]
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] *****************************************
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] *****************************************
Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py", line 652, in <module>
    sys.exit(recipe_main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py", line 641, in recipe_main
    init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl")
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1312, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1533, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(
ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
[identical tracebacks from the other three ranks omitted]
E0517 11:41:43.068940 23389872468672 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 400561) of binary: /opt/intel/oneapi/intelpython/bin/python3.9
Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 177, in _run_cmd
    self._run_distributed(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 88, in _run_distributed
    run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py FAILED
Failures:
[1]:
  time      : 2024-05-17_11:41:43
  host      : idc-beta-batch-pvc-node-18
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 400562)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-17_11:41:43
  host      : idc-beta-batch-pvc-node-18
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 400563)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-05-17_11:41:43
  host      : idc-beta-batch-pvc-node-18
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 400565)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2024-05-17_11:41:43
  host      : idc-beta-batch-pvc-node-18
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 400561)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
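For what it's worth, the ValueError above traces back to the hardcoded backend choice in the recipe, init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl"). One possible generalization, assuming Intel's oneCCL bindings for PyTorch (oneccl_bindings_for_pytorch) are installed and register a "ccl" backend, would be a sketch like:

# Hypothetical backend selection for the recipe's recipe_main
# (a sketch, not torchtune's actual code).
from torch.distributed import init_process_group

def _pick_backend(device: str) -> str:
    if device == "cpu":
        return "gloo"
    if device == "xpu":
        # Importing the bindings is what registers the "ccl" backend
        # with torch.distributed.
        import oneccl_bindings_for_pytorch  # noqa: F401
        return "ccl"
    return "nccl"

# Inside recipe_main:
# init_process_group(backend=_pick_backend(cfg.device))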
u2b3e96b2fc320ef8c781f51df67225d@idc-beta-batch-pvc-node-18:~$ tune run lora_finetune_single_device --config /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/recipes/configs/llama3/8B_qlora_single_device.yaml
INFO:torchtune.utils.logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:
batch_size: 2
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/original/
  checkpoint_files:
Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 179, in _run_cmd
    self._run_single_device(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 93, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 550, in <module>
    sys.exit(recipe_main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 543, in recipe_main
    recipe = LoRAFinetuneRecipeSingleDevice(cfg=cfg)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 100, in __init__
    self._device = utils.get_device(device=cfg.device)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/utils/_device.py", line 117, in get_device
    device = _setup_cuda_device(device)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/utils/_device.py", line 44, in _setup_cuda_device
    raise RuntimeError(
RuntimeError: The local rank is larger than the number of available GPUs.
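Looking at the single-device traceback, utils.get_device routes any non-CPU device through _setup_cuda_device, which is why the run dies with the error above on a machine with no CUDA GPUs. Purely as a sketch of the direction a fix could take: the helper names below are hypothetical, simply mirroring the CUDA helpers in the traceback, and the code assumes torch.xpu is provided by IPEX or an XPU-enabled PyTorch build.

import os
from typing import Optional

import torch

def _setup_xpu_device(device: torch.device) -> torch.device:
    # Mirror the CUDA path: bind the process to the XPU matching its local rank.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if device.index is None:
        device = torch.device("xpu", local_rank)
    if device.index >= torch.xpu.device_count():
        raise RuntimeError("The local rank is larger than the number of available XPUs.")
    torch.xpu.set_device(device)
    return device

def get_device(device: Optional[str] = None) -> torch.device:
    # Default to XPU when one is available, otherwise fall back to CPU.
    if device is None:
        device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
    resolved = torch.device(device)
    if resolved.type == "xpu":
        resolved = _setup_xpu_device(resolved)
    return resolved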