pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

xpu: huggingface levit test_retain_grad_hidden_states_attentions test hangs on exit on PVC #136007

Closed dvrogozh closed 1 week ago

dvrogozh commented 1 month ago

With:

# apt-cache show intel-for-pytorch-gpu-dev-0.5 | grep Version | head -1
Version: 0.5.3-30
# apt-cache show intel-level-zero-gpu | grep Version | head -1
Version: 1.3.30049.10-950~22.04
# apt-cache show intel-opencl-icd | grep Version | head -1
Version: 24.26.30049.10-950~22.04
# uname -a
Linux willow-spr03.fm.intel.com 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# sudo apt-cache show intel-i915-dkms | grep Version | head -1
Version: 1.24.4.12.240603.18+i23-1

The following two tests (important: run them together!) hang on exit when executed on an Intel Data Center GPU Max Series (PVC) card:

TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest \
    tests/models/levit/test_modeling_levit.py::LevitModelTest::test_hidden_states_output \
    tests/models/levit/test_modeling_levit.py::LevitModelTest::test_retain_grad_hidden_states_attentions

<...>
==== 2 passed, 2 warnings in 3.15s ===
^C    # <<<<< HANG here

Observations:

For reference, my spec.py file is below:

$ cat spec.py
import torch

# Specify the device name (eg. 'cuda', 'cpu', 'npu', 'xpu', 'mps')
DEVICE_NAME = 'xpu'

# Specify device-specific backends to dispatch to.
# If not specified, will fall back to 'default' in `testing_utils.py`
MANUAL_SEED_FN = torch.xpu.manual_seed
EMPTY_CACHE_FN = torch.xpu.empty_cache
DEVICE_COUNT_FN = torch.xpu.device_count

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @EikanWang @fengyuan14 @guangyey

riverliuintel commented 1 month ago

@retonym please have a look at this failure.

dvrogozh commented 1 month ago

@retonym : any update on this issue?

fengyuan14 commented 1 month ago

@dvrogozh Does the application use multiple XPU devices (a 2-tile PVC)?

dvrogozh commented 1 month ago

@fengyuan14 : according to xpu-smi, no. See the info below. By the way, note that I did execute the whole "tests/models/" suite and only this test (tests/models/levit/test_modeling_levit.py) hangs.

$ ls /dev/dri/
by-path  card0  card1  renderD128

$ cat /sys/class/drm/card0/device/device
0x2000  # that's the ASPEED GPU, not the Intel one

$ cat /sys/class/drm/card1/device/device
0x0bda

$ sudo xpu-smi discovery --dump -1 --json
{
    "device_list": [
        {
            "amc_firmware_name": "AMC",
            "amc_firmware_version": "6.5.0.0",
            "core_clock_rate_mhz": "1550",
            "device_id": 0,
            "device_name": "Intel(R) Data Center GPU Max 1100",
            "device_stepping": "B4",
            "device_type": "GPU",
            "driver_version": "I915_24.4.12_PSB_240603.18",
            "drm_device": "/dev/dri/card1",
            "gfx_data_firmware_name": "GFX_DATA",
            "gfx_data_firmware_version": "",
            "gfx_firmware_name": "GFX",
            "gfx_firmware_status": "normal",
            "gfx_firmware_version": "PVC2_1.23166",
            "gfx_pscbin_firmware_name": "GFX_PSCBIN",
            "gfx_pscbin_firmware_version": "0x12d21.0x20220830",
            "kernel_version": "5.15.0-86-generic",
            "max_command_queue_priority": "0",
            "max_hardware_contexts": "65536",
            "max_mem_alloc_size_byte": "48946688000",
            "memory_ecc_state": "enabled",
            "number_of_eus": "448",
            "number_of_eus_per_sub_slice": "8",
            "number_of_media_engines": "0",
            "number_of_media_enh_engines": "0",
            "number_of_slices": "1",
            "number_of_sub_slices_per_slice": "56",
            "number_of_threads_per_eu": "8",
            "number_of_tiles": "1",
            "pci_bdf_address": "0000:21:00.0",
            "pci_device_id": "0xbda",
            "pci_slot": "CPU0_PE2_PCIE0",
            "pci_vendor_id": "0x8086",
            "pcie_generation": "5",
            "pcie_max_link_width": "16",
            "physical_eu_simd_width": "16",
            "serial_number": "WTP232000351",
            "sku_type": "Production ES",
            "uuid": "00000000-0000-0000-fca0-22a7892457c1",
            "vendor_name": "Intel(R) Corporation",
            "xe_link_calibration_date": "Not Calibrated"
        }
    ]
}

dvrogozh commented 1 month ago

I retested on https://github.com/pytorch/pytorch/commit/5a29a06aa3c2dbb3afef730fbf9e7c0e2f308bf7 and probably made a mistake in my initial try. As of now I see that the hang is associated with a single test, not two tests as I initially thought. So, this hangs:

TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest \
    tests/models/levit/test_modeling_levit.py::LevitModelTest::test_retain_grad_hidden_states_attentions

I also created a shorter reproducer by extracting the logic from the above test. It turns out the hang is associated with the call to outputs[0].flatten()[0].backward(retain_graph=True); with that call commented out, I do not see a hang.

$ cat repro.py
import random
import torch
from transformers import LevitConfig, LevitModel

config = LevitConfig(
    image_size=64,
    num_channels=3,
    kernel_size=3,
    stride=2,
    padding=1,
    patch_size=16,
    hidden_sizes=[16, 32, 48],
    num_attention_heads=[1, 2, 3],
    depths=[2, 3, 4],
    key_dim=[8, 8, 8],
    drop_path_rate=0,
    mlp_ratio=[2, 2, 2],
    attention_ratio=[2, 2, 2],
    initializer_range=0.02,
    down_ops=[['Subsample', 8, 2, 4, 2, 2], ['Subsample', 8, 4, 4, 2, 2]])

model = LevitModel(config).to("xpu")

def floats_tensor(shape, scale=1.0, rng=None, name=None):
    """Creates a random float32 tensor"""
    if rng is None:
        rng = random.Random()

    total_dims = 1
    for dim in shape:
        total_dims *= dim

    values = []
    for _ in range(total_dims):
        values.append(rng.random() * scale)

    return torch.tensor(data=values, dtype=torch.float, device="xpu").view(shape).contiguous()

inputs = {"pixel_values": floats_tensor([13, 3, 64, 64])}

outputs = model(**inputs, output_hidden_states=True)

last_hidden_states = outputs.last_hidden_state
print(list(last_hidden_states.shape))

hidden_states = outputs.hidden_states[0]
hidden_states.retain_grad()
print('>>> before backward()')
outputs[0].flatten()[0].backward(retain_graph=True)
print('>>> after backward()')

Output will be:

$ python3 repro.py
[13, 1, 48]
>>> before backward()
>>> after backward()
^C
dvrogozh commented 1 month ago

Using https://github.com/intel/pti-gpu/tree/master/tools/unitrace with unitrace -c python3 repro.py, the last kernel enqueued is shown below. It's conv_reorder. However, it's not clear from this trace whether that is the kernel which actually hangs on the GPU, due to potential batching. Unfortunately, MakeEachEnqueueBlocking=1 NEOReadDebugKeys=1 might not currently be working to guarantee that.

...
>>>> [1224250728124776] zeCommandListAppendLaunchKernel: hCommandList = 94519027971072 hKernel = 23247354279296 (conv_reorder) pLaunchFuncArgs = 23247486577624 {2, 9, 1} hSignalEvent = 23252087784416 numWaitEvents = 1 phWaitEvents = 23247198780800 (hWaitEvents = [23252080760416])
<<<< [1224250728128223] zeCommandListAppendLaunchKernel [1919 ns] hWaitEvents = 23252080760416 -> ZE_RESULT_SUCCESS(0x0)
dvrogozh commented 1 month ago

Using OverrideImmediateCmdListSynchronousMode=1, which seems to do the trick of executing kernels one by one, points to the hang happening in the at::native::xpu::BatchNormBackwardReduceChannelsLastKernelFunctor<4, float, float, float> kernel. I deliberately kept more of the log before that call to show what the printout looks like for a kernel which finished execution.

The kernel is defined here: https://github.com/intel/torch-xpu-ops/blob/0ab67fbe746a5e9d80ffabcabf98e649d9504feb/src/ATen/native/xpu/sycl/BatchNormKernels.cpp#L1891

$ OverrideImmediateCmdListSynchronousMode=1 NEOReadDebugKeys=1 /home/dvrogozh/git/pti-gpu/tools/unitrace/_build/unitrace -c --demangle python3 repro.py
...
>>>> [1236133338262740] zeCommandListAppendLaunchKernel: hCommandList = 94458986970624 hKernel = 94459049017408 (at::native::xpu::VectorizedElementwiseKernel<4, at::native::xpu::FillFunctor<float>, at::detail::Array<char*, 1>, TrivialOffsetCalculator<0, unsigned int> >) pLaunchFuncArgs = 23379150699800 {1, 1, 1} hSignalEvent = 94459062284464 numWaitEvents = 1 phWaitEvents = 23383760654336 (hWaitEvents = [94459070029088])
<<<< [1236133338275392] zeCommandListAppendLaunchKernel [7094 ns] hWaitEvents = 94459070029088 -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338297944] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 0 argSize = 8 pArgValue = 23383759577984
<<<< [1236133338301713] zeKernelSetArgumentValue [1163 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338303325] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 1 argSize = 8 pArgValue = 23383759577992
<<<< [1236133338306550] zeKernelSetArgumentValue [177 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338308761] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 2 argSize = 8 pArgValue = 23383759578000
<<<< [1236133338310272] zeKernelSetArgumentValue [191 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338314476] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 3 argSize = 8 pArgValue = 23383759578008
<<<< [1236133338316658] zeKernelSetArgumentValue [186 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338317970] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 4 argSize = 8 pArgValue = 23383759578016
<<<< [1236133338319482] zeKernelSetArgumentValue [173 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338320919] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 5 argSize = 8 pArgValue = 23383759578024
<<<< [1236133338322373] zeKernelSetArgumentValue [151 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338323929] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 6 argSize = 8 pArgValue = 23383759578032
<<<< [1236133338327657] zeKernelSetArgumentValue [174 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338329671] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 7 argSize = 8 pArgValue = 23383759578040
<<<< [1236133338331629] zeKernelSetArgumentValue [118 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338332839] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 8 argSize = 8 pArgValue = 0
<<<< [1236133338336592] zeKernelSetArgumentValue [44 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338339049] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 9 argSize = 8 pArgValue = 0
<<<< [1236133338340095] zeKernelSetArgumentValue [36 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338341239] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 10 argSize = 4 pArgValue = 23383759578064
<<<< [1236133338343418] zeKernelSetArgumentValue [87 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338344657] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 11 argSize = 4 pArgValue = 23383759578068
<<<< [1236133338348571] zeKernelSetArgumentValue [52 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338350530] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 12 argSize = 1024 pArgValue = 0
<<<< [1236133338353992] zeKernelSetArgumentValue [102 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338356002] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 13 argSize = 1024 pArgValue = 0
<<<< [1236133338357156] zeKernelSetArgumentValue [48 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338359171] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 14 argSize = 1 pArgValue = 0
<<<< [1236133338362837] zeKernelSetArgumentValue [73 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338365218] zeKernelSetGroupSize: hKernel = 23383748151184 groupSizeX = 64 groupSizeY = 4 groupSizeZ = 1
<<<< [1236133338367996] zeKernelSetGroupSize [666 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338370307] zeEventHostReset: hEvent = 94459044917888
<<<< [1236133338371547] zeEventHostReset [127 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338374830] zeCommandListAppendLaunchKernel: hCommandList = 94458986970624 hKernel = 23383748151184 (at::native::xpu::BatchNormBackwardReduceChannelsLastKernelFunctor<4, float, float, float>) pLaunchFuncArgs = 23379150705384 {3, 1, 1} hSignalEvent = 94459044917888 numWaitEvents = 1 phWaitEvents = 23383759618608 (hWaitEvents = [94459062284464])
fengyuan14 commented 1 month ago

@xytintel Please take a look at the hang after launching at::native::xpu::BatchNormBackwardReduceChannelsLastKernelFunctor<4, float, float, float>.

dvrogozh commented 1 month ago

https://github.com/intel/torch-xpu-ops/pull/940

This PR did not fix the issue on my side. This might be due to a different GPU: mine is a Max 1100, while @xytintel has a Max 1500.

By gradually commenting out the kernel code, I isolated the hang to this item.barrier() call:

https://github.com/intel/torch-xpu-ops/blob/229a887eaff94557a1960eb6d74d19014e5db289/src/ATen/native/xpu/sycl/BatchNormKernels.cpp#L1877

dvrogozh commented 1 month ago

@xytintel : On my side the hang is associated with this early exit from the kernel: https://github.com/intel/torch-xpu-ops/blob/151a19e10bea64a3484bd36ae70476c5c5e289ed/src/ATen/native/xpu/sycl/BatchNormKernels.cpp#L1913. If a synchronization barrier followed by a return (see below) is added before this check, there is no hang. If it is added afterwards, there is a hang.

    //item.barrier(sycl_local_fence); return;  // NO HANG
    if (c_offset >= stride_ || m_offset >= reduction_size_) {
      return;
    }
    item.barrier(sycl_local_fence); return;  // HANG

Interestingly, the CUDA kernel is basically the same: it has the same check and similar synchronization code. I wonder, is there a difference in how synchronization works in CUDA and SYCL? Does CUDA synchronize only the remaining threads, while SYCL synchronizes all initially scheduled threads, so that when some exit early we get a hang?

https://github.com/pytorch/pytorch/blob/156ca01e51f766b1b069c5c6f3d57112a5c8f9ff/aten/src/ATen/native/cuda/Normalization.cuh#L1199
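
To make the suspicion concrete, below is a minimal SYCL sketch (not the torch-xpu-ops kernel; the sizes, buffer name, and work split are made up for illustration). item.barrier() is a work-group collective: every work-item of the group must reach it. If the "extra" work-items return before the barrier, the remaining items wait for them forever, which matches the observed hang. The safe pattern keeps all work-items alive through the barrier and predicates only the non-collective work:

    // divergent_barrier_sketch.cpp -- illustrative only
    #include <sycl/sycl.hpp>

    int main() {
      sycl::queue q;
      constexpr int n = 100;        // real elements
      constexpr int wg = 64;        // work-group size
      constexpr int global = 128;   // rounded up: 28 work-items are "extra"
      int *data = sycl::malloc_shared<int>(n, q);
      for (int i = 0; i < n; ++i) data[i] = 1;

      q.parallel_for(
           sycl::nd_range<1>{global, wg},
           [=](sycl::nd_item<1> it) {
             const int i = it.get_global_id(0);
             // Hazard: "if (i >= n) return;" here would skip the barrier below
             // for the extra work-items and hang the rest of the group.
             const bool in_range = i < n;
             const int val = in_range ? data[i] : 0;  // predicated load
             it.barrier(sycl::access::fence_space::local_space);  // all items arrive
             if (in_range) data[i] = val + 1;         // predicated store
           })
          .wait();

      sycl::free(data, q);
      return 0;
    }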

dvrogozh commented 1 month ago

OK, I looked into the documentation and had a few offline discussions. Summary below.

From the above, I think there are only 2 choices to resolve the issue:

  1. Modify the kernel logic by removing the early exit and making sure that the barriers are hit by all threads. This is probably the best option, but it comes at the cost of adding a number of conditional checks so that non-barrier code is not executed on the "early exit threads". Unfortunately the kernel has multiple barrier calls, some of which are inside for loops. Anyhow, the code should look similar to the following (a fuller sketch follows this list):
    if (condition) { ... }
    item.barrier()
    if (condition) { ... }
  2. Modify the kernel scheduling logic to make sure that no threads hit the early exit. This changes the pattern of how threads are scheduled and might have a performance drawback.
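
For option 1, here is a hedged sketch of the restructuring (illustrative only; the function name, loop bound, and local buffer are placeholders, not the real BatchNormBackwardReduceChannelsLastKernelFunctor code). The former early-return condition becomes a predicate, the loop and its barriers are executed by every work-item, and only the non-collective work is guarded:

    #include <sycl/sycl.hpp>

    // Pattern sketch: all work-items run the loop and hit every barrier;
    // out-of-range items simply skip the loads/stores.
    void reduce_body(sycl::nd_item<2> item, float *local_buf,
                     int c_offset, int m_offset,
                     int stride, int reduction_size) {
      // Former early exit:
      //   if (c_offset >= stride || m_offset >= reduction_size) return;
      // becomes a predicate instead of a return:
      const bool valid = (c_offset < stride) && (m_offset < reduction_size);

      float acc = 0.f;
      for (int step = 0; step < 4; ++step) {        // placeholder loop bound
        if (valid) {
          acc += 1.f;                               // placeholder per-item work
        }
        // Every work-item, valid or not, reaches the collective:
        item.barrier(sycl::access::fence_space::local_space);
        if (valid) {
          local_buf[item.get_local_linear_id()] = acc;  // placeholder store
        }
        item.barrier(sycl::access::fence_space::local_space);
      }
    }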

CC: @Pennycook CC: @xytintel @EikanWang @fengyuan14 @guangyey @jgong5

dvrogozh commented 1 month ago

@xytintel : Thank you for the quick update of https://github.com/intel/torch-xpu-ops/pull/940. The latest version does fix the issue on my side. See details in https://github.com/intel/torch-xpu-ops/pull/940#pullrequestreview-2357880796.

EikanWang commented 1 week ago

@dvrogozh , the issue should have been resolved on PyTorch main. I will close this issue; please feel free to reopen it if the issue is still there.

dvrogozh commented 1 week ago

Yes, this seems to have landed in pytorch main. Fine to close.