pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

xpu: huggingface levit test_retain_grad_hidden_states_attentions test hangs on exit on PVC #136007

Closed dvrogozh closed 1 week ago

dvrogozh commented 1 month ago

With:

# apt-cache show intel-for-pytorch-gpu-dev-0.5 | grep Version | head -1
Version: 0.5.3-30
# apt-cache show intel-level-zero-gpu | grep Version | head -1
Version: 1.3.30049.10-950~22.04
# apt-cache show intel-opencl-icd | grep Version | head -1
Version: 24.26.30049.10-950~22.04
# uname -a
Linux willow-spr03.fm.intel.com 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# sudo apt-cache show intel-i915-dkms | grep Version | head -1
Version: 1.24.4.12.240603.18+i23-1

The following two tests (important: run them together!) hang on exit when executed on an Intel Data Center GPU Max Series (PVC) card:

TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest \
    tests/models/levit/test_modeling_levit.py::LevitModelTest::test_hidden_states_output \
    tests/models/levit/test_modeling_levit.py::LevitModelTest::test_retain_grad_hidden_states_attentions

<...>
==== 2 passed, 2 warnings in 3.15s ===
^C    # <<<<< HANG here

Observations:

For reference, my spec.py file is below:

$ cat spec.py
import torch

# Specify the device name (eg. 'cuda', 'cpu', 'npu', 'xpu', 'mps')
DEVICE_NAME = 'xpu'

# Specify device-specific backends to dispatch to.
# If not specified, will fall back to 'default' in `testing_utils.py`
MANUAL_SEED_FN = torch.xpu.manual_seed
EMPTY_CACHE_FN = torch.xpu.empty_cache
DEVICE_COUNT_FN = torch.xpu.device_count

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @EikanWang @fengyuan14 @guangyey

riverliuintel commented 1 month ago

@retonym please have a look at this failure.

dvrogozh commented 1 month ago

@retonym : any update on this issue?

fengyuan14 commented 1 month ago

@dvrogozh Does the application use multiple XPU devices (a 2-tile PVC)?

dvrogozh commented 1 month ago

@fengyuan14 : according to xpu-smi, no. See the info below. By the way, note that I did execute the whole "tests/models/" suite and only this test (tests/models/levit/test_modeling_levit.py) hangs.

$ ls /dev/dri/
by-path  card0  card1  renderD128

$ cat /sys/class/drm/card0/device/device
0x2000  # that's the ASPEED GPU, not the Intel one

$ cat /sys/class/drm/card1/device/device
0x0bda

$ sudo xpu-smi discovery --dump -1 --json
{
    "device_list": [
        {
            "amc_firmware_name": "AMC",
            "amc_firmware_version": "6.5.0.0",
            "core_clock_rate_mhz": "1550",
            "device_id": 0,
            "device_name": "Intel(R) Data Center GPU Max 1100",
            "device_stepping": "B4",
            "device_type": "GPU",
            "driver_version": "I915_24.4.12_PSB_240603.18",
            "drm_device": "/dev/dri/card1",
            "gfx_data_firmware_name": "GFX_DATA",
            "gfx_data_firmware_version": "",
            "gfx_firmware_name": "GFX",
            "gfx_firmware_status": "normal",
            "gfx_firmware_version": "PVC2_1.23166",
            "gfx_pscbin_firmware_name": "GFX_PSCBIN",
            "gfx_pscbin_firmware_version": "0x12d21.0x20220830",
            "kernel_version": "5.15.0-86-generic",
            "max_command_queue_priority": "0",
            "max_hardware_contexts": "65536",
            "max_mem_alloc_size_byte": "48946688000",
            "memory_ecc_state": "enabled",
            "number_of_eus": "448",
            "number_of_eus_per_sub_slice": "8",
            "number_of_media_engines": "0",
            "number_of_media_enh_engines": "0",
            "number_of_slices": "1",
            "number_of_sub_slices_per_slice": "56",
            "number_of_threads_per_eu": "8",
            "number_of_tiles": "1",
            "pci_bdf_address": "0000:21:00.0",
            "pci_device_id": "0xbda",
            "pci_slot": "CPU0_PE2_PCIE0",
            "pci_vendor_id": "0x8086",
            "pcie_generation": "5",
            "pcie_max_link_width": "16",
            "physical_eu_simd_width": "16",
            "serial_number": "WTP232000351",
            "sku_type": "Production ES",
            "uuid": "00000000-0000-0000-fca0-22a7892457c1",
            "vendor_name": "Intel(R) Corporation",
            "xe_link_calibration_date": "Not Calibrated"
        }
    ]
}

dvrogozh commented 1 month ago

I retested on https://github.com/pytorch/pytorch/commit/5a29a06aa3c2dbb3afef730fbf9e7c0e2f308bf7 and probably made a mistake in my initial try. As of now I see that the hang is associated with a single test, not two tests as I initially thought. So, this hangs:

TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest \
    tests/models/levit/test_modeling_levit.py::LevitModelTest::test_retain_grad_hidden_states_attentions

I also created a shorter reproducer by extracting the logic from the above test. It turns out the hang is associated with the call to outputs[0].flatten()[0].backward(retain_graph=True); with that call commented out, I do not see a hang.

$ cat repro.py
import random
import torch
from transformers import LevitConfig, LevitModel

config = LevitConfig(
    image_size=64,
    num_channels=3,
    kernel_size=3,
    stride=2,
    padding=1,
    patch_size=16,
    hidden_sizes=[16, 32, 48],
    num_attention_heads=[1, 2, 3],
    depths=[2, 3, 4],
    key_dim=[8, 8, 8],
    drop_path_rate=0,
    mlp_ratio=[2, 2, 2],
    attention_ratio=[2, 2, 2],
    initializer_range=0.02,
    down_ops=[['Subsample', 8, 2, 4, 2, 2], ['Subsample', 8, 4, 4, 2, 2]])

model = LevitModel(config).to("xpu")

def floats_tensor(shape, scale=1.0, rng=None, name=None):
    """Creates a random float32 tensor"""
    if rng is None:
        rng = random.Random()

    total_dims = 1
    for dim in shape:
        total_dims *= dim

    values = []
    for _ in range(total_dims):
        values.append(rng.random() * scale)

    return torch.tensor(data=values, dtype=torch.float, device="xpu").view(shape).contiguous()

inputs = {"pixel_values": floats_tensor([13, 3, 64, 64])}

outputs = model(**inputs, output_hidden_states=True)

last_hidden_states = outputs.last_hidden_state
print(list(last_hidden_states.shape))

hidden_states = outputs.hidden_states[0]
hidden_states.retain_grad()
print('>>> before backward()')
outputs[0].flatten()[0].backward(retain_graph=True)
print('>>> after backward()')

Output will be:

$ python3 repro.py
[13, 1, 48]
>>> before backward()
>>> after backward()
^C
dvrogozh commented 1 month ago

Using https://github.com/intel/pti-gpu/tree/master/tools/unitrace with unitrace -c python3 repro.py, the last kernel enqueued is shown below. It's conv_reorder. However, it's not clear from this trace whether that is the kernel which actually hangs on the GPU, due to potential batching. Unfortunately, MakeEachEnqueueBlocking=1 NEOReadDebugKeys=1 might not currently be working to guarantee that.

...
>>>> [1224250728124776] zeCommandListAppendLaunchKernel: hCommandList = 94519027971072 hKernel = 23247354279296 (conv_reorder) pLaunchFuncArgs = 23247486577624 {2, 9, 1} hSignalEvent = 23252087784416 numWaitEvents = 1 phWaitEvents = 23247198780800 (hWaitEvents = [23252080760416])
<<<< [1224250728128223] zeCommandListAppendLaunchKernel [1919 ns] hWaitEvents = 23252080760416 -> ZE_RESULT_SUCCESS(0x0)
dvrogozh commented 1 month ago

Using OverrideImmediateCmdListSynchronousMode=1, which seems to do the trick of executing kernels one by one, points to the hang happening in the at::native::xpu::BatchNormBackwardReduceChannelsLastKernelFunctor<4, float, float, float> kernel. I deliberately kept more of the log before that call to show what the printout looks like for a kernel which finished execution.

The kernel is defined here: https://github.com/intel/torch-xpu-ops/blob/0ab67fbe746a5e9d80ffabcabf98e649d9504feb/src/ATen/native/xpu/sycl/BatchNormKernels.cpp#L1891

$ OverrideImmediateCmdListSynchronousMode=1 NEOReadDebugKeys=1 /home/dvrogozh/git/pti-gpu/tools/unitrace/_build/unitrace -c --demangle python3 repro.py
...
>>>> [1236133338262740] zeCommandListAppendLaunchKernel: hCommandList = 94458986970624 hKernel = 94459049017408 (at::native::xpu::VectorizedElementwiseKernel<4, at::native::xpu::FillFunctor<float>, at::detail::Array<char*, 1>, TrivialOffsetCalculator<0, unsigned int> >) pLaunchFuncArgs = 23379150699800 {1, 1, 1} hSignalEvent = 94459062284464 numWaitEvents = 1 phWaitEvents = 23383760654336 (hWaitEvents = [94459070029088])
<<<< [1236133338275392] zeCommandListAppendLaunchKernel [7094 ns] hWaitEvents = 94459070029088 -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338297944] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 0 argSize = 8 pArgValue = 23383759577984
<<<< [1236133338301713] zeKernelSetArgumentValue [1163 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338303325] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 1 argSize = 8 pArgValue = 23383759577992
<<<< [1236133338306550] zeKernelSetArgumentValue [177 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338308761] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 2 argSize = 8 pArgValue = 23383759578000
<<<< [1236133338310272] zeKernelSetArgumentValue [191 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338314476] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 3 argSize = 8 pArgValue = 23383759578008
<<<< [1236133338316658] zeKernelSetArgumentValue [186 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338317970] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 4 argSize = 8 pArgValue = 23383759578016
<<<< [1236133338319482] zeKernelSetArgumentValue [173 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338320919] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 5 argSize = 8 pArgValue = 23383759578024
<<<< [1236133338322373] zeKernelSetArgumentValue [151 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338323929] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 6 argSize = 8 pArgValue = 23383759578032
<<<< [1236133338327657] zeKernelSetArgumentValue [174 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338329671] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 7 argSize = 8 pArgValue = 23383759578040
<<<< [1236133338331629] zeKernelSetArgumentValue [118 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338332839] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 8 argSize = 8 pArgValue = 0
<<<< [1236133338336592] zeKernelSetArgumentValue [44 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338339049] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 9 argSize = 8 pArgValue = 0
<<<< [1236133338340095] zeKernelSetArgumentValue [36 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338341239] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 10 argSize = 4 pArgValue = 23383759578064
<<<< [1236133338343418] zeKernelSetArgumentValue [87 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338344657] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 11 argSize = 4 pArgValue = 23383759578068
<<<< [1236133338348571] zeKernelSetArgumentValue [52 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338350530] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 12 argSize = 1024 pArgValue = 0
<<<< [1236133338353992] zeKernelSetArgumentValue [102 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338356002] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 13 argSize = 1024 pArgValue = 0
<<<< [1236133338357156] zeKernelSetArgumentValue [48 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338359171] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 14 argSize = 1 pArgValue = 0
<<<< [1236133338362837] zeKernelSetArgumentValue [73 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338365218] zeKernelSetGroupSize: hKernel = 23383748151184 groupSizeX = 64 groupSizeY = 4 groupSizeZ = 1
<<<< [1236133338367996] zeKernelSetGroupSize [666 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338370307] zeEventHostReset: hEvent = 94459044917888
<<<< [1236133338371547] zeEventHostReset [127 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338374830] zeCommandListAppendLaunchKernel: hCommandList = 94458986970624 hKernel = 23383748151184 (at::native::xpu::BatchNormBackwardReduceChannelsLastKernelFunctor<4, float, float, float>) pLaunchFuncArgs = 23379150705384 {3, 1, 1} hSignalEvent = 94459044917888 numWaitEvents = 1 phWaitEvents = 23383759618608 (hWaitEvents = [94459062284464])
fengyuan14 commented 1 month ago

@xytintel Please take a look at the hang after launching at::native::xpu::BatchNormBackwardReduceChannelsLastKernelFunctor<4, float, float, float>.

dvrogozh commented 1 month ago

https://github.com/intel/torch-xpu-ops/pull/940

This PR did not fix the issue on my side. This might be due to a different GPU: mine is a Max 1100, while @xytintel has a Max 1500.

By gradually commenting out the kernel code, I isolated the hang to this item.barrier() call:

https://github.com/intel/torch-xpu-ops/blob/229a887eaff94557a1960eb6d74d19014e5db289/src/ATen/native/xpu/sycl/BatchNormKernels.cpp#L1877

dvrogozh commented 1 month ago

@xytintel : On my side the hang is associated with this early exit from the kernel: https://github.com/intel/torch-xpu-ops/blob/151a19e10bea64a3484bd36ae70476c5c5e289ed/src/ATen/native/xpu/sycl/BatchNormKernels.cpp#L1913. If a synchronization barrier followed by a return (see below) is added before this check, there is no hang. If it is added afterwards, there is a hang.

    //item.barrier(sycl_local_fence); return;  // NO HANG
    if (c_offset >= stride_ || m_offset >= reduction_size_) {
      return;
    }
    item.barrier(sycl_local_fence); return;  // HANG

Interestingly, the CUDA kernel is basically the same: it has the same check and similar synchronization code. I wonder, is there a difference in how synchronization works in CUDA and SYCL? Does CUDA synchronize only the remaining threads, while SYCL synchronizes all initially scheduled threads, so that when some exit early we get a hang?

https://github.com/pytorch/pytorch/blob/156ca01e51f766b1b069c5c6f3d57112a5c8f9ff/aten/src/ATen/native/cuda/Normalization.cuh#L1199
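
To make the suspicion concrete, below is a minimal SYCL sketch (not the torch-xpu-ops kernel; the sizes, buffer name, and work split are made up for illustration). item.barrier() is a work-group collective: every work-item of the group must reach it. If the "extra" work-items return before the barrier, the remaining items wait for them forever, which matches the observed hang. The safe pattern keeps all work-items alive through the barrier and predicates only the non-collective work:

    // divergent_barrier_sketch.cpp -- illustrative only
    #include <sycl/sycl.hpp>

    int main() {
      sycl::queue q;
      constexpr int n = 100;        // real elements
      constexpr int wg = 64;        // work-group size
      constexpr int global = 128;   // rounded up: 28 work-items are "extra"
      int *data = sycl::malloc_shared<int>(n, q);
      for (int i = 0; i < n; ++i) data[i] = 1;

      q.parallel_for(
           sycl::nd_range<1>{global, wg},
           [=](sycl::nd_item<1> it) {
             const int i = it.get_global_id(0);
             // Hazard: "if (i >= n) return;" here would skip the barrier below
             // for the extra work-items and hang the rest of the group.
             const bool in_range = i < n;
             const int val = in_range ? data[i] : 0;  // predicated load
             it.barrier(sycl::access::fence_space::local_space);  // all items arrive
             if (in_range) data[i] = val + 1;         // predicated store
           })
          .wait();

      sycl::free(data, q);
      return 0;
    }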

dvrogozh commented 1 month ago

OK, I looked into the documentation and had a few offline discussions. Summary below.

From the above, I think there are only 2 choices to resolve the issue:

  1. Modify the kernel logic by removing the early exit and making sure that the barriers are hit by all threads. This is probably the best option, but it comes at the cost of adding a number of conditional checks so that non-barrier code is not executed on the "early exit threads". Unfortunately the kernel has multiple barrier calls, some of which are inside for loops. Anyhow, the code should look similar to the following (a fuller sketch follows this list):
    if (condition) { ... }
    item.barrier()
    if (condition) { ... }
  2. Modify the kernel scheduling logic to make sure that no threads hit the early exit. This changes the pattern of how threads are scheduled and might have a performance drawback.
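
For option 1, here is a hedged sketch of the restructuring (illustrative only; the function name, loop bound, and local buffer are placeholders, not the real BatchNormBackwardReduceChannelsLastKernelFunctor code). The former early-return condition becomes a predicate, the loop and its barriers are executed by every work-item, and only the non-collective work is guarded:

    #include <sycl/sycl.hpp>

    // Pattern sketch: all work-items run the loop and hit every barrier;
    // out-of-range items simply skip the loads/stores.
    void reduce_body(sycl::nd_item<2> item, float *local_buf,
                     int c_offset, int m_offset,
                     int stride, int reduction_size) {
      // Former early exit:
      //   if (c_offset >= stride || m_offset >= reduction_size) return;
      // becomes a predicate instead of a return:
      const bool valid = (c_offset < stride) && (m_offset < reduction_size);

      float acc = 0.f;
      for (int step = 0; step < 4; ++step) {        // placeholder loop bound
        if (valid) {
          acc += 1.f;                               // placeholder per-item work
        }
        // Every work-item, valid or not, reaches the collective:
        item.barrier(sycl::access::fence_space::local_space);
        if (valid) {
          local_buf[item.get_local_linear_id()] = acc;  // placeholder store
        }
        item.barrier(sycl::access::fence_space::local_space);
      }
    }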

CC: @Pennycook CC: @xytintel @EikanWang @fengyuan14 @guangyey @jgong5

dvrogozh commented 1 month ago

@xytintel : Thank you for the quick update of https://github.com/intel/torch-xpu-ops/pull/940. The latest version does fix the issue on my side. See details in https://github.com/intel/torch-xpu-ops/pull/940#pullrequestreview-2357880796.

EikanWang commented 1 week ago

@dvrogozh , the issue should have been resolved on PyTorch main. I will close this issue; please feel free to reopen it if the issue is still there.

dvrogozh commented 1 week ago

Yes, this seems to have landed in pytorch main. Fine to close.