@retonym please have a look at this failure.
@retonym : any update on this issue?
@dvrogozh Does the application use multiple XPU devices (2-tile PVC)?
@fengyuan14: according to xpu-smi, no. See the info below. By the way, note that I executed the whole tests/models/ suite and only this test (tests/models/levit/test_modeling_levit.py) hangs.
$ ls /dev/dri/
by-path card0 card1 renderD128
$ cat /sys/class/drm/card0/device/device
0x2000 # that's the ASPEED GPU, not the Intel one
$ cat /sys/class/drm/card1/device/device
0x0bda
$ sudo xpu-smi discovery --dump -1 --json
{
"device_list": [
{
"amc_firmware_name": "AMC",
"amc_firmware_version": "6.5.0.0",
"core_clock_rate_mhz": "1550",
"device_id": 0,
"device_name": "Intel(R) Data Center GPU Max 1100",
"device_stepping": "B4",
"device_type": "GPU",
"driver_version": "I915_24.4.12_PSB_240603.18",
"drm_device": "/dev/dri/card1",
"gfx_data_firmware_name": "GFX_DATA",
"gfx_data_firmware_version": "",
"gfx_firmware_name": "GFX",
"gfx_firmware_status": "normal",
"gfx_firmware_version": "PVC2_1.23166",
"gfx_pscbin_firmware_name": "GFX_PSCBIN",
"gfx_pscbin_firmware_version": "0x12d21.0x20220830",
"kernel_version": "5.15.0-86-generic",
"max_command_queue_priority": "0",
"max_hardware_contexts": "65536",
"max_mem_alloc_size_byte": "48946688000",
"memory_ecc_state": "enabled",
"number_of_eus": "448",
"number_of_eus_per_sub_slice": "8",
"number_of_media_engines": "0",
"number_of_media_enh_engines": "0",
"number_of_slices": "1",
"number_of_sub_slices_per_slice": "56",
"number_of_threads_per_eu": "8",
"number_of_tiles": "1",
"pci_bdf_address": "0000:21:00.0",
"pci_device_id": "0xbda",
"pci_slot": "CPU0_PE2_PCIE0",
"pci_vendor_id": "0x8086",
"pcie_generation": "5",
"pcie_max_link_width": "16",
"physical_eu_simd_width": "16",
"serial_number": "WTP232000351",
"sku_type": "Production ES",
"uuid": "00000000-0000-0000-fca0-22a7892457c1",
"vendor_name": "Intel(R) Corporation",
"xe_link_calibration_date": "Not Calibrated"
}
]
}
I retested on https://github.com/pytorch/pytorch/commit/5a29a06aa3c2dbb3afef730fbf9e7c0e2f308bf7 and probably made a mistake in the initial try-out. As of now I see that the hang is associated with a single test, not 2 tests as I initially thought. So, this hangs:
TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest \
tests/models/levit/test_modeling_levit.py::LevitModelTest::test_retain_grad_hidden_states_attentions
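(For context: TRANSFORMERS_TEST_DEVICE_SPEC points the transformers test suite at a small device-spec file. The sketch below only illustrates what a minimal XPU spec.py typically contains, following the DEVICE_NAME / MANUAL_SEED_FN / EMPTY_CACHE_FN / DEVICE_COUNT_FN convention from transformers' testing utilities; it is not necessarily the exact file used here.)
import torch

# Required: the torch device string the test suite should target.
DEVICE_NAME = "xpu"

# Optional backend hooks picked up by transformers' testing utilities.
MANUAL_SEED_FN = torch.xpu.manual_seed
EMPTY_CACHE_FN = torch.xpu.empty_cache
DEVICE_COUNT_FN = torch.xpu.device_count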
I also created a shorter reproducer by extracting the logic from the above test. It turns out that the hang is associated with the call to outputs[0].flatten()[0].backward(retain_graph=True). With that call commented out I do not see a hang.
$ cat repro.py
import random
import torch
from transformers import LevitConfig, LevitModel
config = LevitConfig(
    image_size=64,
    num_channels=3,
    kernel_size=3,
    stride=2,
    padding=1,
    patch_size=16,
    hidden_sizes=[16, 32, 48],
    num_attention_heads=[1, 2, 3],
    depths=[2, 3, 4],
    key_dim=[8, 8, 8],
    drop_path_rate=0,
    mlp_ratio=[2, 2, 2],
    attention_ratio=[2, 2, 2],
    initializer_range=0.02,
    down_ops=[['Subsample', 8, 2, 4, 2, 2], ['Subsample', 8, 4, 4, 2, 2]])
model = LevitModel(config).to("xpu")
def floats_tensor(shape, scale=1.0, rng=None, name=None):
"""Creates a random float32 tensor"""
if rng is None:
rng = random.Random()
total_dims = 1
for dim in shape:
total_dims *= dim
values = []
for _ in range(total_dims):
values.append(rng.random() * scale)
return torch.tensor(data=values, dtype=torch.float, device="xpu").view(shape).contiguous()
inputs = {"pixel_values": floats_tensor([13, 3, 64, 64])}
outputs = model(**inputs, output_hidden_states=True)
last_hidden_states = outputs.last_hidden_state
print(list(last_hidden_states.shape))
hidden_states = outputs.hidden_states[0]
hidden_states.retain_grad()
print('>>> before backward()')
outputs[0].flatten()[0].backward(retain_graph=True)
print('>>> after backward()')
Output will be:
$ python3 repro.py
[13, 1, 48]
>>> before backward()
>>> after backward()
^C
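Note that '>>> after backward()' is still printed: kernel submission on XPU is asynchronous, so the Python call returns and the process only blocks later, when the runtime waits for outstanding work at exit. A small tweak to the tail of repro.py (a sketch, assuming a PyTorch build where torch.xpu.synchronize() is available) makes the hang surface right at the offending call instead of at interpreter exit:
print('>>> before backward()')
outputs[0].flatten()[0].backward(retain_graph=True)
# Wait for all queued XPU kernels; if one of them never completes,
# the script now blocks here rather than at interpreter exit.
torch.xpu.synchronize()
print('>>> after backward()')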
Using https://github.com/intel/pti-gpu/tree/master/tools/unitrace with unitrace -c python3 repro.py, the last kernel enqueued is shown below. It's conv_reorder. However, it's not clear from this trace whether that is the kernel which actually hangs on the GPU, due to potential batching of submissions. Unfortunately, MakeEachEnqueueBlocking=1 NEOReadDebugKeys=1 might not currently be working to guarantee that each enqueue is blocking.
...
>>>> [1224250728124776] zeCommandListAppendLaunchKernel: hCommandList = 94519027971072 hKernel = 23247354279296 (conv_reorder) pLaunchFuncArgs = 23247486577624 {2, 9, 1} hSignalEvent = 23252087784416 numWaitEvents = 1 phWaitEvents = 23247198780800 (hWaitEvents = [23252080760416])
<<<< [1224250728128223] zeCommandListAppendLaunchKernel [1919 ns] hWaitEvents = 23252080760416 -> ZE_RESULT_SUCCESS(0x0)
Using OverrideImmediateCmdListSynchronousMode=1, which seems to do the trick of executing kernels one by one, points to the hang occurring in the at::native::xpu::BatchNormBackwardReduceChannelsLastKernelFunctor<4, float, float, float> kernel. I intentionally kept more of the log before the call to show what the printout looks like for a kernel which did finish execution.
Kernel is defined here: https://github.com/intel/torch-xpu-ops/blob/0ab67fbe746a5e9d80ffabcabf98e649d9504feb/src/ATen/native/xpu/sycl/BatchNormKernels.cpp#L1891
$ OverrideImmediateCmdListSynchronousMode=1 NEOReadDebugKeys=1 /home/dvrogozh/git/pti-gpu/tools/unitrace/_build/unitrace -c --demangle python3 repro.py
...
>>>> [1236133338262740] zeCommandListAppendLaunchKernel: hCommandList = 94458986970624 hKernel = 94459049017408 (at::native::xpu::VectorizedElementwiseKernel<4, at::native::xpu::FillFunctor<float>, at::detail::Array<char*, 1>, TrivialOffsetCalculator<0, unsigned int> >) pLaunchFuncArgs = 23379150699800 {1, 1, 1} hSignalEvent = 94459062284464 numWaitEvents = 1 phWaitEvents = 23383760654336 (hWaitEvents = [94459070029088])
<<<< [1236133338275392] zeCommandListAppendLaunchKernel [7094 ns] hWaitEvents = 94459070029088 -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338297944] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 0 argSize = 8 pArgValue = 23383759577984
<<<< [1236133338301713] zeKernelSetArgumentValue [1163 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338303325] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 1 argSize = 8 pArgValue = 23383759577992
<<<< [1236133338306550] zeKernelSetArgumentValue [177 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338308761] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 2 argSize = 8 pArgValue = 23383759578000
<<<< [1236133338310272] zeKernelSetArgumentValue [191 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338314476] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 3 argSize = 8 pArgValue = 23383759578008
<<<< [1236133338316658] zeKernelSetArgumentValue [186 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338317970] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 4 argSize = 8 pArgValue = 23383759578016
<<<< [1236133338319482] zeKernelSetArgumentValue [173 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338320919] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 5 argSize = 8 pArgValue = 23383759578024
<<<< [1236133338322373] zeKernelSetArgumentValue [151 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338323929] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 6 argSize = 8 pArgValue = 23383759578032
<<<< [1236133338327657] zeKernelSetArgumentValue [174 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338329671] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 7 argSize = 8 pArgValue = 23383759578040
<<<< [1236133338331629] zeKernelSetArgumentValue [118 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338332839] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 8 argSize = 8 pArgValue = 0
<<<< [1236133338336592] zeKernelSetArgumentValue [44 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338339049] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 9 argSize = 8 pArgValue = 0
<<<< [1236133338340095] zeKernelSetArgumentValue [36 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338341239] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 10 argSize = 4 pArgValue = 23383759578064
<<<< [1236133338343418] zeKernelSetArgumentValue [87 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338344657] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 11 argSize = 4 pArgValue = 23383759578068
<<<< [1236133338348571] zeKernelSetArgumentValue [52 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338350530] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 12 argSize = 1024 pArgValue = 0
<<<< [1236133338353992] zeKernelSetArgumentValue [102 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338356002] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 13 argSize = 1024 pArgValue = 0
<<<< [1236133338357156] zeKernelSetArgumentValue [48 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338359171] zeKernelSetArgumentValue: hKernel = 23383748151184 argIndex = 14 argSize = 1 pArgValue = 0
<<<< [1236133338362837] zeKernelSetArgumentValue [73 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338365218] zeKernelSetGroupSize: hKernel = 23383748151184 groupSizeX = 64 groupSizeY = 4 groupSizeZ = 1
<<<< [1236133338367996] zeKernelSetGroupSize [666 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338370307] zeEventHostReset: hEvent = 94459044917888
<<<< [1236133338371547] zeEventHostReset [127 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1236133338374830] zeCommandListAppendLaunchKernel: hCommandList = 94458986970624 hKernel = 23383748151184 (at::native::xpu::BatchNormBackwardReduceChannelsLastKernelFunctor<4, float, float, float>) pLaunchFuncArgs = 23379150705384 {3, 1, 1} hSignalEvent = 94459044917888 numWaitEvents = 1 phWaitEvents = 23383759618608 (hWaitEvents = [94459062284464])
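Since the last kernel appended before the hang is the channels-last batch-norm backward reduction, it might be possible to narrow the reproducer further by driving batch_norm backward directly on a channels-last XPU tensor, with no transformers involved. The sketch below only illustrates that idea; the shapes and the BatchNorm2d module are arbitrary, and it is not verified to trigger the same hang:
import torch

# Try to exercise the channels-last batch-norm backward path directly on XPU.
x = torch.randn(13, 16, 32, 32, device="xpu", requires_grad=True)
x = x.to(memory_format=torch.channels_last)
bn = torch.nn.BatchNorm2d(16).to("xpu")
out = bn(x)
out.sum().backward()
torch.xpu.synchronize()  # block until the backward kernels actually finish
print("batch_norm backward completed")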
@xytintel: could you please take a look at the hang after launching at::native::xpu::BatchNormBackwardReduceChannelsLastKernelFunctor<4, float, float, float>?
This PR did not fix the issue on my side. This might be due to a different GPU: for me that's a Max 1100, for @xytintel that's a Max 1500.
Gradually commenting out the kernel code, I isolated the hang to this item.barrier() call:
@xytintel: On my side the hang is associated with this early exit from the kernel: https://github.com/intel/torch-xpu-ops/blob/151a19e10bea64a3484bd36ae70476c5c5e289ed/src/ATen/native/xpu/sycl/BatchNormKernels.cpp#L1913
If a synchronization barrier followed by a return (see below) is added before this check, there is no hang. If it is added afterwards, there is a hang.
// item.barrier(sycl_local_fence); return; // NO HANG
if (c_offset >= stride_ || m_offset >= reduction_size_) {
  return;
}
item.barrier(sycl_local_fence); return; // HANG
Interestingly, the CUDA kernel is basically the same: it has the same check and similar synchronization code. I wonder, is there a difference in how synchronization works between CUDA and SYCL? Does CUDA synchronize only the remaining (non-exited) threads, while SYCL synchronizes all initially scheduled threads, so that we get a hang because some of them exited early?
Ok, I looked into the documentation and had a few offline discussions. Summary below.

The CUDA __syncthreads() barrier does not consider exited threads, i.e. early exit is allowed. See here: _"Starting with Volta, the CUDA built-in __syncthreads() and PTX instruction bar.sync (and their derivatives) are enforced per thread and thus will not succeed until reached by all non-exited threads in the block"_. SYCL's nd_item::barrier(), on the other hand, strictly requires all scheduled threads to arrive at the barrier, otherwise there will be a hang. Some fixes to address this are being considered, but they require HW changes which don't exist on PVC. Bottom line: we need changes in the kernel or in the thread scheduling logic to address this on Arc/PVC in any case. On some later platforms nd_item::barrier() might just work, and this case might then be handled with diverged (per-platform) kernel logic.

From the above, I think there are only 2 choices to resolve the issue: change the thread scheduling logic, or restructure the kernel so that no thread returns before the barrier, i.e.:

if (condition) { ... }  // work before the barrier, guarded instead of returning early
item.barrier()          // every scheduled thread reaches the barrier
if (condition) { ... }  // work after the barrier, guarded by the same condition
CC: @Pennycook CC: @xytintel @EikanWang @fengyuan14 @guangyey @jgong5
@xytintel: Thank you for the quick update of https://github.com/intel/torch-xpu-ops/pull/940. The latest version does fix the issue on my side. See details in https://github.com/intel/torch-xpu-ops/pull/940#pullrequestreview-2357880796.
@dvrogozh, the issue should now be resolved on PyTorch main. I will close this issue; please feel free to reopen it if the problem is still there.
Yes, this seems to have landed in pytorch main. Fine to close.
With:
The following 2 tests (important: run them together!) hang on exit when executed on Intel Data Center GPU Max Series (PVC):
Observations:
For the reference, my spec.py file below:
CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @EikanWang @fengyuan14 @guangyey