Two hardware limits cap how many work-groups can run concurrently on a sub-slice. There are 16 barrier registers per sub-slice, so no more than 16 work-groups can be executed simultaneously. The second limit is the amount of shared local memory available per sub-slice (64KB). If, for example, a work-group requires 32KB of shared local memory, only 2 of those work-groups can run concurrently, regardless of work-group size.
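The two limits above can be combined into a small calculation. The following is a minimal sketch rather than an official formula: the constants come from the text (16 barrier registers and 64KB of SLM per sub-slice), the helper name `max_concurrent_work_groups` is hypothetical, and it assumes that every work-group uses a barrier and that no other resource (such as hardware threads) is the tighter limit.

```cpp
#include <algorithm>
#include <cstddef>

// Per-sub-slice limits, taken from the text above.
constexpr std::size_t kBarrierRegisters = 16;            // one per resident work-group
constexpr std::size_t kSlmBytesPerSubSlice = 64 * 1024;  // 64KB of shared local memory

// Hypothetical helper: upper bound on work-groups resident on one sub-slice,
// given how many bytes of SLM each work-group requests.
constexpr std::size_t max_concurrent_work_groups(std::size_t slm_bytes_per_group) {
    // A work-group that uses no SLM is limited only by the barrier registers.
    if (slm_bytes_per_group == 0) return kBarrierRegisters;
    return std::min(kBarrierRegisters, kSlmBytesPerSubSlice / slm_bytes_per_group);
}

// The example from the text: 32KB per work-group leaves room for only 2.
static_assert(max_concurrent_work_groups(32 * 1024) == 2, "SLM-limited case");
// A small SLM request is capped by the 16 barrier registers instead.
static_assert(max_concurrent_work_groups(1024) == 16, "barrier-limited case");
```

The checks run at compile time, so the file only needs to be compiled (C++14 or later) to confirm both cases.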
Intel GPUs Arch
general-purpose registers
Architecture diagrams
DG1 Gen12.1 GPU
Arc GPU Gen 12.7
Intel GPUs Gen
Architecture hierarchy
Slice
SubSlice
EU
ALU
Total Threads = SubSlices x EUs per SubSlice x Threads per EU
Total Operations = Total Threads x SIMD Width
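As a worked instance of the two formulas, here is a hedged sketch. The struct `GpuConfig`, its field names, and both function names are hypothetical. The sample numbers (6 sub-slices, 16 EUs per sub-slice, 7 threads per EU, SIMD width 8) are assumed values for an Iris Xe-class integrated GPU, used for illustration only; substitute the figures for your own device.

```cpp
#include <cstdint>

// Hypothetical description of a GPU; every field is an assumption for illustration.
struct GpuConfig {
    std::uint32_t sub_slices;         // number of sub-slices on the device
    std::uint32_t eus_per_sub_slice;  // execution units in each sub-slice
    std::uint32_t threads_per_eu;     // hardware threads in each EU
    std::uint32_t simd_width;         // lanes processed per thread
};

// Total Threads = SubSlices x EUs x Threads
constexpr std::uint32_t total_threads(const GpuConfig& g) {
    return g.sub_slices * g.eus_per_sub_slice * g.threads_per_eu;
}

// Total Operations = Total Threads x SIMD Width
constexpr std::uint32_t total_operations(const GpuConfig& g) {
    return total_threads(g) * g.simd_width;
}

// Assumed sample values, not measured or queried from hardware.
constexpr GpuConfig kAssumedConfig{6, 16, 7, 8};
static_assert(total_threads(kAssumedConfig) == 672, "6 x 16 x 7");
static_assert(total_operations(kAssumedConfig) == 5376, "672 x 8");
```

With these assumed values the device would hold 672 hardware threads and issue up to 5376 operations at once, which gives a rough target for how many work-items are needed to fill it.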
Parameters of different GPUs
Memory
SLM
Shared Local Memory
q.get_device().get_info<sycl::info::device::local_mem_size>()
Bank conflicts: hardware limitations cause conflicting accesses to be serialized
SLM characteristics
Instruction Cache
Local Thread Dispatcher
Read-Only Texture/Image Sampler
- 64B/Cycle
Dataport
- 64B/Cycle
work-items = EUs x Threads x SIMD Width
Atomic data is shared among work-items
work-groups <= 16
[[NDRange]] Mapping to iGPU
work-items map to operations
sub-groups map to threads
work-groups are limited by hardware (on Iris Xe, at most 16 per subslice and at most 512 work-items)
Reference