microsoft / tensorflow-directml

Fork of TensorFlow accelerated by DirectML
Apache License 2.0

Support tiled resources in primary DML allocator #314

Closed jstoecker closed 3 years ago

jstoecker commented 3 years ago

Ideally this change would be split into smaller ones, so I apologize in advance for the large size.

The major changes:

  1. The primary DML suballocator (D3D12HeapAllocator) now supports allocations that are backed by multiple heaps (tiling) to better utilize local video memory and avoid demotion to non-local memory. The previous behavior of using contiguous heaps and placed resources is retained as a fallback for hardware that doesn't support tiled buffers.

  2. The D3D12HeapAllocator no longer maintains a pool of resources for each allocation; instead, it has three otherwise identical resources that remain in fixed states: UAV, COPY_SRC, and COPY_DST. This change is important for change No. 1 because while placed resources are cheap, reserved resources can consume significant memory with their unique page tables. Consequently, D3D12BufferRegions no longer take ownership of pooled resources and are now lightweight wrappers that help manage access to memory regions without kernels having to deal with resource state transitions directly.

  3. Significant refactoring of the DML kernel context to move various helpers to a single interface (DmlDeviceContext). We previously had kernels calling into DmlKernelContext, DmlKernelConstruction, dml_util, DmlExecutionContext, DmlDeviceContext, and the upload/readback pooled allocators. This made it extremely tedious to refactor any code involving allocations or copies without touching dozens of files. Going forward, kernel code should go straight to the DmlDeviceContext for anything related to copies or allocation. It's important to transition most code away from using raw ID3D12Resource objects and offsets and toward using D3D12BufferRegion now that resources aren't pooled: with raw handles it would be easy to use the wrong ID3D12Resource handle wrapped by this helper.

  4. Memory growth is turned on by default for all adapters, not just UMA adapters. It's possible to go back to the GPU device policy by setting the TF_GPU_FORCE_ALLOW_GROWTH environment variable to "false". Memory growth isn't crucial now with reserved resources, but TFDML is most likely used for local training where we shouldn't assume TF deserves all the memory.
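To make change No. 1 concrete, here is a minimal Python sketch of how a single allocation might be split across several backing heaps when tiling is available, with one contiguous heap as the fallback. It is a model of the idea only, not the D3D12HeapAllocator code: `HeapMapping`, `plan_allocation`, and the `max_heap_bytes` parameter are hypothetical names, and the real allocator may size its heaps differently. The 64 KiB tile size is the one D3D12 uses for reserved (tiled) resources.

```python
from dataclasses import dataclass
from typing import List

# D3D12 reserved (tiled) resources are mapped in 64 KiB tiles.
TILE_SIZE = 64 * 1024


@dataclass(frozen=True)
class HeapMapping:
    """Hypothetical record: a run of tiles backed by one heap."""
    heap_index: int  # which backing heap holds these tiles
    first_tile: int  # first tile of the allocation covered by this heap
    tile_count: int  # number of consecutive tiles mapped from this heap


def plan_allocation(size_bytes: int, max_heap_bytes: int,
                    tiling_supported: bool) -> List[HeapMapping]:
    """Plan which heaps back an allocation (illustrative assumption only)."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")

    # Round the allocation up to a whole number of tiles.
    tiles = -(-size_bytes // TILE_SIZE)

    if not tiling_supported:
        # Fallback path: one contiguous heap covers the whole allocation.
        return [HeapMapping(heap_index=0, first_tile=0, tile_count=tiles)]

    tiles_per_heap = max_heap_bytes // TILE_SIZE
    if tiles_per_heap < 1:
        raise ValueError("max_heap_bytes must hold at least one tile")

    # Tiled path: cover the allocation with as many smaller heaps as needed.
    mappings: List[HeapMapping] = []
    first_tile = 0
    while first_tile < tiles:
        count = min(tiles_per_heap, tiles - first_tile)
        mappings.append(HeapMapping(len(mappings), first_tile, count))
        first_tile += count
    return mappings


MIB = 1024 * 1024
tiled = plan_allocation(200 * MIB, max_heap_bytes=64 * MIB, tiling_supported=True)
contiguous = plan_allocation(200 * MIB, max_heap_bytes=64 * MIB, tiling_supported=False)
```

In this model the tiled plan needs four smaller heaps while the fallback needs a single 200 MiB heap. Smaller heaps are generally easier to keep resident in local video memory, which is the motivation given for change No. 1.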

A bit more context for change No. 2. Recall that the BFC allocator carves up contiguous allocations and ensures distinct tensors live in physically separate chunks. However, the D3D API exposes GPU memory primarily through resources, not virtual addresses, so we must have resource objects to bind memory when executing operators. Creating resources on the fly is expensive, but caching resources is difficult since chunks are continually merged and split: the offsets and sizes of memory within an allocation/heap will vary significantly from resource to resource. The resource pool solved this by having all resources span the entirety of a heap and relying on offsets during binding. Multiple resources (all of which span the same region of heap memory) allowed callers to get unique instances for the purpose of resource barriers.
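The new arrangement in change No. 2 can be modeled roughly as follows. This Python sketch is an assumption-laden illustration, not the actual D3D12BufferRegion interface: `State`, `Resource`, `Allocation`, `BufferRegion`, and `resource_in` are hypothetical names. It shows three resources per allocation, each spanning the whole allocation and each staying in one fixed state, with regions acting as lightweight (offset, size) views that pick the resource matching the access they need.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class State(Enum):
    """The three fixed states named in the PR description."""
    UAV = "UNORDERED_ACCESS"
    COPY_SRC = "COPY_SOURCE"
    COPY_DST = "COPY_DEST"


@dataclass(frozen=True)
class Resource:
    """Stand-in for a resource that spans the whole allocation."""
    state: State
    size_bytes: int


class Allocation:
    """Owns one resource per fixed state; all alias the same memory."""

    def __init__(self, size_bytes: int) -> None:
        self.size_bytes = size_bytes
        self.resources: Dict[State, Resource] = {
            state: Resource(state, size_bytes) for state in State
        }


@dataclass(frozen=True)
class BufferRegion:
    """Lightweight view of part of an allocation; owns no resources."""
    allocation: Allocation
    offset: int
    size_bytes: int

    def __post_init__(self) -> None:
        end = self.offset + self.size_bytes
        if self.offset < 0 or self.size_bytes <= 0 or end > self.allocation.size_bytes:
            raise ValueError("region must lie inside its allocation")

    def resource_in(self, state: State) -> Resource:
        # No state transition happens here: the caller simply receives the
        # resource that already sits in the requested state and binds it
        # at self.offset.
        return self.allocation.resources[state]


allocation = Allocation(1024)
tensor_a = BufferRegion(allocation, offset=0, size_bytes=256)
tensor_b = BufferRegion(allocation, offset=256, size_bytes=128)
```

Here `tensor_a` and `tensor_b` share the same three underlying resources and differ only in offset and size, so the number of resources no longer grows with the number of live regions.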

NOTE: this change will require some changes to DML debug layer validation (simultaneous bindings of same resource), which isn't enabled in TF by default.