Open yuslepukhin opened 2 weeks ago
How about the intermediate memory usage (workspace) for each node? That is usually unknown during partitioning, and even unknown during inference since op has no interface to tell its workspace size right now. For example, MultiHeadAttention op might call different cuda kernels (flash attention, cutlass fmha, tensorrt fmha kernel or unfused kernel), each has different memory consumption.
How about the intermediate memory usage (workspace) for each node? That is usually unknown during partitioning, and even unknown during inference since op has no interface to tell its workspace size right now.
This is true. The function currently accounts for initializers and inputs. It cannot account for temporary allocations because those are made at inference time, and partitioning takes place well before kernels are instantiated.
The approach of computing memory patterns cannot be taken here, since it relies on having a runnable model, which we do not have today in a constrained environment.
This PR is still at the experimental stage. I envision that most of the burden would be placed on the individual EPs' GetCapability,
since EPs are in the best position to know their constraints, which may not be limited to memory; additional considerations can also be accounted for in the Resource Count.
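To make the idea concrete, here is a minimal sketch of the kind of accounting a GetCapability pass could do, written in Python against the `onnx` package rather than the actual C++ EP code. The function name, the initializer-only byte counting, and the budget value are all assumptions for illustration, not what the PR implements.

```python
import onnx
from onnx import numpy_helper

def nodes_within_budget(model_path: str, budget_bytes: int):
    """Greedily pick nodes whose initializer inputs fit into a memory budget.

    This mirrors the idea of stopping node assignment once a limit is hit;
    it only counts initializer sizes, not activations or kernel workspace.
    """
    model = onnx.load(model_path)
    init_size = {
        init.name: numpy_helper.to_array(init).nbytes
        for init in model.graph.initializer
    }

    selected, used = [], 0
    for node in model.graph.node:
        node_bytes = sum(init_size.get(name, 0) for name in node.input)
        if used + node_bytes > budget_bytes:
            break  # stop assigning nodes once the budget is exhausted
        used += node_bytes
        selected.append(node.name or node.op_type)
    return selected, used

# Example: keep nodes under a 2 GiB budget (the value is an arbitrary assumption).
# names, used = nodes_within_budget("model.onnx", 2 * 1024**3)
```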
The simplest way is to add another if/else to enumerate the kernels and attempt to infer the amount of temporary space. However, that creates an additional maintenance burden: we already have plenty of such places in optimizers and elsewhere where we need to make sure that changes to individual kernels are reflected.
However, it would still work in its current form. One can try one setting and then lower it if the consumption is too much.
Another idea would be to run the model beforehand and record the consumption. Then use that trace to set the limit in the constrained environment.
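As a sketch of that pre-run recording, assuming `pynvml` for device memory queries and a caller-prepared `feeds` dict (neither is part of this PR):

```python
import onnxruntime as ort
import pynvml

def record_gpu_consumption(model_path: str, feeds: dict, device_index: int = 0) -> int:
    """Run the model once on CUDA and report the extra device memory it used.

    'feeds' is an input-name -> numpy array dict prepared by the caller.
    The number is an approximation: the CUDA arena may round allocations up,
    and other processes sharing the GPU will skew the reading.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    before = pynvml.nvmlDeviceGetMemoryInfo(handle).used

    sess = ort.InferenceSession(
        model_path,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    sess.run(None, feeds)  # warm-up run so kernels and workspaces get allocated

    after = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    pynvml.nvmlShutdown()
    return after - before

# The recorded value can then serve as a starting point for the limit
# configured in the constrained environment.
```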
The function currently accounts for initializers and inputs. It cannot account for temporary allocations
If so, I think the feature is not very helpful for vision models or LLMs due to these limitations.
Another idea would be to run the model beforehand and record the consumption. Then use that trace to set the limit in the constrained environment.
That's a good idea, and it would be great to support that use case.
BTW, a general way to help with capacity constraints is to provide a way to manually configure the location of initializers and inputs. This could be extended to support offloading initializers to CPU and loading them onto the GPU only when needed.
Description
Allow users to specify per-EP resource constraints. This PR demonstrates how this can be done for CUDA with a memory constraint. In this implementation, we stop assigning nodes to CUDA once we reach the specified memory limit. However, there is a provision for the EP to do it automatically.
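For illustration only, a rough sketch of what the user-facing side might look like from Python. The provider-option key `partitioning_mem_limit` is purely hypothetical, since the PR is experimental and does not fix a public option name; only the general providers-with-options form is existing ONNX Runtime API, so this snippet will not work against a released build.

```python
import onnxruntime as ort

# "partitioning_mem_limit" is a placeholder name for the per-EP resource
# constraint described in this PR; the final option name is not defined here.
cuda_options = {
    "device_id": 0,
    "partitioning_mem_limit": 4 * 1024**3,  # stop assigning nodes past ~4 GiB
}

sess = ort.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
# Nodes that did not fit under the limit fall back to the CPU EP.
```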
Motivation and Context
We want to allow models to run in constrained environments.