Closed wdamon-intel closed 9 months ago
Can we add to the Notes that a command list that is currently running may be cloned?
Added to the Notes
Question: is a clone itself cloneable? I assume yes, but do we need this clarification? Or perhaps the cloned command list inherits all properties of the cloneable one.
Updated the Notes section to state specifically that the cloned command list inherits all the properties of the source command list.
Will all (or most) of the kernels of a compute graph be in one command list? If so, and we only need to change one argument of one kernel, what is the overhead of cloning a command list that contains many kernels?
@guoyejun - it depends on the size of the command list: the number of kernels in it, the types of kernels (number of arguments per kernel, whether kernels have special properties), and the internal heaps and command buffers it consumes. It also depends on how the application wants to perform its computations. You can certainly use a single command list and mutate arguments, but the application must wait for execution to complete before updating or executing the command list again. If you would like to execute the same algorithm with different inputs in parallel, then cloning might be a viable solution, done once upfront.
Let's focus on the clone method, given the forced synchronization in the mutate method.
We expect many kernels in a command list (hundreds today, trending toward thousands).
Could you share which parts will be cloned? For example, are the GPU commands copied, are the internal heaps newly allocated, and so on? I want to get a basic impression of whether the overhead is small or large.
One possible use case is that only the input buffer (USM) changes for the first kernel in the command list. In the extreme case where the pointer changes at every iteration, will we easily run into device memory OOM because of the cloned command lists? I understand it depends on many factors, but is there a very rough estimate?
A clone should have its own heap and command buffer allocations, so the cost of cloning should be proportional to the cost of creating and recording the command list. The driver is not required to re-encode all the commands and indirect data and can copy them instead, but the expectation is still that cloning time is proportional to creation and recording time.
It also depends on how much is requested to be mutable. For immutable parts, we can reuse the parent command list's read-only structures and heaps. That is why we have the clause that the parent command list must still exist for clones to work.
It's nice that we can reuse the immutable parts.
Just curious how we can reuse them. As I understand it, the command buffer is one batch buffer, so how can the cloned batch buffer reuse something in the original batch buffer?
The command buffer is only one of the resources; there are others, such as heaps that store constants. If a kernel is immutable, we may use the same heap as the parent command list, so there is no need to allocate a new one.
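To make the sharing described above concrete, here is a toy model in C. It is only a sketch: `toy_cmdlist`, `toy_create`, `toy_clone`, and `toy_destroy` are hypothetical names invented for illustration and do not mirror any real driver structure. It assumes the clone always gets a private copy of the command buffer, and borrows the parent's constant heap only when nothing in the list was requested to be mutable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: hypothetical types, not a real driver layout. */
typedef struct {
    unsigned char *cmd_buffer;     /* encoded commands, private to each list */
    size_t         cmd_size;
    unsigned char *const_heap;     /* kernel constants / indirect data */
    size_t         heap_size;
    int            heap_is_shared; /* 1 if const_heap is borrowed from the parent */
} toy_cmdlist;

/* "Record" a list from raw bytes (sizes assumed non-zero). Returns 0 on success. */
int toy_create(toy_cmdlist *cl, const void *cmds, size_t cmd_size,
               const void *heap, size_t heap_size)
{
    assert(cl && cmds && heap);
    cl->cmd_buffer = malloc(cmd_size);
    cl->const_heap = malloc(heap_size);
    if (!cl->cmd_buffer || !cl->const_heap) {
        free(cl->cmd_buffer);
        free(cl->const_heap);
        return -1;
    }
    memcpy(cl->cmd_buffer, cmds, cmd_size);
    memcpy(cl->const_heap, heap, heap_size);
    cl->cmd_size = cmd_size;
    cl->heap_size = heap_size;
    cl->heap_is_shared = 0;
    return 0;
}

/* Clone: the command buffer is copied (not re-encoded); the constant heap
 * is copied only if the kernels are mutable, otherwise it is shared. */
int toy_clone(const toy_cmdlist *parent, int kernels_mutable, toy_cmdlist *clone)
{
    assert(parent && clone);
    clone->cmd_buffer = malloc(parent->cmd_size);
    if (!clone->cmd_buffer)
        return -1;
    memcpy(clone->cmd_buffer, parent->cmd_buffer, parent->cmd_size);
    clone->cmd_size = parent->cmd_size;
    clone->heap_size = parent->heap_size;

    if (kernels_mutable) {
        clone->const_heap = malloc(parent->heap_size);
        if (!clone->const_heap) {
            free(clone->cmd_buffer);
            return -1;
        }
        memcpy(clone->const_heap, parent->const_heap, parent->heap_size);
        clone->heap_is_shared = 0;
    } else {
        /* Borrowed pointer: the parent must outlive this clone. */
        clone->const_heap = parent->const_heap;
        clone->heap_is_shared = 1;
    }
    return 0;
}

/* Destroy a list; a borrowed heap is left for the parent to free. */
void toy_destroy(toy_cmdlist *cl)
{
    free(cl->cmd_buffer);
    if (!cl->heap_is_shared)
        free(cl->const_heap);
    cl->cmd_buffer = NULL;
    cl->const_heap = NULL;
}
```

In this model, destroying the parent before an immutable clone would leave the clone with a dangling heap pointer, which is one way to read the requirement that the parent command list must still exist for clones to work.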
Summary
Introduce an experimental extension to clone a command list.
Details
Motivation
In order to efficiently support certain use-cases, such as compute graphs, it is desirable to have the ability to clone a command list that has been closed.
Interoperability with Other Extensions
Proposed API
New Flags
ZE_COMMAND_LIST_FLAG_EXP_CLONEABLE
New Functions
zeCommandListCreateCloneExp
Creates a command list as the clone of another command list.
Notes
hCommandList parameter must be [...] ZE_COMMAND_LIST_FLAG_EXP_CLONEABLE flag, and [...]
Example Usage
Creating a Cloneable Command List
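A minimal sketch of how this might look in C against the Level Zero headers. It assumes `zeCommandListCreateCloneExp` takes the source handle and returns the clone through an output pointer (the signature is not shown in this issue), that `hContext` and `hDevice` come from the usual initialization sequence, and that command queue group ordinal 0 is a compute-capable group. The helper name `create_and_clone` is hypothetical.

```c
#include <stddef.h>
#include <level_zero/ze_api.h>

/* Creates a cloneable command list, closes it, and clones it.
 * hContext and hDevice are assumed to be valid handles obtained
 * elsewhere (zeInit, zeDriverGet, zeDeviceGet, zeContextCreate). */
ze_result_t create_and_clone(ze_context_handle_t hContext,
                             ze_device_handle_t hDevice,
                             ze_command_list_handle_t *phSource,
                             ze_command_list_handle_t *phClone)
{
    ze_command_list_desc_t desc = {0};
    desc.stype = ZE_STRUCTURE_TYPE_COMMAND_LIST_DESC;
    desc.pNext = NULL;
    desc.commandQueueGroupOrdinal = 0; /* assumption: ordinal 0 supports compute */
    desc.flags = ZE_COMMAND_LIST_FLAG_EXP_CLONEABLE;

    ze_result_t result = zeCommandListCreate(hContext, hDevice, &desc, phSource);
    if (result != ZE_RESULT_SUCCESS)
        return result;

    /* Append kernels and other commands to *phSource here. */

    /* The Motivation section describes cloning a list that has been closed. */
    result = zeCommandListClose(*phSource);
    if (result == ZE_RESULT_SUCCESS)
        /* Assumed signature: source handle in, clone handle out. */
        result = zeCommandListCreateCloneExp(*phSource, phClone);

    if (result != ZE_RESULT_SUCCESS) {
        zeCommandListDestroy(*phSource);
        *phSource = NULL;
    }
    return result;
}
```

Following the discussion above, a caller would destroy the clone before the source command list, since the clone may reuse the parent's read-only structures.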