jonathan-cohen-nvidia opened this issue 4 years ago
From my reading, these kernels are not "copying data to the GPU embedding tables". They appear to be embedding lookup kernels, an operation often expressed as a sparse_matrix * dense_vector product. Do I misunderstand?
If that is the case, I would not describe this as a "glorified memcpy", since it is the fundamental operation of the forward pass of the embedding layer. I'm also not sure that accelerator users usually replace the PyTorch lookup operation itself "on a per-implementation basis". Apologies if I have misread these kernels and they are not doing embedding lookups.
The reference model uses PyTorch's EmbeddingBag. The lookup case where all non-zero values are 1 is a common one for embeddings. Couldn't this optimization be submitted as a fast path in PyTorch (or whatever framework)?
Yes, sorry, you are correct. In the kernels above, the non-zero value is always 1, so the operation ends up being a copy; but yes, these are highly specialized sparse matrix-dense vector products.
"Glorified memcpy" is only slightly overstating things, though. These are very simple operations that are effectively copying some data from one format to another. Since it is possible to make all kinds of assumptions about the matrix layout, format, and contents AND these kernels will all run at bandwidth limits without much tuning, you end up wanting a custom-coded routine.
SWG Notes:
The question we are considering is whether a custom kernel for moving data is allowed under MLPerf rules. Extensive custom kernels seem not to follow the spirit of the rules, but when the kernels are simple (such as the ones above), they seem to be within the spirit. We need to "draw a line" indicating what is allowed and what isn't.
Kernels like this can have a significant impact on performance. There is a desire that, if we do work to improve performance, that work goes into a framework or is otherwise reusable by others.
We will take some time for people to go over this issue and discuss more next week.
Intel's position is that custom kernels are OK as long as they are upstreamed and available for others to use. This position applies to inference as well.
From my perspective, a custom kernel written in CUDA would be fine for inference.
SWG:
There is a line here that we can work on drawing in the future. We will allow custom kernels for DLRM embeddings for this round; in the future we will figure out how to handle custom kernels more generally.
SWG: This is a one-time decision, not a rules change.
(moved from https://github.com/mlperf/training/issues/373)
The kernel that copies data to the GPU embedding tables is effectively a glorified memcpy. However, depending on the layout, the hardware, the encoding, etc., it is exactly the sort of thing one would just write in CUDA on a per-implementation basis. There is little value in trying to put "generic" embedding table copy kernels into a library like APEX or a framework itself, since there is a combinatorial explosion of possible variants and they are (essentially) trivial to write. Many practitioners would therefore write their own.
So we propose that it should be explicitly allowed for a submission to use a custom kernel for embedding table copies (which would, of course, be provided in source code form along with the submission). Normally we prefer that optimizations go into publicly released libraries or frameworks, but we think an exception is warranted in this case.
Below are two examples of typical kernels for two different cases, illustrating both the simplicity of the kernels and the difficulty of doing this in a "general" way that would not require custom CUDA code.
Kernel for an effectively CSR format (number of non-zeros varies per input):
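A minimal sketch of what such a kernel might look like, assuming a row-major [num_rows, dim] float table, CSR-style offsets, and all non-zero values equal to 1; the kernel name, argument names, and launch configuration are illustrative, not the code from an actual submission:

```cuda
// One thread block per output sample; threads stride across the embedding
// dimension. Because every non-zero value is 1, the "SpMM" reduces to
// gathering and summing embedding rows.
__global__ void embedding_bag_csr_kernel(
    const float* __restrict__ table,    // [num_rows, dim] embedding table
    const int*   __restrict__ indices,  // CSR column indices
    const int*   __restrict__ offsets,  // CSR row offsets, length batch + 1
    float*       __restrict__ out,      // [batch, dim] output
    int dim)
{
    const int sample = blockIdx.x;
    const int start  = offsets[sample];
    const int end    = offsets[sample + 1];

    for (int d = threadIdx.x; d < dim; d += blockDim.x) {
        float acc = 0.0f;
        for (int j = start; j < end; ++j) {
            acc += table[(long long)indices[j] * dim + d];
        }
        out[(long long)sample * dim + d] = acc;
    }
}
```

A typical launch would use one block per sample, e.g. `embedding_bag_csr_kernel<<<batch, 128>>>(table, indices, offsets, out, dim);`.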
If the number of non-zeros is fixed, the following kernel is more efficient:
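A sketch under the same assumptions, but with the indices stored as a dense [batch, nnz_per_sample] matrix, so no offsets array is needed and the inner loop has a uniform trip count across the batch:

```cuda
// One thread block per output sample, as above. The fixed number of
// non-zeros per sample removes the CSR offsets indirection entirely.
__global__ void embedding_bag_fixed_kernel(
    const float* __restrict__ table,    // [num_rows, dim] embedding table
    const int*   __restrict__ indices,  // [batch, nnz_per_sample]
    float*       __restrict__ out,      // [batch, dim] output
    int dim,
    int nnz_per_sample)
{
    const int sample = blockIdx.x;
    const int* idx   = indices + (long long)sample * nnz_per_sample;

    for (int d = threadIdx.x; d < dim; d += blockDim.x) {
        float acc = 0.0f;
        for (int j = 0; j < nnz_per_sample; ++j) {
            acc += table[(long long)idx[j] * dim + d];
        }
        out[(long long)sample * dim + d] = acc;
    }
}
```

In the common DLRM-style case where nnz_per_sample is 1, the inner loop disappears and the kernel is literally a strided gather, which is where the "glorified memcpy" characterization comes from.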