mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

DLRM - custom embedding kernels should be allowed #291

Open jonathan-cohen-nvidia opened 4 years ago

jonathan-cohen-nvidia commented 4 years ago

(moved from https://github.com/mlperf/training/issues/373)

The kernel that copies data to the GPU embedding tables is effectively a glorified memcpy. However, depending on the layout, the hardware, the encoding, etc., it is exactly the sort of thing one would just write in CUDA on a per-implementation basis. There is little value in trying to put "generic" embedding-table copy kernels into a library like APEX or a framework itself, since there is a combinatorial explosion of possible variants and they are (essentially) trivial to write. Many practitioners would therefore write their own.

So we propose that it should be explicitly allowed for a submission to use a custom kernel for embedding table copies (this would of course be provided in source code form along with the submission). Normally we prefer that any optimizations go into publicly released libraries or frameworks, but we think an exception is warranted in this case.

Below are two examples of typical kernels for two different cases, illustrating both the simplicity of the kernels and the difficulty of doing this in some "general" way that wouldn't require custom CUDA code.

Kernel for an effectively-CSR format (the number of non-zeros varies per input row)

// CUDA kernel for a sparse embedding lookup where the non-zero values are always 1.
//   Thus, no multiplication is needed when reading the rows, and only the column and
//   row indices from the CSR format are used (i.e. no values array).
// Inputs:
//   - sparse matrix: input A, CSR format, arbitrarily hot (values are not used).
//   - dense matrix:  input B, dense.
//   - output matrix: output C, dense.
// Parallelization:
//   - 1D grid of 1D threadblocks
//   - threadblock per output row, thread per output element
//   - a thread reads all the inputs it needs to produce its output
template <typename Type>
__global__ void embedKernel(
    const int   num_sparse_rows,   // Arguments for sparse matrix A.
    const int*  __restrict__ g_sparse_col_idxs,
    const int*  __restrict__ g_sparse_row_idxs,
    const int   num_dense_rows,    // Arguments for dense matrix B.
    const int   num_dense_cols,
    const Type* __restrict__ g_dense_values,
    Type*       __restrict__ g_dense_output) {  // Dense matrix C.
  const int output_row_idx = blockIdx.x;
  const int row_start = g_sparse_row_idxs[output_row_idx];
  const int row_end   = g_sparse_row_idxs[output_row_idx + 1];

  Type output_value = Type(0);
  for (int i = row_start; i < row_end; i++) {
    const int  k = g_sparse_col_idxs[i];

    const Type dense_value = g_dense_values[k * num_dense_cols + threadIdx.x];
    output_value += dense_value;
  }
  g_dense_output[output_row_idx * num_dense_cols + threadIdx.x] = output_value;
} 
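For concreteness, a minimal host-side launch sketch for the kernel above (the buffer names, the float instantiation, and the stream argument are illustrative assumptions, not part of the proposal). Following the parallelization comment, the grid has one threadblock per output row and one thread per embedding column, which assumes the embedding dimension fits in a single threadblock (at most 1024 threads).

#include <cuda_runtime.h>

// Illustrative launch only (assumed names); all pointers are device pointers.
// Grid: one threadblock per output (batch) row. Block: one thread per embedding column.
void launchEmbedKernel(int num_sparse_rows, const int* d_sparse_col_idxs,
                       const int* d_sparse_row_idxs,  // CSR row pointers, length num_sparse_rows + 1
                       int num_dense_rows, int num_dense_cols,
                       const float* d_dense_values, float* d_dense_output,
                       cudaStream_t stream) {
  const dim3 grid(num_sparse_rows);
  const dim3 block(num_dense_cols);  // assumes num_dense_cols <= 1024
  embedKernel<float><<<grid, block, 0, stream>>>(
      num_sparse_rows, d_sparse_col_idxs, d_sparse_row_idxs,
      num_dense_rows, num_dense_cols, d_dense_values, d_dense_output);
}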

If the number of non-zeros per row is fixed, the following kernel is more efficient:

// CUDA kernel for a sparse embedding lookup with a fixed number of non-zeros per
//   row, where the non-zero values are always 1. A simplified version of embedKernel:
//   the row offsets can be computed from the threadblock ID, so no row-index array is needed.
template <typename Type>
__global__ void embedKernelFixedNnz(
    const int   num_sparse_rows,   // Arguments for sparse matrix A.
    const int*  __restrict__ g_sparse_col_idxs,
    const int   nnz_per_row,
    const int   num_dense_rows,    // Arguments for dense matrix B.
    const int   num_dense_cols,
    const Type* __restrict__ g_dense_values,
    Type*       __restrict__ g_dense_output) {  // Dense matrix C.
  const int output_row_idx = blockIdx.x;
  const int row_start = output_row_idx * nnz_per_row;

  Type output_value = Type(0);
  for (int i = 0; i < nnz_per_row; i++) {
    const int  k = g_sparse_col_idxs[row_start + i];

    const Type dense_value = g_dense_values[k * num_dense_cols + threadIdx.x];
    output_value += dense_value;
  }
  g_dense_output[output_row_idx * num_dense_cols + threadIdx.x] = output_value;
}
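As before, a hedged host-side launch sketch for the fixed-nnz variant (assumed names). The only host-side difference is that the CSR row-pointer array is replaced by the constant nnz_per_row, which is known ahead of the launch (e.g., 1 for one-hot features).

#include <cuda_runtime.h>

// Illustrative launch only (assumed names). No row-pointer array is needed;
// each output row reads exactly nnz_per_row column indices.
void launchEmbedKernelFixedNnz(int num_sparse_rows, const int* d_sparse_col_idxs,
                               int nnz_per_row, int num_dense_rows, int num_dense_cols,
                               const float* d_dense_values, float* d_dense_output,
                               cudaStream_t stream) {
  embedKernelFixedNnz<float><<<num_sparse_rows, num_dense_cols, 0, stream>>>(
      num_sparse_rows, d_sparse_col_idxs, nnz_per_row,
      num_dense_rows, num_dense_cols, d_dense_values, d_dense_output);
}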
tayo commented 4 years ago

From my reading, these kernels are not "copying data to the GPU embedding tables". They seem to be embedding lookup kernels; an embedding lookup is often expressed as a sparse_matrix * dense_matrix computation. Do I misunderstand?

If this is the case, I would perhaps not describe this as a "glorified memcpy", since this is the fundamental operation of the forward pass of the embedding layer. I'm also not sure whether accelerator users usually replace the PyTorch lookup operation itself "on a per-implementation basis". Apologies if I misunderstand these kernels and they are not doing embedding lookups.

The reference model uses PyTorch's EmbeddingBag. The unit-valued embedding lookup case is a common one for embeddings. Couldn't this optimization be submitted as a fast path in PyTorch (or whatever framework)?

jonathan-cohen-nvidia commented 4 years ago

Yes, sorry, you are correct. In the kernels above the non-zero values are always 1, so the product ends up being a copy. But you are right that these are highly specialized sparse_matrix * dense_matrix products.

"Glorified memcpy" is only slightly overstating things, though. These are very simple operations that effectively copy some data from one format to another. Since it is possible to make all kinds of assumptions about the matrix layout, format, and contents, and since these kernels will all run at the bandwidth limit without much tuning, you end up wanting a custom-coded routine.

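To make that concrete, here is a hedged CPU reference in plain C++ (the names are assumptions for illustration, not from any submission): with all stored values equal to 1, the sparse product reduces to gathering the referenced embedding rows and summing them, and to a straight copy when each row has a single non-zero.

#include <vector>

// CPU reference for the unit-valued CSR embedding lookup (assumed names).
// Computes C = A * B where A is CSR with every stored value equal to 1:
// each output row is just the sum of the embedding-table rows its indices select.
std::vector<float> embedReference(const std::vector<int>& row_idxs,  // CSR row pointers, size rows + 1
                                  const std::vector<int>& col_idxs,  // CSR column indices
                                  const std::vector<float>& table,   // dense embedding table, row-major
                                  int dim) {                         // embedding dimension
  const int rows = static_cast<int>(row_idxs.size()) - 1;
  std::vector<float> out(static_cast<size_t>(rows) * dim, 0.0f);
  for (int r = 0; r < rows; ++r) {
    for (int i = row_idxs[r]; i < row_idxs[r + 1]; ++i) {
      const int k = col_idxs[i];                 // selected embedding row
      for (int d = 0; d < dim; ++d) {
        out[r * dim + d] += table[k * dim + d];  // no multiply: the stored value is 1
      }
    }
  }
  return out;
}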
bitfort commented 4 years ago

SWG Notes:

The question we are considering is whether a custom kernel for moving data is allowed under the MLPerf rules. In some cases, extensive custom kernels seem not to follow the spirit of the rules, but when the kernels are simple (such as the ones above) they may well be within it. It seems we need to "draw a line" indicating what is allowed and what isn't.

Kernels like this can have a significant impact on performance. There is a desire that, when we do work to improve performance, that work go into a framework or otherwise be reusable by others.

We will take some time for people to go over this issue and discuss more next week.

rnaidu02 commented 4 years ago

Intel's position is that custom kernels are OK as long as they are upstreamed and available for others to use. This position applies to inference as well.

tjablin commented 4 years ago

From my perspective, a custom kernel written in CUDA would be fine for inference.

bitfort commented 4 years ago

SWG:

There is a line here that we can work on drawing in the future. We will allow custom kernels for DLRM embeddings for this round; how to handle custom kernels more generally will be figured out later.

bitfort commented 4 years ago

SWG: This is a one-time decision, not a rules change.