
Sorting in embedding_dense_backward_cuda takes a very long time #30711

Open rabbitwayne opened 4 years ago

rabbitwayne commented 4 years ago

Sorting the indices in embedding_dense_backward_cuda takes a very long time. In BERT training, the sorting step in embedding_dense_backward_cuda accounts for almost 99% of the embedding backward time, which slows down the backward pass a lot. Can anyone explain why the indices are sorted here? Is there any way to improve this? Thanks a lot!

cc @ngimel @VitalyFedyunin @mruberry

rabbitwayne commented 4 years ago

Please refer to the code here: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Embedding.cu

zou3519 commented 4 years ago

Do you have a self-contained script that demonstrates the slowdown?

ngimel commented 4 years ago

Also, please make sure that it is not a profiling artifact, similar to #24912, where it turned out that embedding backward actually took a reasonable amount of time. Sorting in embedding backward is used to make sure that the results are deterministic. Smarter algorithms could be used to improve performance, but so far we have not seen a case where sorting is more than, say, 2% of overall time (usually much less), which makes it a not very interesting optimization target.
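
As a toy illustration of the determinism point (this is not PyTorch's actual backward kernel, just a Thrust sketch): once duplicate indices are grouped by a sort, the gradient contributions for each index can be reduced in a fixed order, whereas scattering them with racing atomicAdds would sum floats in an order that varies from run to run.

// Toy sketch only, not PyTorch's embedding backward. sort_by_key groups
// duplicate indices; reduce_by_key then accumulates each group's "gradients"
// in a fixed order, which is what makes the result deterministic.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  // One scalar "gradient" per index occurrence; indices 5 and 9 repeat.
  std::vector<int64_t> h_idx{2, 5, 9, 5, 5, 9};
  std::vector<float>   h_grad{1.f, 2.f, 3.f, 4.f, 5.f, 6.f};
  thrust::device_vector<int64_t> idx(h_idx.begin(), h_idx.end());
  thrust::device_vector<float>   grad(h_grad.begin(), h_grad.end());

  // Group occurrences of the same index together (the role of the sort).
  thrust::sort_by_key(idx.begin(), idx.end(), grad.begin());

  // Deterministically sum the gradients of each unique index.
  thrust::device_vector<int64_t> unique_idx(idx.size());
  thrust::device_vector<float>   grad_sum(idx.size());
  auto ends = thrust::reduce_by_key(idx.begin(), idx.end(), grad.begin(),
                                    unique_idx.begin(), grad_sum.begin());

  size_t n = ends.first - unique_idx.begin();
  for (size_t i = 0; i < n; ++i)
    printf("index %lld -> grad sum %.1f\n",
           (long long)(int64_t)unique_idx[i], (float)grad_sum[i]);
  return 0;
}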

rabbitwayne commented 4 years ago

@zou3519 @ngimel Thank you for your reply! I don't see a conclusion in #24912. What causes the backward pass to take 1300x longer than the forward? After reading the entire post, it is unclear whether the cause of the problem was ever pinpointed. I am seeing exactly the same problem as in #24912, where the embedding backward takes far longer than the forward, which is what prompted me to do this profiling.

To reproduce this problem, please follow these steps.

  1. Get BERT pretraining running. I am using the code here: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT. Please use BERT-large and a batch size of 64. I am training on 1 V100. If you profile, you will see that the backward pass takes about 12x longer than the forward pass for one batch, and that the embedding backward dominates the backward pass (a standalone timing sketch is included right after this list).
  2. Apply my profiling code in pytorch/aten/src/ATen/native/cuda/Embedding.cu. I have attached my Embedding.cu code with profiling. Re-compile and install PyTorch, then rerun BERT pretraining. You will see the profiling results.
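
For what it's worth, here is a minimal standalone timing sketch, separate from the BERT code above. The sizes are assumptions loosely modeled on BERT-large (30522-entry vocabulary, 1024-dim embeddings, a 64x128 index batch, i.e. 8192 indices); it times one embedding forward and one backward with explicit synchronization so that asynchronous launches are not mistaken for CPU-side cost.

// Hypothetical minimal repro sketch (libtorch C++ API); all sizes are assumptions.
#include <torch/torch.h>
#include <cuda_runtime.h>
#include <chrono>
#include <iostream>

int main() {
  torch::nn::Embedding emb(torch::nn::EmbeddingOptions(30522, 1024));
  emb->to(torch::kCUDA);
  auto indices = torch::randint(
      0, 30522, {64, 128},
      torch::TensorOptions().dtype(torch::kLong).device(torch::kCUDA));

  // Warm up once so one-time CUDA context/initialization cost is not measured.
  emb->forward(indices).sum().backward();

  auto time_us = [](auto&& fn) {
    cudaDeviceSynchronize();                  // drain previously queued work
    auto t0 = std::chrono::high_resolution_clock::now();
    fn();
    cudaDeviceSynchronize();                  // wait for the kernels just launched
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
  };

  torch::Tensor out;
  auto fwd_us = time_us([&] { out = emb->forward(indices); });
  auto bwd_us = time_us([&] { out.sum().backward(); });
  std::cout << "forward: " << fwd_us << " us, backward: " << bwd_us << " us\n";
  return 0;
}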

In my profiling, embedding_dense_backward_cuda takes 141751us, among which sorting takes 141563us.

Can you please try reproducing this problem? Or at least take a look at my profiling code and see if I made any stupid mistakes?

The number of indices in the embedding lookup is 8192. I am not sure whether this size is too large for sorting.
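
To check whether sorting 8192 keys is expensive in isolation, a micro-benchmark sketch of just the Thrust sort can be timed with CUDA events so that only GPU time is counted (8192 and the 30522 vocabulary range are taken from the numbers above; everything else is assumed):

// Micro-benchmark sketch: GPU time of thrust::sort_by_key on 8192 int64 keys,
// measured with CUDA events so asynchronous launch does not distort the result.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
  const int n = 8192;                               // indices per step, as above
  std::vector<int64_t> h_keys(n);
  for (int i = 0; i < n; ++i) h_keys[i] = rand() % 30522;   // vocab-sized indices

  thrust::device_vector<int64_t> keys(h_keys.begin(), h_keys.end());
  thrust::device_vector<int64_t> perm(n);
  thrust::sequence(perm.begin(), perm.end());       // 0..n-1, like orig_indices

  // Warm up once (the first call may include temporary-allocation overhead).
  {
    thrust::device_vector<int64_t> k(keys), p(perm);
    thrust::sort_by_key(k.begin(), k.end(), p.begin());
  }

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  thrust::device_vector<int64_t> k(keys), p(perm);
  cudaEventRecord(start);
  thrust::sort_by_key(k.begin(), k.end(), p.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("sort_by_key of %d int64 keys: %.3f ms\n", n, ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}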

Thank you very much for your help!

Below is my profiling code:

Tensor embedding_dense_backward_cuda(const Tensor & grad_, const Tensor & indices,
                               int64_t num_weights, int64_t padding_idx,
                               bool scale_grad_by_freq) {
  std::cout << "Entered embedding_dense_backward_cuda" << std::endl;
  auto stamp0 = std::chrono::high_resolution_clock::now();
  auto grad_arg = TensorArg(grad_, "grad", 1);
  auto indices_arg = TensorArg(indices, "indices", 1);
  checkScalarType("embedding_backward", indices_arg, kLong);
  checkSameGPU("embedding_backward", grad_arg, indices_arg);

  auto num_indices = indices.numel();
  auto grad = grad_.contiguous().view({num_indices, grad_.size(-1)});
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  std::cout << "num_indices: " << num_indices << std::endl;
  if (num_indices <= 768 && !scale_grad_by_freq) {
    auto indices_contig = indices.contiguous();
    auto grad_weight = at::zeros({num_weights, grad_.size(-1)}, grad_.options());
    int64_t stride = grad_weight.stride(0);
    dim3 grid(THCCeilDiv(stride, (int64_t)C10_WARP_SIZE));
    dim3 block(C10_WARP_SIZE, BLOCKDIMY);

    AT_DISPATCH_FLOATING_TYPES_AND_HALF
      (grad.scalar_type(),
       "embedding_backward",
       [&]
       {
         using accscalar_t = acc_type<scalar_t, true>;
         embedding_backward_feature_kernel<scalar_t, accscalar_t>
           <<<grid,
              block,
              sizeof(accscalar_t)*C10_WARP_SIZE*BLOCKDIMY + sizeof(int)*C10_WARP_SIZE*BLOCKDIMY,
              stream>>>
           (indices_contig.data_ptr<int64_t>(),
            grad.data_ptr<scalar_t>(),
            grad_weight.data_ptr<scalar_t>(),
            static_cast<int>(num_indices),
            static_cast<int64_t>(stride),
            static_cast<int>(padding_idx));
       });

    THCudaCheck(cudaGetLastError());
    return grad_weight;
  }

  auto stamp0_1 = std::chrono::high_resolution_clock::now();
  auto sorted_indices = at::empty_like(indices);
  auto orig_indices = at::empty_like(indices);
  using device_ptr = thrust::device_ptr<int64_t>;

  // Sort the inputs into sorted with the corresponding indices; we
  // don't need a stable or multidimensional sort, so just use Thrust
  // directly
  {
    sorted_indices.copy_(indices);

    auto allocator = THCThrustAllocator(globalContext().lazyInitCUDA());
    auto policy = thrust::cuda::par(allocator).on(stream);

    // Fill sortedOrigIndices with sequential indices
    auto count_iter = thrust::counting_iterator<int64_t>(0);
    auto orig_data = device_ptr(orig_indices.data_ptr<int64_t>());
    thrust::copy(policy, count_iter, count_iter + num_indices, orig_data);

    // Sort; a stable sort is not required
    auto sorted_data = device_ptr(sorted_indices.data_ptr<int64_t>());
    thrust::sort_by_key(policy, sorted_data, sorted_data + num_indices, orig_data,
                        ThrustLTOp<int64_t>());
  }

  auto stamp0_2 = std::chrono::high_resolution_clock::now();
  Tensor count;
  if (scale_grad_by_freq) {
    count = at::empty_like(indices);

    auto allocator = THCThrustAllocator(globalContext().lazyInitCUDA());
    auto policy = thrust::cuda::par(allocator).on(stream);

    // Compute an increasing sequence per unique item in sortedIndices:
    // sorted: 2 5 5 5 7 7 8 9 9
    //  count: 1 1 2 3 1 2 1 1 2
    auto sorted_data = device_ptr(sorted_indices.data_ptr<int64_t>());
    auto count_data = device_ptr(count.data_ptr<int64_t>());
    thrust::inclusive_scan_by_key(
      policy,
      sorted_data,
      sorted_data + num_indices,
      thrust::make_constant_iterator(1),
      count_data
    );

    // Take the maximum of each count per unique key in reverse:
    // sorted: 2 5 5 5 7 7 8 9 9
    //  count: 1 3 3 3 2 2 1 2 2
    thrust::inclusive_scan_by_key(
      policy,
      thrust::make_reverse_iterator(sorted_data + num_indices),
      thrust::make_reverse_iterator(sorted_data),
      thrust::make_reverse_iterator(count_data + num_indices),
      thrust::make_reverse_iterator(count_data + num_indices),
      thrust::equal_to<int64_t>(),
      thrust::maximum<int64_t>()
    );
  }

  auto stamp1 = std::chrono::high_resolution_clock::now();

  Tensor results = embedding_backward_cuda_kernel(grad, orig_indices,
      sorted_indices, count, num_weights, padding_idx);
  auto stamp2 = std::chrono::high_resolution_clock::now();
  std::cout << "1st backward cuda takes " << std::chrono::duration_cast<std::chrono::microseconds>(stamp1 - stamp0).count() << std::endl;
  std::cout << "2nd backward kernel takes " << std::chrono::duration_cast<std::chrono::microseconds>(stamp2 - stamp1).count() << std::endl;
  std::cout << "first kernel takes " << std::chrono::duration_cast<std::chrono::microseconds>(stamp0_1 - stamp0).count() << std::endl;
  std::cout << "sorting takes " << std::chrono::duration_cast<std::chrono::microseconds>(stamp0_2 - stamp0_1).count() << std::endl;
  std::cout << "scale takes " << std::chrono::duration_cast<std::chrono::microseconds>(stamp1 - stamp0_2).count() << std::endl;
  return results;
}

ngimel commented 4 years ago

The conclusion in #24912 was that embedding backward does not take 1300x longer than the forward. The profiling data posted by @colesbury (https://gist.github.com/colesbury/904b68a2a47b820d9d91db3e04d9800e) showed that embedding backward was in the single digits percentage-wise of end-to-end time, about 4 times more expensive than embedding forward, and that sorting itself was ~2%. The 1300x figure was an artifact of asynchronous CUDA execution and therefore of misleading timings.

rabbitwayne commented 4 years ago

@ngimel The profiling results in https://gist.github.com/colesbury/904b68a2a47b820d9d91db3e04d9800e only show GPU timing, not CPU timing. I understand that the embedding backward GPU kernel takes very little time, and this agrees with my profiling results as well. However, in my profiling, at::embedding_dense_backward indeed takes much, much longer than the GPU kernel within it: at::embedding_dense_backward takes 141751us while the GPU kernel within it only takes about 187us. My profiling results show that more than 99% of that time is spent on sorting, which takes 141563us. Can you take a look at the code I pasted above and see if my profiling makes sense? I took timestamps in the code using "auto stamp0 = std::chrono::high_resolution_clock::now();" and obtained the sorting time by subtracting two timestamps.

ngimel commented 4 years ago

You should be calling cudaDeviceSynchronize before your timing calls; otherwise you are measuring idle CPU time due to asynchronous CUDA execution. The only time you should care about is the time of the GPU kernels (which you can obtain from nvprof); the sort happens on the GPU, not the CPU.
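
A self-contained sketch of that point, using a dummy busy_kernel (nothing from PyTorch) to compare the two ways of timing:

// Sketch: a kernel launch returns to the CPU immediately, so std::chrono
// timestamps WITHOUT cudaDeviceSynchronize measure only launch overhead,
// while synchronizing before each timestamp measures the GPU work itself.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(float* data, int n, int iters) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.000001f + 1e-7f;  // burn time
    data[i] = v;
  }
}

int main() {
  const int n = 1 << 20;
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));
  cudaMemset(d, 0, n * sizeof(float));

  auto us = [](auto a, auto b) {
    return (long long)std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
  };

  // (a) No synchronization: measures only the asynchronous launch.
  auto t0 = std::chrono::high_resolution_clock::now();
  busy_kernel<<<(n + 255) / 256, 256>>>(d, n, 20000);
  auto t1 = std::chrono::high_resolution_clock::now();

  // (b) Synchronize before each timestamp: measures the kernel itself.
  cudaDeviceSynchronize();
  auto t2 = std::chrono::high_resolution_clock::now();
  busy_kernel<<<(n + 255) / 256, 256>>>(d, n, 20000);
  cudaDeviceSynchronize();
  auto t3 = std::chrono::high_resolution_clock::now();

  printf("without sync: %lld us, with sync: %lld us\n", us(t0, t1), us(t2, t3));
  cudaFree(d);
  return 0;
}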

rabbitwayne commented 4 years ago

[Two profiler screenshots: one taken with CUDA turned off (Screen Shot 2019-12-04 at 3.00.38 PM) and one with CUDA turned on (Screen Shot 2019-12-04 at 3.08.18 PM).]

Here are my two profiling results, one with CUDA turned off and the other with CUDA turned on. When CUDA is turned off, embedding_dense_backward takes about 141ms; when CUDA is turned on, it takes about 18ms of CPU time. Why is there such a big difference between CUDA on and off? Also, the CUDA time for embedding_dense_backward is about 1ms while its CPU time is about 18ms. Why is the CPU time so much longer than the CUDA time? Can I still trust the PyTorch profiler?

rabbitwayne commented 4 years ago

I added cudaDeviceSynchronize and now the profiled time corresponds to the GPU time. The embedding backward kernel takes about 300~400us and sorting takes about 100us. Altogether the embedding backward takes a little more than 1ms, which makes sense.

So the CPU time reported by the PyTorch profiler may include idle time due to asynchronous CUDA kernel launches and cannot be taken at face value. If an op has both significant CPU and GPU time, how can both be measured accurately? As it stands, the CPU time from the PyTorch profiler looks very unreliable.
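
One way to get both numbers for the same op (a sketch, with a plain cudaMemsetAsync standing in for any asynchronous GPU operation): std::chrono timestamps around the call, without synchronizing, give the CPU-side launch cost, while CUDA events recorded on the same stream give the GPU execution time.

// Sketch: CPU launch time via std::chrono (no sync), GPU time via CUDA events.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = size_t(1) << 30;      // 1 GiB, large enough to take ~ms
  void* d = nullptr;
  cudaMalloc(&d, bytes);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  auto c0 = std::chrono::high_resolution_clock::now();
  cudaEventRecord(start);                    // GPU-side start, default stream
  cudaMemsetAsync(d, 0, bytes);              // stand-in for any asynchronous GPU op
  cudaEventRecord(stop);                     // GPU-side end, default stream
  auto c1 = std::chrono::high_resolution_clock::now();   // CPU is not blocked here

  cudaEventSynchronize(stop);                // block only to read the GPU number
  float gpu_ms = 0.f;
  cudaEventElapsedTime(&gpu_ms, start, stop);

  auto cpu_us = std::chrono::duration_cast<std::chrono::microseconds>(c1 - c0).count();
  printf("CPU launch time: %lld us, GPU execution time: %.3f ms\n",
         (long long)cpu_us, gpu_ms);

  cudaFree(d);
  return 0;
}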

geekyGoku commented 1 year ago

Hello,

Is there any update on this? I am seeing similar issues where the embedding backward takes a huge amount of time.