microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

CUDA Custom Op CUDA failure #16748

Open HakubunLuo opened 1 year ago

HakubunLuo commented 1 year ago

Describe the issue

I implemented a custom op with CUDA. The kernel implementation, `OrtStatusPtr MyKernel::ComputeV2(OrtKernelContext *context)`, calls a CUDA kernel class:

```cpp
class MyCudaKernel {
    // ......
public:
    // Device pointers for the op's inputs and parameters
    float *weight;
    float *bias;
    float *input_voxel_features;
    unsigned int *input_coords;

    // Device pointer for the op's output
    float *output_voxel_features;
};
```

The memory is allocated with:
```cpp
cudaMallocManaged(&weight, weight_size);
cudaMallocManaged(&bias, bias_size);

cudaMallocManaged(&input_voxel_features, input_voxel_features_size);
cudaMallocManaged(&input_coords, input_coords_size);
cudaMallocManaged(&output_voxel_features, output_voxel_features_size);
```
I get the four inputs of the node:
```cpp
float *features = const_cast<float *>(ctx.GetInput(0).GetTensorData<float>());
unsigned int *indices = const_cast<unsigned int *>(ctx.GetInput(1).GetTensorData<unsigned int>());
float *weights = const_cast<float *>(ctx.GetInput(2).GetTensorData<float>());
float *bias = const_cast<float *>(ctx.GetInput(3).GetTensorData<float>());

std::vector<int64_t> intput_feat_dimensions = ctx.GetInput(0).GetTensorTypeAndShapeInfo().GetShape();
std::vector<int64_t> intput_indices_dimensions = ctx.GetInput(1).GetTensorTypeAndShapeInfo().GetShape();
std::vector<int64_t> weight_dimensions = ctx.GetInput(2).GetTensorTypeAndShapeInfo().GetShape();
std::vector<int64_t> bias_dimensions = ctx.GetInput(3).GetTensorTypeAndShapeInfo().GetShape();

std::vector<int64_t> dimensions = {output_dim, out_channels};
auto output = ctx.GetOutput(0, dimensions);

auto *out = output.GetTensorMutableData<float>();

kernel.input_voxel_features = features;
kernel.input_coords = indices;
kernel.weight = weights;
kernel.bias = bias;

cudaDeviceSynchronize();
cudaStreamDestroy(stream);

cudaMemcpy(out, kernel.output_voxel_features, dimensions[0] * dimensions[1] * sizeof(float), cudaMemcpyDeviceToDevice);
```


However, when I check the kernel's computation, everything is OK and the final result is as expected. But I get:
```
2023-07-18 19:57:23.027897517 [E:onnxruntime:Model, cuda_call.cc:116 CudaCall] CUDA failure 1: invalid argument ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/gpu_data_transfer.cc ; line=73 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyDeviceToHost, static_cast<cudaStream_t>(stream.GetHandle()));
terminate called after throwing an instance of 'Ort::Exception'
  what():  CUDA failure 1: invalid argument ; GPU=0 ; hostname=dell-Precision-3650-Tower ; file=/home/dell/onnxruntime/onnxruntime/core/providers/cuda/gpu_data_transfer.cc ; line=73 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyDeviceToHost, static_cast<cudaStream_t>(stream.GetHandle()));
```
Why does this happen? I checked the memory types in my class; all operations are on the CUDA device.

### To reproduce

You can refer to the code above.

### Urgency

_No response_

### Platform

Linux

### OS Version

Ubuntu 18.04

### ONNX Runtime Installation

Built from Source

### ONNX Runtime Version or Commit ID

main

### ONNX Runtime API

C++

### Architecture

X64

### Execution Provider

CUDA

### Execution Provider Library Version

CUDA 11.4

RyanUnderhill commented 1 year ago

Can you provide more info? Seeing how you implement MyKernel would help. Specifically, you should check what you're returning from GetExecutionProviderType and GetInputMemoryType.
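For example, a CUDA custom op usually declares both of those explicitly. A minimal sketch (the surrounding struct name and the omitted methods are placeholders, not your actual code):

```cpp
// Sketch: declare that this op runs on CUDA and that its inputs live in
// device memory, so GetTensorData() hands back device pointers.
struct MyCustomOp : Ort::CustomOpBase<MyCustomOp, MyKernel> {
    const char *GetExecutionProviderType() const {
        return "CUDAExecutionProvider";
    }

    // OrtMemTypeDefault = device memory; return OrtMemTypeCPUInput for any
    // input that must stay on the host (e.g. shape-like scalars).
    OrtMemType GetInputMemoryType(size_t /*index*/) const {
        return OrtMemTypeDefault;
    }

    // ... CreateKernel, GetName, input/output count and type methods ...
};
```

If GetExecutionProviderType is left at its default, the op is placed on the CPU and the input pointers are host memory, which would explain device-side reads going wrong.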

HakubunLuo commented 1 year ago

After checking the inputs to my custom op, I found something strange. I wrote a helper function to inspect the data in the inputs:

```cpp
template <typename T>
void printCudaArray(const T *d_array, int n) {
    T *h_array = new T[n];

    // Transfer the first n elements from device to host
    cudaMemcpy(h_array, d_array, n * sizeof(T), cudaMemcpyDeviceToHost);

    // Print the first n elements
    for (int i = 0; i < n; i++) {
        std::cout << h_array[i] << " ";
    }
    std::cout << std::endl;

    delete[] h_array;
}
```

I get the CUDA stream with:

```cpp
Ort::KernelContext ctx(context);
cudaStream_t stream = reinterpret_cast<cudaStream_t>(ctx.GetGPUComputeStream());
```

I found that the inputs are sometimes all zero, or the data obtained from `auto *features = ctx.GetInput(0).GetTensorData<float>();` is wrong.

RyanUnderhill commented 1 year ago

You need to show more of your code to know more about what's going on.

But you can't just call cudaMemcpy without running into timing issues. You need to be sure the results in that stream are complete first. If you're doing this for debugging purposes, you could just synchronize the entire device before doing the copy.

You also did this in your first bit of code:

```cpp
cudaDeviceSynchronize();
cudaStreamDestroy(stream);
```

Where are you creating the stream? You can't destroy the stream owned by ONNX Runtime.
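For reference, a rough sketch of the intended pattern: borrow the stream ONNX Runtime owns, enqueue all work on it, and never destroy or device-synchronize it yourself. The kernel body, its launch configuration, and the 2-D shape assumption here are placeholders:

```cpp
// Placeholder kernel just to make the sketch self-contained.
__global__ void my_kernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

OrtStatusPtr ComputeSketch(OrtKernelContext *context) {
    Ort::KernelContext ctx(context);
    // The stream belongs to ONNX Runtime; we only borrow the handle.
    cudaStream_t stream =
        reinterpret_cast<cudaStream_t>(ctx.GetGPUComputeStream());

    auto shape = ctx.GetInput(0).GetTensorTypeAndShapeInfo().GetShape();
    int n = static_cast<int>(shape[0] * shape[1]);
    const float *in = ctx.GetInput(0).GetTensorData<float>();
    float *out = ctx.GetOutput(0, shape).GetTensorMutableData<float>();

    // Enqueue on ORT's stream; no cudaDeviceSynchronize() and, above all,
    // no cudaStreamDestroy() -- the runtime manages the stream's lifetime.
    my_kernel<<<(n + 255) / 256, 256, 0, stream>>>(out, in, n);
    return nullptr;  // success
}
```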

HakubunLuo commented 1 year ago

Here is how I implement data loading in the kernel:

```cpp
OrtStatusPtr MyKernel::ComputeV2(OrtKernelContext *context) {
    Ort::KernelContext ctx(context);

    // Setup inputs
    auto intput_feat_dimensions = ctx.GetInput(0).GetTensorTypeAndShapeInfo().GetShape();
    auto intput_indices_dimensions = ctx.GetInput(1).GetTensorTypeAndShapeInfo().GetShape();
    auto weight_dimensions = ctx.GetInput(2).GetTensorTypeAndShapeInfo().GetShape();

    auto *features = ctx.GetInput(0).GetTensorData<float>();
    auto *indices = ctx.GetInput(1).GetTensorData<int>();
    auto *weights = ctx.GetInput(2).GetTensorData<float>();

    // Get kernel parameters from the output
    N = (int) intput_feat_dimensions[0];
    kernel_size = (int) weight_dimensions[1];
    in_channels = (int) weight_dimensions[weight_dimensions.size() - 1];
    out_channels = (int) weight_dimensions[0];

    if (intput_feat_dimensions[1] != in_channels) {
        throw std::runtime_error("input channels must match kernel size!");
    }

    // ...other operations
}
```

I got the CUDA stream with:

```cpp
cudaStream_t stream = reinterpret_cast<cudaStream_t>(ctx.GetGPUComputeStream());
```

HakubunLuo commented 1 year ago

In addition, when I used my helper function to print the data in `features`, here is one output:

```
0.185791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.197402 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
```

Here is another output:

```
0.185791 -0.83843 -0.629184 -0.34116 -0.449359 -0.181938 0.15788 0.461429 0.053838 0.6696 0.119553 0 0 0 0 0 -0.197402 -0.0100531 -0.100781 0.0967386 0.0356794 -0.19282 0.229912 -0.0361991 0.059382 -0.138059 -0.200917 0 0 0 0 0
```

It looks like not all values arrive correctly in the input; sometimes it is all zeros, which breaks my further operations. I also tried to create a copy of the input data:

```cpp
cudaMemcpyAsync(features, features, intput_feat_dimensions[0] * intput_feat_dimensions[1] * sizeof(float), cudaMemcpyDeviceToDevice, stream);
cudaMemcpyAsync(coords, indices, intput_indices_dimensions[0] * intput_indices_dimensions[1] * sizeof(int), cudaMemcpyDeviceToDevice, stream);
```

But it still does not get the complete data.

HakubunLuo commented 1 year ago

> You need to show more of your code to know more about what's going on.
>
> But you can't just call cudaMemcpy without running into timing issues. You need to be sure the results in that stream are complete first. If you're doing this for debugging purposes, you could just synchronize the entire device before doing the copy.
>
> You also did this in your first bit of code:
>
> ```cpp
> cudaDeviceSynchronize();
> cudaStreamDestroy(stream);
> ```
>
> Where are you creating the stream? You can't destroy the stream owned by ONNX Runtime.

Could you give me a hint on how to synchronize the device when I get the input? I couldn't find anything about this in the ONNX Runtime documentation. I used cudaDeviceSynchronize(); after calling GetTensorData, but it doesn't work. Thanks.

RyanUnderhill commented 1 year ago

> I got the CUDA stream with: `cudaStream_t stream = reinterpret_cast<cudaStream_t>(ctx.GetGPUComputeStream());`

Yes, this stream is owned by ONNX Runtime; do not try to destroy it. I'm surprised that didn't cause other failures in the code by making the stream invalid.

> Could you give me a hint on how to synchronize the device when I get the input? I couldn't find anything about this in the ONNX Runtime documentation. I used cudaDeviceSynchronize(); after calling GetTensorData, but it doesn't work. Thanks.

Hmm, everything done using the same stream should not need any extra synchronization. The stream is the synchronization.

Are you accessing data from the CPU and not seeing it reliably? For example, doing cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream). In that case you need to call cudaStreamSynchronize(stream) to ensure the previous operations on the stream are complete before the CPU accesses the data.
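A minimal sketch of that debug pattern, assuming d_array is a device pointer and stream is the handle from GetGPUComputeStream() (printCudaArraySynced is just an illustrative name):

```cpp
#include <iostream>
#include <vector>

// Copy n floats to the host on the op's own stream, then block the host
// until that copy (and all earlier work on the stream) has finished.
void printCudaArraySynced(const float *d_array, int n, cudaStream_t stream) {
    std::vector<float> host(n);
    cudaMemcpyAsync(host.data(), d_array, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // host[] is now safe to read on the CPU

    for (int i = 0; i < n; i++) {
        std::cout << host[i] << " ";
    }
    std::cout << std::endl;
}
```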

Are you using multiple streams?

HakubunLuo commented 1 year ago

Yes, I only use one (the default) stream.

HakubunLuo commented 1 year ago

After testing with different data, I found that the problem may happen when the input data size is large (more than 100,000 × 16). Are there any limitations in ONNX Runtime on context inputs and outputs?

RyanUnderhill commented 1 year ago

The only limitation is the available memory on the device you're running the model on. 100,000 × 16 is pretty small.
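(For scale: 100,000 × 16 float32 values is 100,000 × 16 × 4 bytes = 6.4 MB.)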