microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

DLRM model failure to execute on GPU #5295

Open CoderHam opened 4 years ago

CoderHam commented 4 years ago

Describe the bug I am running a DLRM model with ONNX Runtime 1.4.0 through the native C API. The same model runs successfully through the Python API but crashes when using the C API.

Urgency Blocks use of DLRM on ONNX Runtime

System information

To Reproduce

The error I received when running on GPU:

2020-09-25 21:36:09.156043014 [E:onnxruntime:test, cuda_call.cc:119 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=hemantj-X299-A ; expr=cudaMemcpy(dst_data, src_data, bytes, cudaMemcpyHostToDevice); 
2020-09-25 21:36:09.156891585 [E:onnxruntime:, memcpy.cc:19 Compute] CUDA error executing cudaMemcpy(dst_data, src_data, bytes, cudaMemcpyHostToDevice) Copying 52 to 52_CUDAExecutionProvider Input shape:{351} Output shape:{351} X data:0x55f38971fe00 Y data:0x7fbb46500000
2020-09-25 21:36:09.157358507 [E:onnxruntime:, sequential_executor.cc:309 Execute] Non-zero status code returned while running MemcpyFromHost node. Name:'Memcpy' Status Message: CUDA error executing cudaMemcpy(dst_data, src_data, bytes, cudaMemcpyHostToDevice)
terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
  what():  /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:123 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:117 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=hemantj-X299-A ; expr=cudaEventDestroy(read_event_); 

Aborted (core dumped)

Expected behavior Model runs successfully on GPU.
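For reference, the Python path that reportedly passes can be sketched roughly as follows. This is a minimal sketch, not the reporter's actual script: the model path and input feed are placeholders, and the `providers` keyword argument comes from onnxruntime releases newer than the 1.4.0 used here.

```python
# Hedged sketch: run an ONNX model via the Python API, preferring the CUDA
# execution provider and falling back to CPU. Model path and feed dict are
# placeholders, not values from this issue.
import importlib.util


def run_with_cuda(model_path, feed):
    """Return all model outputs, or None if onnxruntime is not installed."""
    if importlib.util.find_spec("onnxruntime") is None:
        return None  # onnxruntime not available in this environment
    import onnxruntime as ort

    sess = ort.InferenceSession(
        model_path,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    # First argument None means "return every output of the model".
    return sess.run(None, feed)
```

Comparing the outputs of this path against the C-API path on identical inputs is one way to confirm the crash is specific to the C API rather than the model itself.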

tianleiwu commented 4 years ago

@CoderHam, could you share your onnx model so that I could reproduce the issue?

CoderHam commented 4 years ago

@tianleiwu here are the model and the gen script used to produce the weights. I triggered the failure with a custom script, but onnxruntime_perf_test should reproduce it as well.

model.zip

jupvfranco commented 3 years ago

I think I'm getting the same or similar problem (please let me know if I should open a different issue).

I'm running:

mpirun -n 2 ~/src/onnxruntime/build/Linux/Debug/onnxruntime_training_mnist --use_cuda --use_nccl --model_name mnist_gemm --train_data_dir mnist_data --log_dir logs/  --num_train_steps 1 --pipeline_parallel_size 2 --cut_group_info T2

and I get the following errors:

CUDA failure 700: an illegal memory access was encountered ; GPU=1 ; hostname=jufranc01 ; expr=cudaStreamSynchronize(stream_)
CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=jufranc01 ; expr=cudaMemcpy(tensor->MutableDataRaw(), buffer.get() + tensor_offset_in_bytes, tensor->SizeInBytes(), cudaMemcpyDeviceToDevice);

Stack trace: gdb.txt
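CUDA error 700 is often reported long after the kernel that actually faulted. One generic way to localize the first bad access is to launch each MPI rank under cuda-memcheck, which ships with the CUDA 11 toolkit; this is a sketch of such an invocation (reusing the command above), not something run in this thread.

```shell
# Launch each rank under cuda-memcheck so the faulting kernel and address
# are reported at the point of the illegal access, not at a later sync.
mpirun -n 2 cuda-memcheck ~/src/onnxruntime/build/Linux/Debug/onnxruntime_training_mnist \
  --use_cuda --use_nccl --model_name mnist_gemm --train_data_dir mnist_data \
  --log_dir logs/ --num_train_steps 1 --pipeline_parallel_size 2 --cut_group_info T2
```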

System information

OS Platform and Distribution: Ubuntu 18.04.5 LTS
ONNX Runtime installed from: source
ONNX Runtime version: onnxruntime-gpu-1.5.0
GCC/Compiler version: 7.5.0
CUDA/cuDNN version: CUDA 11.0 (and NCCL 2.7.8)
GPU model and memory: Tesla V100-PCIE-16GB

mkfilipiuk commented 3 years ago

Is there any update on this issue?

CoderHam commented 2 years ago

@jupvfranco @tianleiwu is there any update on this?

atmadeep commented 1 year ago

Hi, regarding running the DLRM model on onnxruntime: could you please share the scripts you used, if possible? I want to run the same model (the 12 GB version) in Python with the onnxruntime backend on CPU.
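A CPU-only run of the kind being asked about can be sketched like this; it is a generic example, not the script used in this thread, and the model path and input data are placeholders. `get_inputs()` is used so the feed dict matches whatever input names the model declares.

```python
# Hedged sketch: CPU-only inference with onnxruntime's Python API.
# The model path and the `data` mapping (input name -> numpy array)
# are placeholders the caller must supply.
import importlib.util


def run_on_cpu(model_path, data):
    """Run inference on CPU; returns None if onnxruntime is not installed."""
    if importlib.util.find_spec("onnxruntime") is None:
        return None  # onnxruntime not available in this environment
    import onnxruntime as ort

    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    # Build the feed from the model's declared input names so the script
    # does not hard-code names from any particular DLRM export.
    feed = {inp.name: data[inp.name] for inp in sess.get_inputs()}
    return sess.run(None, feed)
```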