Open CoderHam opened 4 years ago
@CoderHam, could you share your onnx model so that I could reproduce the issue?
@tianleiwu here is the model and the gen script to produce the weights. I used a custom script to produce the failure but onnxruntime_perf_test should confirm the failure as well.
I think I'm getting the same or similar problem (please let me know if I should open a different issue).
I'm running:
mpirun -n 2 ~/src/onnxruntime/build/Linux/Debug/onnxruntime_training_mnist --use_cuda --use_nccl --model_name mnist_gemm --train_data_dir mnist_data --log_dir logs/ --num_train_steps 1 --pipeline_parallel_size 2 --cut_group_info T2
and I get the following errors:
CUDA failure 700: an illegal memory access was encountered ; GPU=1 ; hostname=jufranc01 ; expr=cudaStreamSynchronize(stream_)
CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=jufranc01 ; expr=cudaMemcpy(tensor->MutableDataRaw(), buffer.get() + tensor_offset_in_bytes, tensor->SizeInBytes(), cudaMemcpyDeviceToDevice);
Stack trace: gdb.txt
System information
OS Platform and Distribution: Ubuntu 18.04.5 LTS
ONNX Runtime installed from: source
ONNX Runtime version: onnxruntime-gpu-1.5.0
GCC/Compiler version: 7.5.0
CUDA/cuDNN version: CUDA version 11.0 (and NCCL 2.7.8)
GPU model and memory: Tesla V100-PCIE-16GB
Is there any update on this issue?
@jupvfranco @tianleiwu is there any update on this?
Hi, could you please share the scripts you use to run the DLRM model on onnxruntime, if possible? I want to run the same model (12GB version) from Python using the onnxruntime backend on CPU.
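For a CPU-only run from Python, a minimal sketch could look like the following. It is not the script used in this issue. It assumes a recent onnxruntime release in which `InferenceSession` accepts a `providers` argument, it feeds all-zero tensors rather than real DLRM data, and the helper names `resolve_shape` and `run_on_cpu` as well as the type map are made up for illustration.

```python
# Hypothetical helper names; only the onnxruntime calls themselves
# (InferenceSession, get_inputs, run) come from the library.

# Partial map from onnxruntime type strings to numpy dtype names (assumption:
# the model only uses these input types; extend as needed).
ORT_TO_NUMPY = {
    "tensor(float)": "float32",
    "tensor(double)": "float64",
    "tensor(int32)": "int32",
    "tensor(int64)": "int64",
}


def resolve_shape(shape, batch_size=1):
    """Replace symbolic or unknown dimensions (str or None) with batch_size."""
    return [d if isinstance(d, int) and d > 0 else batch_size for d in shape]


def run_on_cpu(model_path, batch_size=1):
    """Run one inference on CPU with all-zero inputs and return the outputs."""
    # Imported lazily so the helpers above work without these packages.
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    feeds = {}
    for inp in sess.get_inputs():
        shape = resolve_shape(inp.shape, batch_size)
        dtype = ORT_TO_NUMPY.get(inp.type, "float32")
        # Zeros keep any embedding indices in range; real inputs would come
        # from the dataset used to train the model.
        feeds[inp.name] = np.zeros(shape, dtype=dtype)
    # Passing None as the first argument requests all model outputs.
    return sess.run(None, feeds)
```

Calling `run_on_cpu("dlrm.onnx")` (the path is a placeholder) should be enough to check whether the model loads and executes on the CPU provider before moving on to real inputs.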
Describe the bug
I am running a DLRM model on ONNX Runtime 1.4.0 through the native C API. The same model runs successfully through the Python API but crashes through the C API.
Urgency
Blocks use of DLRM on ONNX Runtime
System information
To Reproduce
Code used to run the model: onnxruntime_c_test.txt
Model built from the Facebook DLRM GitHub repository.
The error I received when running on GPU:
Expected behavior
Model runs successfully on GPU.