pommedeterresautee closed this issue 2 years ago
That is strange. Can you try reverting the change, or building off a commit before my commit?
"As there is very little commits between 1.11.1 and PR https://github.com/microsoft/onnxruntime/pull/11320...." -- This isn't true. There are quite a few commits that didn't make it into 1.11.1 that could have caused this. Please try with a commit before #11320.
Thank you @hariharans29 for your fast answer. You are totally right and I am sorry for my misleading statement; there have been plenty of commits in between. I didn't realize until recently that in released versions, very recent PRs are cherry-picked...
So I just rebuilt from fdce4fa6af437b0b822958ab47b3b8f77f9e14ae, which is the last commit in master before the merge of #11320.
Command line used to build:
```shell
# git clone ...
git checkout -b before_11320 fdce4fa6af437b0b822958ab47b3b8f77f9e14ae
CUDACXX=/usr/local/cuda-11.4/bin/nvcc ./build.sh \
    --config Release \
    --build_wheel \
    --parallel \
    --use_cuda \
    --cuda_home /usr/local/cuda-11.4 \
    --cudnn_home /usr/lib/x86_64-linux-gnu/ \
    --skip_test
```
The same error occurs, so it is not related to PR #11320; again, sorry for my misleading statement.
To be sure it's not a compilation issue on my side, I recompiled version 1.11.1 a second time and there is no bug, so basically it's one of the many commits in between. Do you have an idea which PRs I might test?
After a bunch of compilations, it seems that the issue appears with PR #11127 (which is related to external data). Linked issue: https://github.com/microsoft/onnxruntime/issues/10977. Tagging to notify: @IkerAriz (PR author), @snnn (reviewer).
To reach this conclusion I performed the following compilations (master branch):
Please, let me know if you can reproduce the error message.
Thanks. I will take a look.
@snnn in case you had the time to work on this issue, have you been able to reproduce it? If yes, did you find a way to revert the commit and still have code compiling and working as expected with external data?
I tried a model: tf_inception_v1. It works fine on CPU.
Hi, thank you for your test. It's the same for me; the crash only occurs with the CUDA provider.
Looking.
Thank you, I saw it. The buffers of the weights were not filled.
Hi Changming. Which weight buffers were unfilled?
You can get a model from https://github.com/tensorflow/models/tree/master/research/slim, for example resnet50. Then convert it to ONNX using the TF-to-ONNX converter. Then use ONNX's API to split the weights out as external data, and run it with CUDA. I think the problem is then 100% reproducible.
Thank you very much @snnn for the revert. I can confirm that the current master branch, compiled with the PR reverted, also works on my side with several >2 GB NLP models.
Hi @snnn, #11789 is submitted to address this issue and reinstate the mmap copy bypass. Please have a look if you have a chance. Thank you 👍
Describe the bug
A recent PR, #11320, fixed a bug that made models with an If node slower when an input is consumed only by subgraphs of the If node. However, it seems to have introduced a bug that makes ONNX Runtime crash when the CUDA provider is used on a model with external data (> 2 GB models); the CPU provider is OK. It may be related to PR #11320; see the code below to reproduce the bug.
As there are very few commits between 1.11.1 and PR #11320, the PR may have introduced the new behavior.
Error message:
Urgency: if the bug is reproduced by ONNX Runtime maintainers, it should probably be fixed before the next release of ONNX Runtime.
System information
To Reproduce: the code below will generate the ONNX file and raise the error message.
Expected behavior: no crash.
Screenshots N/A
Additional context: tagging @hariharans29, as he seems to know a lot about the topic.