Open RobinGRAPIN opened 10 months ago
related error message: Non-zero status code returned while running Gather node. Name:'/sa1/Gather_136' Status Message: indices element out of data bounds, idx=100 must be within the inclusive range [-100,99]
Yes, but the error only occurs during the second Run, as if idx hadn't been reset to 0 at the end of the first inference.
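To pin down "works on the first Run, fails on the second", a small harness that calls the same inference several times and records which calls raise can help. This is only a sketch: `run_repeatedly` is a hypothetical helper, and `session` and `feeds` in the usage note are assumed to be an `onnxruntime.InferenceSession` and its input dict, neither of which is shown in this thread.

```python
from typing import Any, Callable, List, Optional, Tuple


def run_repeatedly(
    run: Callable[[], Any], times: int = 2
) -> List[Tuple[int, Optional[Exception]]]:
    """Call `run` `times` times and record, per call, the exception raised (or None)."""
    results: List[Tuple[int, Optional[Exception]]] = []
    for i in range(times):
        try:
            run()
            results.append((i, None))
        except Exception as exc:  # keep going so every call is reported
            results.append((i, exc))
    return results
```

With a real session this would be called as `run_repeatedly(lambda: session.run(None, feeds))`. If call 0 reports `None` and call 1 reports an exception for identical inputs, some state is carried over between runs.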
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Still not solved
It's not solved. I'm running onnxruntime 1.18.0 dev with the TensorRT backend, and unfortunately this error is still here. The first run goes OK, but the second run throws an error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[34], line 7
5 face_image = input_image
6 app.prepare(ctx_id=0, det_size=(640, 640))
----> 7 face_info = app.get(cv2.cvtColor(np.array(face_image), cv2.COLOR_RGB2BGR))
8 face_info = sorted(
9 face_info,
10 key=lambda x:
11 (x['bbox'][2] - x['bbox'][0]) * (x['bbox'][3] - x['bbox'][1]))[
12 -1] # only use the maximum face
13 face_emb = face_info['embedding']
File ~/anaconda3/envs/diffusion/lib/python3.11/site-packages/insightface/app/face_analysis.py:59, in FaceAnalysis.get(self, img, max_num)
58 def get(self, img, max_num=0):
---> 59 bboxes, kpss = self.det_model.detect(img,
60 max_num=max_num,
61 metric='default')
62 if bboxes.shape[0] == 0:
63 return []
File ~/anaconda3/envs/diffusion/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py:224, in RetinaFace.detect(self, img, input_size, max_num, metric)
221 det_img = np.zeros( (input_size[1], input_size[0], 3), dtype=np.uint8 )
222 det_img[:new_height, :new_width, :] = resized_img
--> 224 scores_list, bboxes_list, kpss_list = self.forward(det_img, self.det_thresh)
226 scores = np.vstack(scores_list)
227 scores_ravel = scores.ravel()
File ~/anaconda3/envs/diffusion/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py:158, in RetinaFace.forward(self, img, threshold)
156 fmc = self.fmc
157 for idx, stride in enumerate(self._feat_stride_fpn):
--> 158 scores = net_outs[idx]
159 bbox_preds = net_outs[idx+fmc]
160 bbox_preds = bbox_preds * stride
IndexError: list index out of range
I traced this error down: it happens exclusively with TensorrtExecutionProvider with trt_cuda_graph_enable on the 1.18.0dev version. Everything without CUDA graph works fine, including CPUExecutionProvider, CUDAExecutionProvider (with no options, though) and TensorrtExecutionProvider with fp16, engine cache and timing cache. Pretty strange behaviour.
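For anyone who wants to compare the two configurations, here is a rough sketch of how the provider list might be built with and without CUDA graph. The exact options used above were not posted, so the option set below is an assumption based on the TensorRT execution provider's documented option names, and `build_providers` / `make_session` are hypothetical helpers.

```python
from typing import Any, Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, Any]]]


def build_providers(cuda_graph: bool) -> List[Provider]:
    """Provider list with TensorRT first, then CUDA and CPU as fallbacks."""
    trt_options = {
        "trt_fp16_enable": True,           # assumed: "fp16" from the comment above
        "trt_engine_cache_enable": True,   # assumed: "engine cache"
        "trt_timing_cache_enable": True,   # assumed: "timing cache"
        "trt_cuda_graph_enable": cuda_graph,  # the option under suspicion
    }
    return [
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]


def make_session(model_path: str, cuda_graph: bool):
    """Create an InferenceSession; onnxruntime is imported lazily."""
    import onnxruntime as ort

    return ort.InferenceSession(model_path, providers=build_providers(cuda_graph))
```

Creating one session with `cuda_graph=True` and one with `cuda_graph=False`, then running each twice on the same input, should show whether the second-run failure follows that single option.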
Describe the issue
Running an ORT session twice in Python leads to an error, always about an index out of range somewhere in the graph's operations. This makes me think it is caused by a variable in a "for loop" inside the graph that is not reset between the two runs.
I tried several networks and get this kind of error during the second run(): InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Gather node. Name:'/sa1/Gather_136' Status Message: indices element out of data bounds, idx=100 must be within the inclusive range [-100,99]
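For reference, the range in that message follows from the ONNX Gather rule that, along an axis of size s, indices must lie in [-s, s-1], with negative values counting from the end. The snippet below is an illustrative re-statement of that rule in plain Python, not ONNX Runtime's actual implementation; both function names are hypothetical.

```python
def gather_index_in_bounds(idx: int, dim_size: int) -> bool:
    """True if idx is a valid Gather index for an axis of length dim_size."""
    return -dim_size <= idx <= dim_size - 1


def normalize_index(idx: int, dim_size: int) -> int:
    """Map a valid (possibly negative) index to its non-negative position."""
    if not gather_index_in_bounds(idx, dim_size):
        raise IndexError(
            f"idx={idx} must be within the inclusive range "
            f"[{-dim_size},{dim_size - 1}]"
        )
    return idx + dim_size if idx < 0 else idx
```

So for an axis of length 100, idx=100 is exactly one past the last valid position, which is what an index computed one step too far would produce.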
The input for inference can even be the same as the one used for tracing.
For some networks, an interesting thing I noticed is that exporting with dynamic_axes removes this problem, as if being exported this way allows a kind of 'cache' in the inner variables of the model to be emptied.
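As a sketch of what the two export variants look like side by side: the tensor names "input" and "output" and the choice of axis 0 as the dynamic one are placeholders, and `export_kwargs` / `export_model` are hypothetical helpers rather than the export code used for this issue.

```python
from typing import Any, Dict


def export_kwargs(dynamic: bool) -> Dict[str, Any]:
    """Keyword arguments for torch.onnx.export, with or without dynamic_axes."""
    kwargs: Dict[str, Any] = {
        "input_names": ["input"],    # placeholder name
        "output_names": ["output"],  # placeholder name
    }
    if dynamic:
        # Mark axis 0 of both tensors as variable-length (assumed axis).
        kwargs["dynamic_axes"] = {"input": {0: "batch"}, "output": {0: "batch"}}
    return kwargs


def export_model(model, example_input, path: str, dynamic: bool) -> None:
    """Export `model` to ONNX; torch is imported lazily."""
    import torch

    torch.onnx.export(model, (example_input,), path, **export_kwargs(dynamic))
```

Exporting the same model once with `dynamic=False` and once with `dynamic=True`, then running each file twice, would isolate whether the export mode alone changes the behaviour.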
To reproduce
Dockerfile
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
RUN pip install onnx
RUN pip install onnxruntime
Minimal code
Architecture
export / import code
"inner variables reset" with dynamic axes
Urgency
No response
Platform
Linux
OS Version
1 SMP Thu Aug 31 10:29:22 EDT 2023
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.16.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response