TedThemistokleous opened this issue 1 year ago
Ping @PeixuanZuo @cloudhan @ytaous. Let me know if you know the best person to debug this.
Additional debug output when enabling verbose logging via --log_verbose for the parity tests:
root@f12f6ee19192:/onnxruntime/onnxruntime/test/python/transformers# python3 test_parity_gelu.py --no_optimize --log_verbose
Testing: device=cuda, float16=False, optimized=False, batch_size=4, sequence_length=2, hidden_size=1, formula=1, fp32_gelu_op=True
/usr/local/lib/python3.8/dist-packages/onnx/mapping.py:27: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
int(TensorProto.STRING): np.dtype(np.object)
====== Diagnostic Run torch.onnx.export version 2.1.0.dev20230706+rocm5.5 ======
verbose: False, log level: 40
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
exported: ./temp/gelu_1_fp32.onnx
2023-07-20 16:14:44.305197677 [I:onnxruntime:, inference_session.cc:328 operator()] Flush-to-zero and denormal-as-zero are off
2023-07-20 16:14:44.305212777 [I:onnxruntime:, inference_session.cc:336 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2023-07-20 16:14:44.305220238 [I:onnxruntime:, inference_session.cc:354 ConstructorCommon] Dynamic block base set to 0
2023-07-20 16:14:44.333905241 [I:onnxruntime:, inference_session.cc:1400 Initialize] Initializing session.
2023-07-20 16:14:44.335605980 [V:onnxruntime:, session_state.cc:1142 VerifyEachNodeIsAssignedToAnEp] Node placements
2023-07-20 16:14:44.335616710 [V:onnxruntime:, session_state.cc:1145 VerifyEachNodeIsAssignedToAnEp] All nodes placed on [MIGraphXExecutionProvider]. Number of nodes: 1
2023-07-20 16:14:44.335625780 [V:onnxruntime:, session_state.cc:126 CreateGraphInfo] SaveMLValueNameIndexMapping
2023-07-20 16:14:44.335635421 [V:onnxruntime:, session_state.cc:172 CreateGraphInfo] Done saving OrtValue mappings.
2023-07-20 16:14:44.336079521 [I:onnxruntime:, allocation_planner.cc:2393 CreateGraphPartitioner] Use DeviceBasedPartition as default
2023-07-20 16:14:44.336111841 [I:onnxruntime:, session_state_utils.cc:201 SaveInitializedTensors] Saving initialized tensors.
2023-07-20 16:14:44.336331146 [I:onnxruntime:, session_state_utils.cc:344 SaveInitializedTensors] Done saving initialized tensors
2023-07-20 16:14:44.336355047 [I:onnxruntime:, inference_session.cc:1767 Initialize] Session successfully initialized.
2023-07-20 16:14:44.343208424 [V:onnxruntime:, sequential_executor.cc:534 ExecuteThePlan] Number of streams: 1
2023-07-20 16:14:44.343232585 [V:onnxruntime:, sequential_executor.cc:184 SessionScope] Begin execution
2023-07-20 16:14:45.282814577 [V:onnxruntime:, sequential_executor.cc:518 ExecuteKernel] stream 0 launch kernel with idx 5
2023-07-20 16:14:45.283567614 [V:onnxruntime:, sequential_executor.cc:534 ExecuteThePlan] Number of streams: 1
2023-07-20 16:14:45.283579404 [V:onnxruntime:, sequential_executor.cc:184 SessionScope] Begin execution
2023-07-20 16:14:45.283630945 [V:onnxruntime:, sequential_executor.cc:518 ExecuteKernel] stream 0 launch kernel with idx 5
Output 0, diff=17.570600510 index=(2, 0, 0) ort=-0.038370539 torch=17.532230377
input tensor([[[-17.0660781860],
[ 5.9965152740]],
[[ 16.1023044586],
[ 4.0838894844]],
[[ 17.5322303772],
[ 10.0256319046]],
[[ -2.4230549335],
[ 3.6469326019]]], device='cuda:0')
torch_outputs (tensor([[[-0.0000000000],
[ 5.9965152740]],
[[16.1023044586],
[ 4.0837988853]],
[[17.5322303772],
[10.0256319046]],
[[-0.0186461769],
[ 3.6464486122]]], device='cuda:0'),)
ort_outputs [array([[[-0.0000000000],
[-0.1111395806]],
[[ 4.6619696617],
[20.7520313263]],
[[-0.0383705385],
[-0.0533541143]],
[[-0.0000000000],
[ 5.0611729622]]], dtype=float32)]
[FAILED] Passed_cases=1/2; Max_diff=17.570600509643555; Diff_count=2
F
======================================================================
FAIL: test_cuda (__main__.TestGeluParity)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_parity_gelu.py", line 236, in test_cuda
self.run_one(self.optimized, gpu, hidden_size=self.hidden_size, formula=i, verbose=self.verbose)
File "test_parity_gelu.py", line 188, in run_one
self.run_test(
File "test_parity_gelu.py", line 184, in run_test
self.assertTrue(num_failure == 0, "Failed: " + test_name)
AssertionError: False is not true : Failed: device=cuda, float16=False, optimized=False, batch_size=4, sequence_length=2, hidden_size=1, formula=1, fp32_gelu_op=True
----------------------------------------------------------------------
Ran 1 test in 1.549s
Is there a way for us to add a sync between kernels, or to wait on stream completion? It appears we're not performing a sync before each run.
What is odd is that we don't observe this behavior on our MI250 card at all.
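For illustration, here's a minimal HIP sketch of the two kinds of synchronization I have in mind (the kernels, streams, and launch parameters below are hypothetical, not from the ONNX Runtime code):

```cpp
#include <hip/hip_runtime.h>

// Hypothetical kernels, for illustration only.
__global__ void producer(float* buf) { /* ... */ }
__global__ void consumer(const float* buf, float* out) { /* ... */ }

void run(float* buf, float* out, hipStream_t s0, hipStream_t s1) {
  hipLaunchKernelGGL(producer, dim3(1), dim3(64), 0, s0, buf);

  // Option 1: block the host until everything queued on s0 has finished.
  hipStreamSynchronize(s0);

  // Option 2 (no host stall): record an event on s0 and make s1 wait on it,
  // so the consumer cannot start until the producer's work has completed.
  hipEvent_t done;
  hipEventCreate(&done);
  hipEventRecord(done, s0);
  hipStreamWaitEvent(s1, done, 0);

  hipLaunchKernelGGL(consumer, dim3(1), dim3(64), 0, s1, buf, out);

  hipStreamSynchronize(s1);
  hipEventDestroy(done);
}
```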
I am not quite sure how far back you have tried to trace back before failing to find a stable point. If possible, could you please try 13495 to see if the stream problem is present right before that commit?
Sure. Currently tried: f4cd35f9b1301f54d65a3e59c525b92e85bf384e, but I'm getting the abseil build failures; I may need to patch that before I can get a working build.
Rolled back to my original commit where the MIGraphX stream-related changes were introduced (October 2022), but there are no changes to the no_optimize flags in the parity tests.
I have a fix that changes the hipMemcpy() to a proper hipMemcpyAsync() when building ONNX Runtime with MIGraphX containing the stream sync functionality.
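Roughly, the shape of the change looks like the sketch below. This is illustrative only; the function and variable names here are made up and are not the actual GPUDataTransfer::CopyTensor implementation or the exact diff in the PR:

```cpp
#include <hip/hip_runtime.h>

// Sketch of a device-to-host tensor copy, assuming the EP keeps a handle to
// the stream it launches kernels on (ep_stream). Names are illustrative.
void CopyDeviceToHost(void* dst, const void* src, size_t bytes, hipStream_t ep_stream) {
  // Before: a plain hipMemcpy. It blocks the host, but it may not wait for
  // kernels still running on ep_stream (e.g. a non-blocking stream), so the
  // copy can read the output buffer before the kernel has written it.
  // hipMemcpy(dst, src, bytes, hipMemcpyDeviceToHost);

  // After: issue the copy on the same stream the kernels run on, so it is
  // ordered after them, then synchronize that stream before the host reads dst.
  hipMemcpyAsync(dst, src, bytes, hipMemcpyDeviceToHost, ep_stream);
  hipStreamSynchronize(ep_stream);
}
```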
Found a fix and opened a PR for this. Please let me know if there are any issues with the changes from the Microsoft side.
@TedThemistokleous Is this still a problem after your PR?
Describe the issue
Currently running through a set of parity tests found in /onnxruntime/onnxruntime/test/python/transformers/, primarily test_parity_gelu.py and test_parity_layernorm.py.
We're experiencing out-of-order memcopies that seem to occur during kernel execution on our Navi21 card.
Here's an example of the output when we use the ROCm tracing tools to view the sequence of events (captured with rocprof, then viewed with perfetto/chrome://tracing):
I'm able to trigger this case consistently. I've cut the GELU test down to only perform two test runs per kernel, and it always fails on the second. I found that when we run only one test, this out-of-order error never happens.
I've also noticed that if I increase the hidden layer size in the test_parity_gelu.py test, I reach a point (around 100x the hidden layer size) where the tests always pass and we don't get an overlap.
I've cut down test_parity_gelu.py on a separate branch of my ORT fork off mainline: https://github.com/TedThemistokleous/onnxruntime/tree/debug_parity_tests.
The behavior goes away entirely if we add a sync between every single kernel run, thus undoing multi-stream execution (see the sketch below).
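As a rough illustration of what that workaround amounts to (hypothetical kernel and names, not the ONNX Runtime code), a device-wide sync after every launch serializes everything:

```cpp
#include <hip/hip_runtime.h>

__global__ void some_kernel(float* buf) { /* ... */ }

// Brute-force workaround: a full device sync after every launch. This makes
// the parity tests pass on Navi21, but it removes any overlap between
// streams, which is exactly what multi-stream execution is meant to provide.
void run_serialized(float* buf, hipStream_t stream) {
  hipLaunchKernelGGL(some_kernel, dim3(1), dim3(64), 0, stream, buf);
  hipDeviceSynchronize();  // wait for *all* outstanding work on the device
}
```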
The reason I'm bringing this up to ONNX Runtime is that, after a few weeks of debugging this (configuration, previous builds, etc.), I've been unable to find a working stable point using the Navi21 card (gfx1030).
From a recent stack trace taken with GDB while running the test, I've found that the hipMemcpy in question is being called via onnxruntime::GPUDataTransfer::CopyTensor.
Here's the stack trace I've mentioned.
Urgency
Urgent. Blocking builds of ROCm
Target platform
Navi21
Build script
Error / output
Tests fail due to accuracy errors for test_parity_gelu.py and test_parity_layernorm.py
For layernorm
Visual Studio Version
No response
GCC / Compiler Version
No response