microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Debugging capability of onnxruntime in Visual Studio 2019 incapacitated #4812

Open jeyblu opened 4 years ago

jeyblu commented 4 years ago

Describe the bug
Variables that used to be expandable and visible in the Visual Studio debugger are no longer expandable or visible after ac725b53f6405bc607a61860e983772c33472880: Convert TensorRT provider into a shared library (#4721)

Urgency
Blocking development of dnnl features that are targeted for September 2020

System information

To Reproduce
git clone --recursive https://github.com/Microsoft/onnxruntime
cd onnxruntime
git checkout ac4997665
.\build.bat --config Debug --build_shared_lib --build_wheel --parallel --use_dnnl --skip_tests --cmake_generator "Visual Studio 16 2019"

Open build\debug\onnxruntime.sln in Visual Studio
Set onnx_test_runner as the startup project in Visual Studio
Right click on the onnx_test_runner project and select Properties
In the Debugging property page, set command arguments to "-e dnnl testdata\squeezenet" and working directory to "$(ProjectDir)$(Configuration)"
Put a breakpoint at line 225 of onnxruntime\onnxruntime\core\providers\dnnl\dnnl_execution_provider.cc
Start debugging by pressing F5
When it hits the breakpoint, you can expand the node and graph_viewer variables in the debugger to see their values and members.

git checkout ac725b53f
Repeat the steps above
When it hits the breakpoint, you can no longer access the variables like before.

Expected behavior
The Visual Studio debugger should be able to expand variables to show their values and members.

Screenshots
[Screenshot: ac4997665]
[Screenshot: ac725b53f]

hariharans29 commented 4 years ago

CC: @pranavsharma @RyanUnderhill - They have the best context on the change you are referring to. They might be able to address your concerns.

RyanUnderhill commented 4 years ago

Ah, this is because the shared provider interface hides the definitions of all of the internal objects so they are now opaque handles to the real objects.

A workaround is to reinterpret_cast them in the debugger to the internal types, like (onnxruntime::GraphViewer)graph_viewer (since the Provider_GraphViewer is really an onnxruntime::GraphViewer* but the debugger doesn't know that anymore).

This does give me an idea: for debugging, we could create a natvis file that tells the debugger this information so it can show up automatically. The drawback is that this relies on implementation details (not all Provider_* interfaces are just a reinterpret_cast).
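
The opaque-handle situation described above can be pictured with a minimal C++ sketch. Everything here is an illustrative assumption, not onnxruntime's actual code: `core_sketch::GraphViewer` is a hypothetical stand-in for the internal type, and `ToHandle` / `FromHandle` are invented helper names. In the real layout the full definition lives only in the core library, so a debugger stopped in provider code sees just the incomplete `Provider_GraphViewer` declaration and has no members to expand.

```cpp
#include <cassert>
#include <string>
#include <vector>

// "Core" side: the full definition, normally not visible to provider code.
// Hypothetical stand-in for the internal GraphViewer type.
namespace core_sketch {
struct GraphViewer {
  std::string name;
  std::vector<int> node_indices;
};
}  // namespace core_sketch

// "Provider" side: only an incomplete declaration is available, so a debugger
// working from this type alone has no member layout to display.
struct Provider_GraphViewer;

// Core hands the provider an opaque handle that points at the real object.
inline const Provider_GraphViewer* ToHandle(const core_sketch::GraphViewer& g) {
  return reinterpret_cast<const Provider_GraphViewer*>(&g);
}

// The manual debugger cast amounts to this: reinterpret the handle as a
// pointer to the real type so its members become reachable again.
inline const core_sketch::GraphViewer* FromHandle(const Provider_GraphViewer* h) {
  return reinterpret_cast<const core_sketch::GraphViewer*>(h);
}
```

The sketch only holds for handles that really are a plain reinterpret_cast of the internal object; as noted above, that is not true of every Provider_* interface.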

jeyblu commented 4 years ago

Thanks for the workaround suggestion. It didn't seem to work. Even if it worked, it would be tedious and inefficient to manually cast every variable we wanted to debug. Our request is to restore the visibility of variables so that developers can use the debugger effectively. Thanks! ac725b5

yuslepukhin commented 4 years ago

@RyanUnderhill We may want to provide an addition to autoexp.dat to print our types if possible. See natvis which is a replacement for autoexp.dat

RyanUnderhill commented 4 years ago

@jeyblu Ah, I see what happened. I was doing (onnx::GraphProto*)&graph_proto and that does work. The other one does not, but you can just hop up the callstack and see the original type being passed in.

My natvis idea would likely have the same problem, since the providers do not directly have access to any of the internal types. I'm not sure how to get the debugger to know about these yet.

jeyblu commented 4 years ago

If you go up the callstack, it is GraphViewer, as shown below. However, casting it to GraphViewer as mentioned previously didn't work. This was working before the changes in ac725b5, which made several objects opaque and impossible to debug. Can you consider reverting some of these changes in ac725b5 to restore debug capability? Thanks.

https://github.com/microsoft/onnxruntime/blob/ac725b53f6405bc607a61860e983772c33472880/onnxruntime/core/providers/dnnl/dnnl_execution_provider.cc#L195
https://github.com/microsoft/onnxruntime/blob/ac725b53f6405bc607a61860e983772c33472880/onnxruntime/core/framework/provider_bridge_ort.cc#L173
https://github.com/microsoft/onnxruntime/blob/ac725b53f6405bc607a61860e983772c33472880/onnxruntime/core/framework/graph_partitioner.cc#L155
https://github.com/microsoft/onnxruntime/blob/ac725b53f6405bc607a61860e983772c33472880/onnxruntime/core/framework/graph_partitioner.cc#L139

RyanUnderhill commented 4 years ago

@jeyblu I checked in a change today that made it so that the casting trick above works (also simplifies a lot of things). I'll try to improve it further but wanted to give an update on the efforts so far.

jeyblu commented 4 years ago

I tried commit b11c10634. It still didn't work.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

jeyblu commented 3 years ago

Any update?

pranavsharma commented 3 years ago

We have it in our backlog and will address it in the upcoming sprints. @RyanUnderhill

jeyblu commented 3 years ago

Is there any update on this issue? Thanks

RyanUnderhill commented 3 years ago

@jeyblu Unfortunately I don't think there is a fix for this currently. The debugger is being true to the code, in that the provider code does not have internal implementation details, so the debugger doesn't either. The method I've used is to just look at the address of the object you want to see internally, then switch to a stack frame inside the onnxruntime core code and cast it to the type. Since that stack frame is in code that knows the full type information, the debugger will show it.

A few more of the types are now shared with the internal code, like the IExecutionProvider and the IAllocator types. So these will now 'just work'. But types like 'GraphViewer' cannot.
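
The address-then-cast method described above works because only core code carries the full type information. As a rough illustration, here is a hypothetical debug helper defined on the "core" side that accepts the raw address noted in the provider frame and formats the object. All names (`core_sketch`, `DebugDescribeGraphViewer`, `AddressOf`) are invented for this sketch; nothing like it is confirmed to exist in onnxruntime.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

namespace core_sketch {
// Hypothetical stand-in for the internal type; fully defined only in "core".
struct GraphViewer {
  std::string name;
  std::vector<int> node_indices;
};

// Lives in "core", where the layout is known. Takes the raw address that was
// read off in the provider frame and interprets it as the real type.
inline std::string DebugDescribeGraphViewer(std::uintptr_t address) {
  const auto* g = reinterpret_cast<const GraphViewer*>(address);
  std::ostringstream out;
  out << "GraphViewer{name=" << g->name
      << ", nodes=" << g->node_indices.size() << "}";
  return out.str();
}
}  // namespace core_sketch

// Provider side: the handle is opaque, but its address is always available.
struct Provider_GraphViewer;

inline std::uintptr_t AddressOf(const Provider_GraphViewer* handle) {
  return reinterpret_cast<std::uintptr_t>(handle);
}
```

In the debugger the same steps are done by hand: note the handle's address in the provider frame, select a frame inside the core code, and cast the address to the internal pointer type there. Like the manual cast, this is only valid for handles that point directly at the internal object.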

jeyblu commented 3 years ago

@RyanUnderhill Thanks for the suggestion. We'll try it. Is it possible to give access to the underlying GraphViewer object in the provider GraphViewer object?