microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Occasional NaN results during inference #19851

Closed goutamyg closed 4 months ago

goutamyg commented 4 months ago

Describe the issue

My C++ inference script for visual object tracking intermittently produces NaN output for every frame in the video (in roughly 4 out of 10 runs); in the remaining runs the outputs are as expected. I added a check for NaNs in the input, and the input data seems fine. The Python-based inference script does not have this issue.

To reproduce

Follow the instructions here: https://github.com/goutamyg/MVT.cpp/tree/main/onnxruntime (the link has the code and the pretrained model).

Urgency

Somewhat urgent

Platform

Linux

OS Version

Ubuntu 22.04.3 LTS

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-linux-x64-1.12.1 (also tested with the recent 1.16.1)

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

https://drive.google.com/file/d/15dI9j7UQc35pcWjD0133eRzLh0P_fRvx/view

Is this a quantized model?

No

hariharans29 commented 4 months ago

If the Python-based inference script returns the right results but the C++ program doesn't, it is most likely that the pre-processing step in C++ doesn't match the Python pre-processing. The underlying op implementations in the C++ and Python ORT runtimes are exactly the same, and unless there is a strange bug in the C++ API layer (we are not aware of any), it is not possible for C++ ORT to produce different results than Python ORT for the same raw input (assuming the ORT versions are the same for Python and C++).

"I have added a condition to verify if the input has any NaNs": this is a good check, but NaN outputs can come from non-NaN inputs as well, depending on the logic in the model, if it is working on the assumption of a certain pre-processing logic but the inputs don't conform to that.
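As an illustration of that point, an input check can look for more than NaNs. The following is a minimal sketch, not code from the reporter's repository: `InputReport` and `inspect_input` are hypothetical names, and the expected range `[lo, hi]` is an assumption that depends on the model's pre-processing (for example `[0, 1]` for pixel values divided by 255).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of suspicious values found in a float buffer.
struct InputReport {
    std::size_t non_finite = 0;    // NaN or +/-Inf
    std::size_t out_of_range = 0;  // finite, but outside [lo, hi]
};

// Counts values that are NaN/Inf or fall outside the expected range.
// The range [lo, hi] is an assumption about the pre-processing,
// not something ONNX Runtime enforces.
inline InputReport inspect_input(const std::vector<float>& data, float lo, float hi) {
    InputReport report;
    for (float v : data) {
        if (!std::isfinite(v)) {
            ++report.non_finite;
        } else if (v < lo || v > hi) {
            ++report.out_of_range;
        }
    }
    return report;
}
```

A very large but finite value passes a NaN-only check, so counting out-of-range values separately can flag inputs that a plain NaN test would accept.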

A good check would be to see if the "raw" inputs to ORT (after pre-processing) match between C++ and Python. You can search through the existing (open and closed) issues in the repo. There are quite a few issues along the lines of "Python works but C++ doesn't" (especially for image models), and the cause almost invariably turns out to be a quirk of OpenCV that resulted in incorrectly pre-processed inputs. Please look through these issues to see if something in them applies to your case as well.
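One way to compare the raw inputs is to write the pre-processed C++ buffer to disk as headerless float32 and load it on the Python side. This is a sketch under assumptions: `write_raw_floats` and `read_raw_floats` are hypothetical helpers, the file name is a placeholder, and the byte order is whatever the host uses (little-endian on x86-64).

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Writes the buffer as raw float32 values with no header.
// Returns false if the file cannot be opened or written.
inline bool write_raw_floats(const std::string& path, const std::vector<float>& data) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(float)));
    return static_cast<bool>(out);
}

// Reads a raw float32 file back into a vector (empty on failure).
inline std::vector<float> read_raw_floats(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamsize bytes = in.tellg();  // opened at end, so this is the size
    in.seekg(0);
    std::vector<float> data(static_cast<std::size_t>(bytes) / sizeof(float));
    in.read(reinterpret_cast<char*>(data.data()),
            static_cast<std::streamsize>(data.size() * sizeof(float)));
    return data;
}
```

On the Python side, `np.fromfile("input_z.bin", dtype=np.float32)` loads such a file (the file name here is only an example), and the result can be compared against the Python pre-processing output with `np.allclose` after flattening both arrays.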

goutamyg commented 4 months ago

Hi @hariharans29, thank you for your quick response.

  1. I compared the pre-processing results of the Python and C++ scripts, and they match one another. I have also added the Python inference script to my repository.
  2. As I mentioned earlier, the same executable sometimes generates correct results and sometimes outputs NaNs or arbitrarily large numbers when re-executed. The whole codebase is deterministic (i.e., no random initialization or shuffling is involved), hence I don't know the reason behind the stochastic nature of the generated output.
  3. I did not find a related issue that matches my description. This one is closely related, where the issue was the input order, which seems fine in my case. If it were not, I should be getting NaN results consistently across multiple executions.
  4. Currently I am analyzing the parts of the code that involve float pointers (for example, A and B), in case the pointers are referring to incorrect memory addresses. These code snippets are based on implementations available online, and I have tweaked them to fit my application.

hariharans29 commented 4 months ago

Can you please check if some of the discussion here applies to you: https://github.com/microsoft/onnxruntime/issues/11979 ? The issue has the same theme as yours: Python seems to work and C++ returns wrong results. I see that the OP in #11979 is also using OpenCV.

tianleiwu commented 4 months ago

@goutamyg, you can try building from source and enabling the dumping of node inputs/outputs: https://onnxruntime.ai/docs/build/inferencing.html#debugnodeinputsoutputs That will show you which inputs are actually used and which node causes the NaN (usually it is caused by overflow in fp16).

goutamyg commented 4 months ago

@tianleiwu The build is successful, but 1 out of the 4 test cases fails (I think these tests are related to the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1 flag). I have attached the log file for your reference. output.txt

goutamyg commented 4 months ago

@hariharans29 I checked the C++ code you suggested. It resembles the model architecture I am using (i.e., two inputs and multiple outputs). However, in my case, the output is a set of float values, which are used to compute the bounding-box coordinates as the tracking output. I tried changing my code to handle the inputs and outputs, but the stochastic nature of NaN output persists.

hariharans29 commented 4 months ago

Thanks for the feedback. It is going to be very hard to debug your C++ app as there are some components (like OpenCV) that I am not fully familiar with.

1) Can you freeze and share a raw input (with all necessary pre-processing already done) that best demonstrates the stochastic nature of the outputs, i.e., an input with which you can reproduce the NaN output more often than not?

2) Can you also share a simple C++ program that consumes the above input and runs ORT with all the settings you use in your original application?

Also please confirm if the Python ORT version is the same as the C++ ORT version.

hariharans29 commented 4 months ago

> @tianleiwu The build is successful, but 1 out of the 4 test cases fail (I think these tests are related to the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1 flag). I have attached the log file for your reference. output.txt

If the build passes, worth a shot to see which layer starts producing NaNs using this approach.

goutamyg commented 4 months ago

@hariharans29 I don't find the libonnxruntime.so file in the build/ folder. The available .so files are libtest_execution_provider.so and libonnxruntime_providers_shared.so, which throw errors when I include them in the cmake file.

I did

git clone --recursive --branch v1.12.1 https://github.com/microsoft/onnxruntime.git
cd onnxruntime/
./build.sh --cmake_extra_defines onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1

and the compilation output, before initiating the test cases, says:

Synchronizing submodule url for 'cmake/external/FP16'
Synchronizing submodule url for 'cmake/external/SafeInt/safeint'
Synchronizing submodule url for 'cmake/external/XNNPACK'
Synchronizing submodule url for 'cmake/external/cub'
Synchronizing submodule url for 'cmake/external/cxxopts'
Synchronizing submodule url for 'cmake/external/date'
Synchronizing submodule url for 'cmake/external/dlpack'
Synchronizing submodule url for 'cmake/external/eigen'
Synchronizing submodule url for 'cmake/external/emsdk'
Synchronizing submodule url for 'cmake/external/flatbuffers'
Synchronizing submodule url for 'cmake/external/googlebenchmark'
Synchronizing submodule url for 'cmake/external/googletest'
Synchronizing submodule url for 'cmake/external/json'
Synchronizing submodule url for 'cmake/external/libprotobuf-mutator'
Synchronizing submodule url for 'cmake/external/mimalloc'
Synchronizing submodule url for 'cmake/external/mp11'
Synchronizing submodule url for 'cmake/external/nsync'
Synchronizing submodule url for 'cmake/external/onnx'
Synchronizing submodule url for 'cmake/external/onnx/third_party/benchmark'
Synchronizing submodule url for 'cmake/external/onnx/third_party/pybind11'
Synchronizing submodule url for 'cmake/external/onnx-tensorrt'
Synchronizing submodule url for 'cmake/external/onnx-tensorrt/third_party/onnx'
Synchronizing submodule url for 'cmake/external/onnx-tensorrt/third_party/onnx/third_party/benchmark'
Synchronizing submodule url for 'cmake/external/onnx-tensorrt/third_party/onnx/third_party/pybind11'
Synchronizing submodule url for 'cmake/external/onnx-tensorrt/third_party/onnx/third_party/pybind11/tools/clang'
Synchronizing submodule url for 'cmake/external/onnxruntime-extensions'
Synchronizing submodule url for 'cmake/external/protobuf'
Synchronizing submodule url for 'cmake/external/protobuf/third_party/benchmark'
Synchronizing submodule url for 'cmake/external/protobuf/third_party/googletest'
Synchronizing submodule url for 'cmake/external/pthreadpool'
Synchronizing submodule url for 'cmake/external/pytorch_cpuinfo'
Synchronizing submodule url for 'cmake/external/re2'
Synchronizing submodule url for 'cmake/external/tensorboard'
Synchronizing submodule url for 'cmake/external/wil'
-- 
-- 3.18.1.0
-- Using the single-header code from /home/goutam/third_party_libs/onnxruntime/cmake/external/json/single_include/
-- 
-- ******** Summary ********
--   CMake version             : 3.22.1
--   CMake command             : /usr/bin/cmake
--   System                    : Linux
--   C++ compiler              : /usr/bin/c++
--   C++ compiler version      : 11.4.0
--   CXX flags                 :  -ffunction-sections -fdata-sections -DCPUINFO_SUPPORTED -Wnon-virtual-dtor
--   Build type                : Debug
--   Compile definitions       : EIGEN_MPL2_ONLY;PLATFORM_POSIX;__STDC_FORMAT_MACROS
--   CMAKE_PREFIX_PATH         : 
--   CMAKE_INSTALL_PREFIX      : /usr/local
--   CMAKE_MODULE_PATH         : /home/goutam/third_party_libs/onnxruntime/cmake/external
-- 
--   ONNX version              : 1.12.0
--   ONNX NAMESPACE            : onnx
--   ONNX_USE_LITE_PROTO       : ON
--   USE_PROTOBUF_SHARED_LIBS  : OFF
--   Protobuf_USE_STATIC_LIBS  : ON
--   ONNX_DISABLE_EXCEPTIONS   : OFF
--   ONNX_WERROR               : OFF
--   ONNX_BUILD_TESTS          : OFF
--   ONNX_BUILD_BENCHMARKS     : OFF
--   ONNXIFI_DUMMY_BACKEND     : OFF
--   ONNXIFI_ENABLE_EXT        : OFF
-- 
--   Protobuf compiler         : 
--   Protobuf includes         : 
--   Protobuf libraries        : 
--   BUILD_ONNX_PYTHON         : OFF
-- Configuring done
-- Generating done
-- Build files have been written to: /home/goutam/third_party_libs/onnxruntime/build/Linux/Debug
Consolidate compiler generated dependencies of target flatbuffers
[  0%] Built target flatbuffers
Consolidate compiler generated dependencies of target clog
[  0%] Built target clog
Consolidate compiler generated dependencies of target cpuinfo
[  1%] Built target cpuinfo
Consolidate compiler generated dependencies of target libprotobuf
[  7%] Built target libprotobuf
Consolidate compiler generated dependencies of target libprotoc
[ 13%] Built target libprotoc
Consolidate compiler generated dependencies of target protoc
[ 13%] Built target protoc
[ 13%] Built target gen_onnx_proto
[ 13%] Built target gen_onnx_data_proto
Consolidate compiler generated dependencies of target libprotobuf-lite
[ 16%] Built target libprotobuf-lite
[ 17%] Built target gen_onnx_operators_proto
Consolidate compiler generated dependencies of target onnx_proto
[ 18%] Built target onnx_proto
Consolidate compiler generated dependencies of target onnxruntime_common
[ 20%] Built target onnxruntime_common
Consolidate compiler generated dependencies of target onnxruntime_graph
[ 21%] Built target onnxruntime_graph
Consolidate compiler generated dependencies of target onnxruntime_framework
[ 27%] Built target onnxruntime_framework
Consolidate compiler generated dependencies of target onnxruntime_util
[ 27%] Built target onnxruntime_util
Consolidate compiler generated dependencies of target onnx
[ 31%] Built target onnx
Consolidate compiler generated dependencies of target onnxruntime_providers
[ 44%] Built target onnxruntime_providers
Consolidate compiler generated dependencies of target onnxruntime_providers_shared
[ 44%] Built target onnxruntime_providers_shared
Consolidate compiler generated dependencies of target onnxruntime_optimizer
[ 50%] Built target onnxruntime_optimizer
Consolidate compiler generated dependencies of target onnxruntime_session
[ 51%] Built target onnxruntime_session
Scanning dependencies of target onnxruntime_mlas
Consolidate compiler generated dependencies of target onnxruntime_mlas
[ 56%] Built target onnxruntime_mlas
Consolidate compiler generated dependencies of target flatc
[ 58%] Built target flatc
Consolidate compiler generated dependencies of target onnxruntime_flatbuffers
[ 58%] Built target onnxruntime_flatbuffers
Consolidate compiler generated dependencies of target onnxruntime_test_utils
[ 59%] Built target onnxruntime_test_utils
Consolidate compiler generated dependencies of target onnx_test_data_proto
[ 59%] Built target onnx_test_data_proto
Consolidate compiler generated dependencies of target onnx_test_runner_common
[ 60%] Built target onnx_test_runner_common
Consolidate compiler generated dependencies of target absl_log_severity
[ 61%] Built target absl_log_severity
Consolidate compiler generated dependencies of target absl_raw_logging_internal
[ 61%] Built target absl_raw_logging_internal
Consolidate compiler generated dependencies of target absl_bad_variant_access
[ 61%] Built target absl_bad_variant_access
Consolidate compiler generated dependencies of target gtest
[ 61%] Built target gtest
Consolidate compiler generated dependencies of target gmock
[ 61%] Built target gmock
Consolidate compiler generated dependencies of target nsync_cpp
[ 63%] Built target nsync_cpp
Consolidate compiler generated dependencies of target re2
[ 65%] Built target re2
Consolidate compiler generated dependencies of target absl_spinlock_wait
[ 65%] Built target absl_spinlock_wait
Consolidate compiler generated dependencies of target absl_base
[ 65%] Built target absl_base
Consolidate compiler generated dependencies of target absl_malloc_internal
[ 65%] Built target absl_malloc_internal
Consolidate compiler generated dependencies of target absl_throw_delegate
[ 66%] Built target absl_throw_delegate
Consolidate compiler generated dependencies of target absl_time_zone
[ 67%] Built target absl_time_zone
Consolidate compiler generated dependencies of target absl_debugging_internal
[ 67%] Built target absl_debugging_internal
Consolidate compiler generated dependencies of target absl_stacktrace
[ 67%] Built target absl_stacktrace
Consolidate compiler generated dependencies of target absl_strings_internal
[ 68%] Built target absl_strings_internal
Consolidate compiler generated dependencies of target absl_demangle_internal
[ 68%] Built target absl_demangle_internal
Consolidate compiler generated dependencies of target absl_int128
[ 68%] Built target absl_int128
Consolidate compiler generated dependencies of target absl_strings
[ 69%] Built target absl_strings
Consolidate compiler generated dependencies of target absl_symbolize
[ 69%] Built target absl_symbolize
Consolidate compiler generated dependencies of target absl_exponential_biased
[ 70%] Built target absl_exponential_biased
Consolidate compiler generated dependencies of target absl_graphcycles_internal
[ 71%] Built target absl_graphcycles_internal
Consolidate compiler generated dependencies of target absl_civil_time
[ 71%] Built target absl_civil_time
Consolidate compiler generated dependencies of target absl_time
[ 71%] Built target absl_time
Consolidate compiler generated dependencies of target absl_synchronization
[ 71%] Built target absl_synchronization
Consolidate compiler generated dependencies of target absl_hashtablez_sampler
[ 71%] Built target absl_hashtablez_sampler
Consolidate compiler generated dependencies of target absl_bad_optional_access
[ 71%] Built target absl_bad_optional_access
Consolidate compiler generated dependencies of target absl_raw_hash_set
[ 71%] Built target absl_raw_hash_set
Consolidate compiler generated dependencies of target absl_city
[ 72%] Built target absl_city
Consolidate compiler generated dependencies of target absl_low_level_hash
[ 72%] Built target absl_low_level_hash
Consolidate compiler generated dependencies of target absl_hash
[ 72%] Built target absl_hash
Consolidate compiler generated dependencies of target absl_cord_internal
[ 72%] Built target absl_cord_internal
Consolidate compiler generated dependencies of target absl_cordz_functions
[ 72%] Built target absl_cordz_functions
Consolidate compiler generated dependencies of target absl_cordz_handle
[ 73%] Built target absl_cordz_handle
Consolidate compiler generated dependencies of target absl_cordz_info
[ 73%] Built target absl_cordz_info
Consolidate compiler generated dependencies of target absl_cord
[ 73%] Built target absl_cord
Consolidate compiler generated dependencies of target onnxruntime_test_all
[ 93%] Built target onnxruntime_test_all
Consolidate compiler generated dependencies of target onnx_test_runner
[ 93%] Built target onnx_test_runner
Consolidate compiler generated dependencies of target onnxruntime_perf_test
[ 94%] Built target onnxruntime_perf_test
Consolidate compiler generated dependencies of target onnxruntime_test_debug_node_inputs_outputs
[ 94%] Built target onnxruntime_test_debug_node_inputs_outputs
Consolidate compiler generated dependencies of target onnxruntime_mlas_test
[ 95%] Built target onnxruntime_mlas_test
Consolidate compiler generated dependencies of target custom_op_library
[ 95%] Built target custom_op_library
Consolidate compiler generated dependencies of target test_execution_provider
[ 95%] Built target test_execution_provider
Consolidate compiler generated dependencies of target absl_scoped_set_env
[ 95%] Built target absl_scoped_set_env
Consolidate compiler generated dependencies of target absl_strerror
[ 95%] Built target absl_strerror
Consolidate compiler generated dependencies of target absl_examine_stack
[ 95%] Built target absl_examine_stack
Consolidate compiler generated dependencies of target absl_failure_signal_handler
[ 95%] Built target absl_failure_signal_handler
Consolidate compiler generated dependencies of target absl_leak_check
[ 95%] Built target absl_leak_check
Consolidate compiler generated dependencies of target absl_leak_check_disable
[ 95%] Built target absl_leak_check_disable
Consolidate compiler generated dependencies of target absl_flags_program_name
[ 95%] Built target absl_flags_program_name
Consolidate compiler generated dependencies of target absl_flags_config
[ 95%] Built target absl_flags_config
Consolidate compiler generated dependencies of target absl_str_format_internal
[ 95%] Built target absl_str_format_internal
Consolidate compiler generated dependencies of target absl_flags_marshalling
[ 96%] Built target absl_flags_marshalling
Consolidate compiler generated dependencies of target absl_flags_commandlineflag_internal
[ 96%] Built target absl_flags_commandlineflag_internal
Consolidate compiler generated dependencies of target absl_flags_commandlineflag
[ 96%] Built target absl_flags_commandlineflag
Consolidate compiler generated dependencies of target absl_flags_private_handle_accessor
[ 96%] Built target absl_flags_private_handle_accessor
Consolidate compiler generated dependencies of target absl_flags_reflection
[ 96%] Built target absl_flags_reflection
Consolidate compiler generated dependencies of target absl_flags_internal
[ 96%] Built target absl_flags_internal
Consolidate compiler generated dependencies of target absl_flags
[ 96%] Built target absl_flags
Consolidate compiler generated dependencies of target absl_flags_usage_internal
[ 96%] Built target absl_flags_usage_internal
Consolidate compiler generated dependencies of target absl_flags_usage
[ 96%] Built target absl_flags_usage
Consolidate compiler generated dependencies of target absl_flags_parse
[ 96%] Built target absl_flags_parse
Consolidate compiler generated dependencies of target absl_periodic_sampler
[ 96%] Built target absl_periodic_sampler
Consolidate compiler generated dependencies of target absl_random_distributions
[ 96%] Built target absl_random_distributions
Consolidate compiler generated dependencies of target absl_random_seed_gen_exception
[ 97%] Built target absl_random_seed_gen_exception
Consolidate compiler generated dependencies of target absl_random_internal_seed_material
[ 97%] Built target absl_random_internal_seed_material
Consolidate compiler generated dependencies of target absl_random_internal_platform
[ 98%] Built target absl_random_internal_platform
Consolidate compiler generated dependencies of target absl_random_internal_randen_hwaes_impl
[ 98%] Built target absl_random_internal_randen_hwaes_impl
Consolidate compiler generated dependencies of target absl_random_internal_randen_slow
[ 98%] Built target absl_random_internal_randen_slow
Consolidate compiler generated dependencies of target absl_random_internal_randen_hwaes
[ 98%] Built target absl_random_internal_randen_hwaes
Consolidate compiler generated dependencies of target absl_random_internal_randen
[ 98%] Built target absl_random_internal_randen
Consolidate compiler generated dependencies of target absl_random_internal_pool_urbg
[ 98%] Built target absl_random_internal_pool_urbg
Consolidate compiler generated dependencies of target absl_random_seed_sequences
[ 98%] Built target absl_random_seed_sequences
Consolidate compiler generated dependencies of target absl_random_internal_distribution_test_util
[ 98%] Built target absl_random_internal_distribution_test_util
Consolidate compiler generated dependencies of target absl_status
[100%] Built target absl_status
Consolidate compiler generated dependencies of target absl_statusor
[100%] Built target absl_statusor
Consolidate compiler generated dependencies of target absl_cordz_sample_token
[100%] Built target absl_cordz_sample_token
Consolidate compiler generated dependencies of target absl_bad_any_cast_impl
[100%] Built target absl_bad_any_cast_impl
UpdateCTestConfiguration  from :/home/goutam/third_party_libs/onnxruntime/build/Linux/Debug/DartConfiguration.tcl
Parse Config file:/home/goutam/third_party_libs/onnxruntime/build/Linux/Debug/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/goutam/third_party_libs/onnxruntime/build/Linux/Debug/DartConfiguration.tcl
Parse Config file:/home/goutam/third_party_libs/onnxruntime/build/Linux/Debug/DartConfiguration.tcl
Test project /home/goutam/third_party_libs/onnxruntime/build/Linux/Debug
Constructing a list of tests
Done constructing a list of tests

Am I missing something?

Also, I will upload a minimal reproducible example in a day or two. I confirm using the same onnxruntime version (v1.12.1) for python and C++ inference.

hariharans29 commented 4 months ago

To get a release flavor libonnxruntime.so, please include --build_shared_lib --config RelWithDebInfo along with the cmake define to build with node input and output debugging capabilities. If you can, please follow instructions in the link Tianlei pasted above to do a minimalistic debugging as to which node is generating the first set of NaNs once you have the minimalistic repro program.

goutamyg commented 4 months ago

Now I am using the libonnxruntime.so built from source with the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1 flag. I can see the names of the intermediate nodes, and names + shapes of their inputs and outputs. Can you please suggest how to find if any layer is causing the NaN output? Should I add something to the cmake file while building my project?

hariharans29 commented 4 months ago

You can take a look at the env variables available here (one such env variable will dump intermediate results) : https://onnxruntime.ai/docs/build/inferencing.html#debugnodeinputsoutputs. You may have to dump out the intermediate results and go through them to see which ones have the first NaNs.

goutamyg commented 4 months ago

Based on this, I set the following env variables:

    std::string debug_node_io_dump = "ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA=1";
    putenv(const_cast<char*>(debug_node_io_dump.c_str()));

    std::string debug_node_io_files = "ORT_DEBUG_NODE_IO_DUMP_DATA_TO_FILES=1";
    putenv(const_cast<char*>(debug_node_io_files.c_str()));

    std::string debug_node_io_destination = "ORT_DEBUG_NODE_IO_DUMP_DATA_DESTINATION=stdout";
    putenv(const_cast<char*>(debug_node_io_destination.c_str()));

    std::string debug_node_io_filter = "ORT_DEBUG_NODE_IO_NAME_FILTER=/backbone/conv_1/block/act_1/Relu_output_0_nchwc";
    putenv(const_cast<char*>(debug_node_io_filter.c_str()));

where backbone/conv_1/block/act_1/Relu_output_0_nchwc is an arbitrary node in the model. However, I don't see the node output being printed in the terminal.

I also tried

std::string debug_node_io_destination = "ORT_DEBUG_NODE_IO_DUMP_DATA_DESTINATION=/path/to/destination";
putenv(const_cast<char*>(debug_node_io_destination.c_str()));

but no file containing the intermediate results was saved in the destination folder.

ChatGPT says that ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA=1 will enable dumping of the output of all the nodes. Can you please confirm it? If yes, how can I save these outputs to a txt file?
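A side note on setting these variables from inside a C++ program: POSIX `putenv` does not copy its argument; the string itself becomes part of the environment, so passing the buffer of a `std::string` that is later destroyed leaves the environment pointing at freed memory. `setenv` copies both name and value. The sketch below assumes a POSIX system; `set_env_var` and `enable_ort_node_dump` are hypothetical helper names, the variable names are the ones mentioned in this thread, and calling it before any ONNX Runtime object is created is a conservative assumption rather than a documented requirement.

```cpp
#include <stdlib.h>  // setenv, getenv (setenv is POSIX, not standard C++)
#include <string>

// Sets one environment variable. setenv copies name and value,
// so the std::string arguments may safely be temporaries.
inline bool set_env_var(const std::string& name, const std::string& value) {
    return ::setenv(name.c_str(), value.c_str(), /*overwrite=*/1) == 0;
}

// Turns on dumping of node inputs and outputs for a debug build of
// ONNX Runtime. Has no effect on a build without
// onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1.
inline bool enable_ort_node_dump() {
    return set_env_var("ORT_DEBUG_NODE_IO_DUMP_INPUT_DATA", "1") &&
           set_env_var("ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA", "1");
}
```

Exporting the variables in the shell before launching the executable avoids the question of string lifetime and call order altogether.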

hariharans29 commented 4 months ago

Please take a look at some samples in the repo - like this - https://github.com/microsoft/onnxruntime/blob/1fb6cbddee6dc84f3ed720425e42cb789c361696/onnxruntime/python/tools/transformers/models/gpt2/parity_check_helper.py#L39

tianleiwu commented 4 months ago

Try running the following on Linux to dump to stdout, then redirect to a file:

# Dump the input tensors of each node.
export ORT_DEBUG_NODE_IO_DUMP_INPUT_DATA=1

# Dump the output tensors of each node.
export ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA=1

# Output statistics such as min, max, count of NaN, count of infinity, etc.
# Could be slow.
# This is optional when you enable the full dump using the next flag, since it is easy to search for 'NaN' in a full dump.
export ORT_DEBUG_NODE_IO_DUMP_STATISTICS_DATA=1

# Enable the full tensor dump. Output could be huge if your matrix is large.
# You might not need it if you enable the statistics data using the previous flag.
export ORT_DEBUG_NODE_IO_SNIPPET_THRESHOLD=0

# Run your test application and redirect stdout to a file.
python test.py > dump.txt

# Then open dump.txt, for example in Visual Studio Code.

goutamyg commented 4 months ago

@tianleiwu Thank you. Now I have the intermediate feature maps dumped to a txt file.

From my analysis, the NaN results (which first appear at some point inside the network) are caused by arbitrarily large values present in the input data. The first 16 samples in one of the inputs, as per the dumped file, are: -1.2811635e+31, 4.5708955e-41, 3.6395099e+17, 3.0736081e-41, 0, 0, 0, 0, 0.37254903, 0.36862746, 0.36470589, 0.36470589, 0.36862746, 0.36862746, 0.37254903, 0.37647063

However, when I print these input values before creating the ort tensor, I get 0.415686, 0.415686, 0.411765, 0.403922, 0.4, 0.392157, 0.384314, 0.380392, 0.372549, 0.368627, 0.364706, 0.364706, 0.368627, 0.368627, 0.372549, 0.376471

The first eight values do not match, and some of them reach magnitudes of about 10^31. This probably leads to larger values in the feature maps of subsequent layers (see the screenshot below), eventually producing NaN values.

[Screenshot from 2024-03-14 16-12-31]

Can you please confirm if this is the appropriate way to create an ort tensor from a std::vector? It is partly based on an existing implementation.

tianleiwu commented 4 months ago

That looks like a bug to me: inputTensorValues_Z is a local variable, so its memory will be released once the function returns. You will need to keep the input data alive until the inference run is done: for example, bind to blob_z instead and make sure blob_z lives long enough.

When the input/output tensors are on the CPU, you might need to call SynchronizeBoundInputs and SynchronizeBoundOutputs. https://onnxruntime.ai/docs/api/c/struct_ort_api.html#aab24784698f8abe8704cb1437583bc05
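The lifetime problem can be illustrated without ONNX Runtime itself. The `Ort::Value::CreateTensor<float>` overload that takes a data pointer wraps the caller's buffer rather than copying it, so the buffer has to outlive every `Run` call that uses the tensor. Reading memory that has already been released is undefined behavior, which would also explain why the failure was intermittent: sometimes the old contents are still there, sometimes not. In the sketch below, `TensorView` is a hypothetical stand-in for such a non-owning tensor, and `TrackerInputs` with its "template" and "search" naming is an assumption for illustration, not the reporter's actual class.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor that wraps caller-owned memory.
// It stores only a pointer and a length; it never copies the data.
struct TensorView {
    const float* data;
    std::size_t size;
};

// Owns the input buffers as members, so any view handed out stays
// valid for as long as this object lives and the buffer is not replaced.
class TrackerInputs {
public:
    void set_template(std::vector<float> values) { z_ = std::move(values); }
    void set_search(std::vector<float> values) { x_ = std::move(values); }

    // Views become invalid if the matching set_* is called again.
    TensorView template_view() const { return {z_.data(), z_.size()}; }
    TensorView search_view() const { return {x_.data(), x_.size()}; }

private:
    std::vector<float> z_;  // plays the role of inputTensorValues_Z
    std::vector<float> x_;  // plays the role of inputTensorValues_X
};
```

The buggy pattern is the opposite: a helper function fills a local `std::vector<float>`, wraps its `data()` pointer in a tensor, and returns the tensor while the vector is destroyed at the end of the function.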

goutamyg commented 4 months ago

The code works fine now that I keep the std::vectors inputTensorValues_Z and inputTensorValues_X alive for the duration of the inference run. Thank you very much @tianleiwu @hariharans29