microsoft / onnxruntime


[Web] Memory access out of bounds / alignment fault #21355

Open cmario opened 2 months ago

cmario commented 2 months ago

Describe the issue

Hello,

I am exploring the use of ONNX, with a particular focus on the ORT model format for web applications. I developed a basic WASM module to perform inference using a UNET-like semantic segmentation model. However, the inference process throws an exception, which I have detailed below. Please note that the same code runs without issues outside of the WASM module.

I built the ONNX runtime for web with the following command:

./build.sh --config Release --build_wasm_static_lib --minimal_build --skip_tests --disable_wasm_exception_catching --disable_rtti

I built the WASM module with the following command:

emcc -g myModule.cpp -o myModule.js -I<opencv headers> -I<onnxruntime headers> -L<opencv lib> -L<onnxruntime lib> -lopencv_core -lopencv_imgproc -lonnxruntime_webassembly -s INITIAL_MEMORY=256MB -s EXPORTED_FUNCTIONS="['_processImage', '_malloc', '_free']" -s EXPORTED_RUNTIME_METHODS="['ccall', 'cwrap']" -s SAFE_HEAP=0 --bind

When running the inference I get the following error:

2024-07-15 18:48:24.453600 [I:onnxruntime:, inference_session.cc:514 TraceSessionOptions] Session Options {  execution_mode:0 execution_order:DEFAULT enable_profiling:0 optimized_model_filepath: enable_mem_pattern:1 enable_mem_reuse:1 enable_cpu_mem_arena:1 profile_file_prefix:onnxruntime_profile_ session_logid: session_log_severity_level:-1 session_log_verbosity_level:0 max_num_graph_transformation_steps:10 graph_optimization_level:2 intra_op_param:OrtThreadPoolParams { thread_pool_size: 1 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str:  set_denormal_as_zero: 0 } inter_op_param:OrtThreadPoolParams { thread_pool_size: 0 auto_set_affinity: 0 allow_spinning: 1 dynamic_block_base_: 0 stack_size: 0 affinity_str:  set_denormal_as_zero: 0 } use_per_session_threads:1 thread_pool_allow_spinning:1 use_deterministic_compute:0 config_options: {  } }
2024-07-15 18:48:24.454500 [I:onnxruntime:, inference_session.cc:414 operator()] Flush-to-zero and denormal-as-zero are off
2024-07-15 18:48:24.454600 [I:onnxruntime:, inference_session.cc:422 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2024-07-15 18:48:24.454800 [I:onnxruntime:, inference_session.cc:440 ConstructorCommon] Dynamic block base set to 0
2024-07-15 18:48:24.462000 [I:onnxruntime:, inference_session.cc:1583 Initialize] Initializing session.
2024-07-15 18:48:24.462100 [I:onnxruntime:, inference_session.cc:1620 Initialize] Adding default CPU execution provider.
2024-07-15 18:48:24.485000 [V:onnxruntime:, session_state.cc:126 CreateGraphInfo] SaveMLValueNameIndexMapping
2024-07-15 18:48:24.485500 [V:onnxruntime:, session_state.cc:172 CreateGraphInfo] Done saving OrtValue mappings.
2024-07-15 18:48:24.488600 [I:onnxruntime:, session_state_utils.cc:201 SaveInitializedTensors] Saving initialized tensors.
2024-07-15 18:48:24.489500 [I:onnxruntime:, session_state_utils.cc:345 SaveInitializedTensors] Done saving initialized tensors
2024-07-15 18:48:24.491300 [I:onnxruntime:, inference_session.cc:1969 Initialize] Session successfully initialized.

With SAFE_HEAP=0:

RuntimeError: memory access out of bounds
    at myModule.wasm.MlasSgemmOperation(CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, unsigned long, unsigned long, unsigned long, float, float const*, unsigned long, float const*, unsigned long, float, float*, unsigned long) (http://localhost:8000/myModule.wasm:wasm-function[9520]:0x24bf4a)
    at myModule.wasm.MlasConvOperation(MLAS_CONV_PARAMETERS const*, float const*, float const*, float const*, float*, float*, unsigned long, unsigned long) (http://localhost:8000/myModule.wasm:wasm-function[10013]:0x2921cc)
    at myModule.wasm.MlasConv(MLAS_CONV_PARAMETERS const*, float const*, float const*, float const*, float*, float*, onnxruntime::concurrency::ThreadPool*) (http://localhost:8000/myModule.wasm:wasm-function[9978]:0x28c58a)
    at myModule.wasm.onnxruntime::Conv<float>::Compute(onnxruntime::OpKernelContext*) const (http://localhost:8000/myModule.wasm:wasm-function[9965]:0x288f4c)
    at myModule.wasm.onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) (http://localhost:8000/myModule.wasm:wasm-function[7213]:0x1a3e82)
    at myModule.wasm.onnxruntime::RunSince(unsigned long, onnxruntime::StreamExecutionContext&, onnxruntime::SessionScope&, bool const&, unsigned long) (http://localhost:8000/myModule.wasm:wasm-function[7221]:0x1a6931)
    at myModule.wasm.onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 4294967295ul>, gsl::span<OrtValue const, 4294967295ul>, gsl::span<int const, 4294967295ul>, std::__2::vector<OrtValue, std::__2::allocator<OrtValue>>&, std::__2::unordered_map<unsigned long, std::__2::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__2::hash<unsigned long>, std::__2::equal_to<unsigned long>, std::__2::allocator<std::__2::pair<unsigned long const, std::__2::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, bool const&, bool, bool) (http://localhost:8000/myModule.wasm:wasm-function[6719]:0x152035)
    at myModule.wasm.onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, gsl::span<OrtValue const, 4294967295ul>, std::__2::vector<OrtValue, std::__2::allocator<OrtValue>>&, std::__2::unordered_map<unsigned long, std::__2::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__2::hash<unsigned long>, std::__2::equal_to<unsigned long>, std::__2::allocator<std::__2::pair<unsigned long const, std::__2::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool, onnxruntime::Stream*) (http://localhost:8000/myModule.wasm:wasm-function[6718]:0x14f436)
    at myModule.wasm.onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const, 4294967295ul>, gsl::span<OrtValue const, 4294967295ul>, gsl::span<std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const, 4294967295ul>, std::__2::vector<OrtValue, std::__2::allocator<OrtValue>>*, std::__2::vector<OrtDevice, std::__2::allocator<OrtDevice>> const*) (http://localhost:8000/myModule.wasm:wasm-function[17805]:0x6f648b)
    at myModule.wasm.onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<char const* const, 4294967295ul>, gsl::span<OrtValue const* const, 4294967295ul>, gsl::span<char const* const, 4294967295ul>, gsl::span<OrtValue*, 4294967295ul>) (http://localhost:8000/myModule.wasm:wasm-function[5392]:0xf1758)

With SAFE_HEAP=1:

RuntimeError: Aborted(alignment fault)
    at abort (http://localhost:8000/myModule.js:625:41)
    at alignfault (http://localhost:8000/myModule.js:354:3)
    at myModule.wasm (http://localhost:8000/myModule.wasm:wasm-function[17477]:0x862651)
    at myModule.wasm.MlasConvIm2Col(MLAS_CONV_PARAMETERS const*, float const*, float*, unsigned long, unsigned long, unsigned long, unsigned long) (http://localhost:8000/myModule.wasm:wasm-function[9072]:0x2e0492)
    at myModule.wasm.MlasConvOperation(MLAS_CONV_PARAMETERS const*, float const*, float const*, float const*, float*, float*, unsigned long, unsigned long) (http://localhost:8000/myModule.wasm:wasm-function[9074]:0x2e119c)
    at myModule.wasm.MlasConv(MLAS_CONV_PARAMETERS const*, float const*, float const*, float const*, float*, float*, onnxruntime::concurrency::ThreadPool*) (http://localhost:8000/myModule.wasm:wasm-function[9039]:0x2da858)
    at myModule.wasm.onnxruntime::Conv<float>::Compute(onnxruntime::OpKernelContext*) const (http://localhost:8000/myModule.wasm:wasm-function[9026]:0x2d67b0)
    at myModule.wasm.onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) (http://localhost:8000/myModule.wasm:wasm-function[6274]:0x1c14f5)
    at myModule.wasm.onnxruntime::RunSince(unsigned long, onnxruntime::StreamExecutionContext&, onnxruntime::SessionScope&, bool const&, unsigned long) (http://localhost:8000/myModule.wasm:wasm-function[6282]:0x1c49bf)
    at myModule.wasm.onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 4294967295ul>, gsl::span<OrtValue const, 4294967295ul>, gsl::span<int const, 4294967295ul>, std::__2::vector<OrtValue, std::__2::allocator<OrtValue>>&, std::__2::unordered_map<unsigned long, std::__2::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__2::hash<unsigned long>, std::__2::equal_to<unsigned long>, std::__2::allocator<std::__2::pair<unsigned long const, std::__2::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, bool const&, bool, bool) (http://localhost:8000/myModule.wasm:wasm-function[5780]:0x15fa3a)

Best regards, Mario

To reproduce

Here is the code I used to test the ORT model:

extern "C" {
EMSCRIPTEN_KEEPALIVE
void processImage(const uint8_t* inputImageData, size_t inputImageDataSize, uint8_t* outputImageData, int width, int height) {
    cv::Mat image(height, width, CV_8UC4, const_cast<uint8_t*>(inputImageData));
    cv::Mat rgbImage;
    cv::cvtColor(image, rgbImage, cv::COLOR_BGRA2RGB);
    cv::Mat resizedImage;
    cv::resize(rgbImage, resizedImage, cv::Size(256, 256), 0, 0, cv::INTER_AREA);
    cv::Mat f32Image;
    resizedImage.convertTo(f32Image, CV_32F, 1.0 / 255);
    //
    std::vector<float> inputData;
    inputData.assign((float *) f32Image.datastart, (float *) f32Image.dataend);
    //
    Ort::SessionOptions session_options;
    session_options.SetIntraOpNumThreads(1);
    session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
    // Decode the base64 model
    std::vector<uint8_t> model_data = base64_decode(base64_model);
    // Load the model from memory
    Ort::Env env(ORT_LOGGING_LEVEL_VERBOSE, "test");
    Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault);
    Ort::AllocatorWithDefaultOptions allocator;
    Ort::Session session(env, model_data.data(), model_data.size(), session_options);
    // input tensor
    std::vector<int64_t> inputShape = {1, 256, 256, 3};
    Ort::Value inputTensor = Ort::Value::CreateTensor<float>(memory_info, inputData.data(), inputData.size(),
                                                             inputShape.data(), inputShape.size());
    // output tensor
    std::vector<float> outputData(256 * 256 * 4);
    std::vector<int64_t> outputShape = {1, 256, 256, 1};
    Ort::Value outputTensor = Ort::Value::CreateTensor<float>(memory_info,
                                                              outputData.data(), outputData.size(),
                                                              outputShape.data(), outputShape.size());

    auto input_name_alloc = session.GetInputNameAllocated(0, allocator);
    const char *input_name = input_name_alloc.get();
    auto output_name_alloc = session.GetOutputNameAllocated(0, allocator);
    const char *output_name = output_name_alloc.get();

    // Run inference
    session.Run(Ort::RunOptions{nullptr}, &input_name, &inputTensor, 1, &output_name, &outputTensor, 1);

    // Process output tensor
    auto *float_array = outputTensor.GetTensorMutableData<float>();

    // Convert the output tensor to cv::Mat
    cv::Mat outputImg(256, 256, CV_32FC1, float_array);
    outputImg.convertTo(outputImg, CV_8UC1, 255.0);
    cv::resize(outputImg, outputImg, cv::Size(width, height));
    // Copy the output image to the outputImageData buffer
    std::memcpy(outputImageData, outputImg.data, width * height);
}
}
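As a side note, the output buffer size above is hard-coded; a small sketch of deriving it from the model's declared output shape instead, via the C++ API (this assumes output 0 has fully static dimensions):

// Sketch: size the output buffer from the model metadata rather than by hand.
Ort::TypeInfo output_info = session.GetOutputTypeInfo(0);
auto tensor_info = output_info.GetTensorTypeAndShapeInfo();
std::vector<int64_t> outputShape = tensor_info.GetShape();     // e.g. {1, 256, 256, 1}
std::vector<float> outputData(tensor_info.GetElementCount());  // matches the shape exactly
Ort::Value outputTensor = Ort::Value::CreateTensor<float>(memory_info,
                                                          outputData.data(), outputData.size(),
                                                          outputShape.data(), outputShape.size());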

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

v1.17.1

Execution Provider

'wasm'/'cpu' (WebAssembly CPU)

YoniGBinahAi commented 1 month ago

I have the same issue with version 1.16.3 and emsdk 3.1.44. My build command is:

./build.sh --config Debug --enable_wasm_simd --emsdk_version=3.1.44 --build_wasm_static_lib --enable_wasm_exception_throwing_override --enable_wasm_threads --enable_wasm_api_exception_catching --skip_tests

Error: [screenshot attached]

YoniGBinahAi commented 1 month ago

@cmario - we have noticed that the function MlasSgemmOperation consumes a lot of stack memory. Since WASM by default allocates only 5MB for the stack, it fails there. You can try adding the following flag and see if it solves your issue; it helped us: -s TOTAL_STACK=10MB
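For reference, the emcc command from the original post with the stack flag appended would look like this (everything else unchanged; newer Emscripten versions also accept -s STACK_SIZE for the same setting):

emcc -g myModule.cpp -o myModule.js -I<opencv headers> -I<onnxruntime headers> -L<opencv lib> -L<onnxruntime lib> -lopencv_core -lopencv_imgproc -lonnxruntime_webassembly -s INITIAL_MEMORY=256MB -s TOTAL_STACK=10MB -s EXPORTED_FUNCTIONS="['_processImage', '_malloc', '_free']" -s EXPORTED_RUNTIME_METHODS="['ccall', 'cwrap']" -s SAFE_HEAP=0 --bind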

cmario commented 1 month ago

@YoniGBinahAi Thank you very much for your feedback; increasing the total stack to 10MB resolves the issue.

cmario commented 1 week ago

Hi @YoniGBinahAi, I noticed that your build command includes thread support. Are you actually enabling ORT's multi-threading mode? In my case, enabling multi-threading always results in the following error:

RuntimeError: table index is out of bounds
    at myModule.wasm.onnxruntime::concurrency::ThreadPool::DegreeOfParallelism(onnxruntime::concurrency::ThreadPool const*) (https://localhost:8000/myModule.wasm:wasm-function[7272]:0x2a4f94)
    at myModule.wasm.MlasConvPrepare(MLAS_CONV_PARAMETERS*, unsigned long, unsigned long, unsigned long, unsigned long, long long const*, long long const*, long long const*, long long const*, long long const*, long long const*, unsigned long, MLAS_ACTIVATION const*, unsigned long*, float, onnxruntime::concurrency::ThreadPool*) (https://localhost:8000/myModule.wasm:wasm-function[11360]:0x46ba65)
    at myModule.wasm.onnxruntime::Conv<float>::Compute(onnxruntime::OpKernelContext*) const (https://localhost:8000/myModule.wasm:wasm-function[11350]:0x4684de)
    at myModule.wasm.onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) (https://localhost:8000/myModule.wasm:wasm-function[8572]:0x33e190)
    at myModule.wasm.onnxruntime::RunSince(unsigned long, onnxruntime::StreamExecutionContext&, onnxruntime::SessionScope&, bool const&, unsigned long) (https://localhost:8000/myModule.wasm:wasm-function[8581]:0x3418af)
    at myModule.wasm.onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 4294967295ul>, gsl::span<OrtValue const, 4294967295ul>, gsl::span<int const, 4294967295ul>, std::__2::vector<OrtValue, std::__2::allocator<OrtValue>>&, std::__2::unordered_map<unsigned long, std::__2::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__2::hash<unsigned long>, std::__2::equal_to<unsigned long>, std::__2::allocator<std::__2::pair<unsigned long const, std::__2::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, onnxruntime::logging::Logger const&, bool const&, bool, bool) (https://localhost:8000/myModule.wasm:wasm-function[8141]:0x2d9b2a)
    at myModule.wasm.onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, gsl::span<OrtValue const, 4294967295ul>, std::__2::vector<OrtValue, std::__2::allocator<OrtValue>>&, std::__2::unordered_map<unsigned long, std::__2::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::__2::hash<unsigned long>, std::__2::equal_to<unsigned long>, std::__2::allocator<std::__2::pair<unsigned long const, std::__2::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>>>> const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool, onnxruntime::Stream*) (https://localhost:8000/myModule.wasm:wasm-function[8140]:0x2d704f)
    at myModule.wasm.onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 4294967295ul>, std::__2::vector<OrtValue, std::__2::allocator<OrtValue>>&, ExecutionMode, OrtRunOptions const&, onnxruntime::logging::Logger const&) (https://localhost:8000/myModule.wasm:wasm-function[8145]:0x2db33c)
    at myModule.wasm.onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const, 4294967295ul>, gsl::span<OrtValue const, 4294967295ul>, gsl::span<std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const, 4294967295ul>, std::__2::vector<OrtValue, std::__2::allocator<OrtValue>>*, std::__2::vector<OrtDevice, std::__2::allocator<OrtDevice>> const*) (https://localhost:8000/myModule.wasm:wasm-function[19658]:0xa5a58a)
    at myModule.wasm.onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<char const* const, 4294967295ul>, gsl::span<OrtValue const* const, 4294967295ul>, gsl::span<char const* const, 4294967295ul>, gsl::span<OrtValue*, 4294967295ul>) (https://localhost:8000/myModule.wasm:wasm-function[7098]:0x288eaa)

Here is how I create the session:

int num_threads = 2;
OrtThreadingOptions *tp_options;
Ort::GetApi().CreateThreadingOptions(&tp_options);
Ort::GetApi().SetGlobalIntraOpNumThreads(tp_options, num_threads);
Ort::GetApi().SetGlobalInterOpNumThreads(tp_options, 1);
Ort::GetApi().SetGlobalSpinControl(tp_options, 0);
OrtEnv *g_env;
Ort::GetApi().CreateEnvWithGlobalThreadPools(ORT_LOGGING_LEVEL_WARNING, "Default", tp_options, &g_env);

Ort::SessionOptions sessionOptions;
sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
sessionOptions.DisableCpuMemArena();
sessionOptions.DisableMemPattern();
sessionOptions.DisableProfiling();
sessionOptions.DisablePerSessionThreads();
sessionOptions.SetExecutionMode(ORT_SEQUENTIAL);
sessionOptions.SetIntraOpNumThreads(num_threads);
sessionOptions.SetInterOpNumThreads(1);
sessionOptions.AddConfigEntry("session.load_model_format", "ORT");
sessionOptions.AddConfigEntry("session.use_ort_model_bytes_directly", "1");
sessionOptions.AddConfigEntry("session.intra_op.allow_spinning", "0");
sessionOptions.AddConfigEntry("session.inter_op.allow_spinning", "0");

// Note: Ort::Env(g_env) constructs a temporary wrapper that takes ownership of
// g_env and releases the underlying OrtEnv (together with its global thread
// pools) as soon as this statement completes.
session_ = Ort::Session(Ort::Env(g_env), modelData.data(), modelData.size(), sessionOptions);

It works fine when I set num_threads to 1.
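For reference, a minimal sketch of the same global-thread-pool setup written through the C++ wrapper, so that the env (and the thread pools it owns) explicitly outlives the session; this assumes the standard onnxruntime_cxx_api.h header and is not verified against the WASM build:

// Sketch: keep the env in a named object; the wrapper releases the OrtEnv (and
// the global thread pools) only when 'env' itself is destroyed.
OrtThreadingOptions *tp_options = nullptr;
Ort::ThrowOnError(Ort::GetApi().CreateThreadingOptions(&tp_options));
Ort::ThrowOnError(Ort::GetApi().SetGlobalIntraOpNumThreads(tp_options, 2));
Ort::ThrowOnError(Ort::GetApi().SetGlobalInterOpNumThreads(tp_options, 1));
Ort::ThrowOnError(Ort::GetApi().SetGlobalSpinControl(tp_options, 0));
static Ort::Env env(tp_options, ORT_LOGGING_LEVEL_WARNING, "Default");
Ort::GetApi().ReleaseThreadingOptions(tp_options);
// ... configure sessionOptions as above, then:
session_ = Ort::Session(env, modelData.data(), modelData.size(), sessionOptions);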

I’d appreciate any feedback.

Thank you, Mario