microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Shared Session Allocator Causes Crash in Layer #19795

Open RyanRio opened 7 months ago

RyanRio commented 7 months ago

Describe the issue

onnxruntime\onnxruntime\core\mlas\lib\sgemm.cpp throws an access violation (reading location 0xFFFF...) when shared usage of a custom allocator across sessions is enabled via kOrtSessionOptionsConfigUseEnvAllocators. This happens even when only a single session has been created.

The error itself seems to be happening in MlasGemmFloatKernelFma3, but I don't have symbols loaded for that function (any help there would be appreciated; I built from source and, as far as I can tell, enabled all debug functionality).

I am following https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/test/shared_lib/test_inference.cc, and I believe I'm following it exactly. One thing I may be getting wrong: the example uses a separate MockedOrtAllocator instance for initializing the tensors, and I'm not sure why that matters. I tried that as well, though, with the same results.

I have confirmed that my custom ONNX Runtime build passes the test.

To reproduce

Here is a minimal example -

Custom allocator:

  struct CustomAllocator: public OrtAllocator {
  private:
    const OrtApi* m_ort;
    OrtMemoryInfo* memory_info;
  public:
    CustomAllocator(const OrtApi* ort) : m_ort{ ort } {
      OrtAllocator::version = ORT_API_VERSION;
      OrtAllocator::Alloc = [](OrtAllocator* this_, size_t size) {
        return static_cast<CustomAllocator*>(this_)->Alloc(size);
      };
      OrtAllocator::Free = [](OrtAllocator* this_, void* p) { static_cast<CustomAllocator*>(this_)->Free(p); };
      OrtAllocator::Info = [](const OrtAllocator* this_) { return static_cast<const CustomAllocator*>(this_)->Info(); };
      ORT_ABORT_ON_ERROR(m_ort->CreateCpuMemoryInfo(OrtDeviceAllocator, OrtMemTypeDefault, &memory_info));
    }
    void Release() {
      m_ort->ReleaseMemoryInfo(memory_info);
    }
    void* Alloc(size_t size) {
      // plain malloc: no alignment guarantee beyond the CRT default
      return ::malloc(size);
    }
    void Free(void* p) {
      ::free(p);
    }
    const OrtMemoryInfo* Info() const {
      return memory_info;
    }
  };

Env creation:

OrtThreadingOptions* tpOptions;
ORT_ABORT_ON_ERROR(m_ort->CreateThreadingOptions(&tpOptions));
ORT_ABORT_ON_ERROR(m_ort->SetGlobalInterOpNumThreads(tpOptions, 1));
ORT_ABORT_ON_ERROR(m_ort->SetGlobalIntraOpNumThreads(tpOptions, 1));
ORT_ABORT_ON_ERROR(m_ort->SetGlobalSpinControl(tpOptions, 0));
ORT_ABORT_ON_ERROR(m_ort->CreateEnvWithGlobalThreadPools(ORT_LOGGING_LEVEL_VERBOSE, "test", tpOptions, &m_env));
m_allocator = new CustomAllocator(m_ort);
ORT_ABORT_ON_ERROR(m_ort->RegisterAllocator(m_env, m_allocator));

Later... session creation and usage

// session creation
ORT_ABORT_ON_ERROR(m_ort->CreateSessionOptions(&m_session_options));
ORT_ABORT_ON_ERROR(m_ort->AddSessionConfigEntry(m_session_options, kOrtSessionOptionsConfigUseEnvAllocators, "1"));
ORT_ABORT_ON_ERROR(m_ort->DisableCpuMemArena(m_session_options));
ORT_ABORT_ON_ERROR(m_ort->DisablePerSessionThreads(m_session_options));
ORT_ABORT_ON_ERROR(m_ort->CreateSessionFromArray(m_env, data, size, m_session_options, &m_session));

// inference
int64_t in_dims[2] = {1, static_cast<int64_t>(this->m_model.input_len)};
int64_t out_dims[2] = {1, static_cast<int64_t>(this->m_model.output_len)};
OrtValue* in_tensor;
int* mutable_in_storage;
ORT_ABORT_ON_ERROR(m_ort->CreateTensorAsOrtValue(m_allocator, in_dims, 2, TypeToTensorType<int>::type, &in_tensor));
ORT_ABORT_ON_ERROR(m_ort->GetTensorMutableData(in_tensor, reinterpret_cast<void**>(&mutable_in_storage)));
memcpy(mutable_in_storage, in_elements.data(), this->m_model.input_len * sizeof(int));

OrtValue* out_tensor;
float* mutable_out_storage;
ORT_ABORT_ON_ERROR(m_ort->CreateTensorAsOrtValue(m_allocator, out_dims, 2, TypeToTensorType<float>::type, &out_tensor));
ORT_ABORT_ON_ERROR(m_ort->GetTensorMutableData(out_tensor, reinterpret_cast<void**>(&mutable_out_storage)));
memcpy(mutable_out_storage, out_elements.data(), this->m_model.output_len * sizeof(float));

ORT_ABORT_ON_ERROR(m_ort->Run(m_session, NULL, &input_name, &in_tensor, 1, &output_name, 1, &out_tensor));
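
For completeness, the matching teardown (a sketch using the same handle names as the snippets above; the Release* calls are standard OrtApi functions):

// cleanup sketch: release in roughly reverse order of creation; the env must
// be released before the custom allocator that was registered with it
m_ort->ReleaseValue(in_tensor);
m_ort->ReleaseValue(out_tensor);
m_ort->ReleaseSession(m_session);
m_ort->ReleaseSessionOptions(m_session_options);
m_ort->ReleaseEnv(m_env);
m_allocator->Release();   // frees the OrtMemoryInfo created in the constructor
delete m_allocator;
m_ort->ReleaseThreadingOptions(tpOptions);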

Urgency

No response

Platform

Windows

OS Version

10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

rel-1.16.3

ONNX Runtime API

C

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

pranavsharma commented 7 months ago

We have a test for this here. I replaced the implementation of MockedOrtAllocator with your allocator and all tests passed. Can you provide a full working example (that I can compile on my machine) along with the model file?

RyanRio commented 7 months ago

Hi Pranav, yeah, the test passes on my machine too, so fair enough that you need the full example plus the model. I'll have to reproduce this in a shareable example. I'm going to be away for ~1.5 weeks, sorry for the bad timing. If you want to close this and have me reopen it once I'm back, that's fine with me, or not! Thanks 😃

pranavsharma commented 7 months ago

I think I know what the issue is. I debugged a similar issue today with an internal team. The problem is that our math lib assumes a certain alignment, the size of which comes from the MlasGetPreferredBufferAlignment() function. The default value is 64. See this. If you simply change your malloc like this and use alignment = 64, it'll be just fine. Please try this and let me know. This is just a workaround for now. We'll work on a fix. Stay tuned.
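
Concretely, the workaround amounts to something like this inside CustomAllocator (a sketch, not the linked code; _aligned_malloc/_aligned_free are the Windows CRT aligned-allocation calls, since this report is on Windows):

    // alignment workaround sketch: MLAS's preferred alignment comes from
    // MlasGetPreferredBufferAlignment(), which defaults to 64
    void* Alloc(size_t size) {
      return ::_aligned_malloc(size, 64);   // 64-byte-aligned allocation
    }
    void Free(void* p) {
      ::_aligned_free(p);                   // must pair with _aligned_malloc
    }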

RyanRio commented 7 months ago

Will do. I've temporarily shifted gears to the approach from the linked issue, but I'll want to try both code paths for performance analysis in any case. Thanks for the temporary workaround.

RyanRio commented 5 months ago

Hi @pranavsharma this does fix it, but a quick question - when using a custom allocator like this, does m_ort->EnableCpuMemArena(m_session_options) still have any effect? I.e., does the arena just use the custom free and malloc I provide? Ideally I would still like to use the ONNX Runtime arena and provide a custom OrtArenaCfg for optimal memory usage, but have it delegate allocations.

pranavsharma commented 5 months ago

If you supply a custom allocator, the Enable... setting has no effect.
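
For reference, if the goal is an env-shared arena with a custom OrtArenaCfg (rather than delegating to your own malloc), the C API's CreateAndRegisterAllocator path lets ORT create the arena-backed allocator itself. A sketch, reusing the m_ort/m_env names from the snippets above; note that this arena allocates through ORT's own CPU allocator, not a user-supplied one:

// let ORT create and register an arena-backed CPU allocator on the env;
// the keys below are documented BFC arena config keys (cf. the BFCArena
// log output later in this thread)
OrtMemoryInfo* cpu_mem_info;
ORT_ABORT_ON_ERROR(m_ort->CreateCpuMemoryInfo(OrtArenaAllocator, OrtMemTypeDefault, &cpu_mem_info));

const char* keys[] = {"arena_extend_strategy", "initial_chunk_size_bytes"};
const size_t values[] = {1, 1048576};
OrtArenaCfg* arena_cfg;
ORT_ABORT_ON_ERROR(m_ort->CreateArenaCfgV2(keys, values, 2, &arena_cfg));

ORT_ABORT_ON_ERROR(m_ort->CreateAndRegisterAllocator(m_env, cpu_mem_info, arena_cfg));
m_ort->ReleaseMemoryInfo(cpu_mem_info);
// (release of arena_cfg omitted in this sketch; check ownership semantics in the API docs)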

RyanRio commented 5 months ago

When I call EnableCpuMemArena I see this -

2024-04-25 21:53:14.0928601 [I:onnxruntime:test, bfc_arena.cc:29 onnxruntime::BFCArena::BFCArena] Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
2024-04-25 21:53:14.1031188 [V:onnxruntime:test, bfc_arena.cc:66 onnxruntime::BFCArena::BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2024-04-25 21:53:14.1068891 [I:onnxruntime:, inference_session.cc:1476 onnxruntime::InferenceSession::Initialize] This session will use the allocator registered with the environment.

and when I disable it with DisableCpuMemArena those first two lines aren't there. It seems like, at the very least, the arena should be disabled so it doesn't wastefully create anything? (And then in both cases I later see the same allocations going to my custom allocator.)