microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[ onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running FusedMatMul node. Name:'MatMul_With_Transpose_token_14_FusedMatMulAndScale' Status Message: bad allocation unknown file: error: C++ exception with description "Non-zero status code returned while running FusedMatMul node. Name:'MatMul_With_Transpose_token_14_FusedMatMulAndScale' Status Message: bad allocation" thrown in the test body. #16305

Open mlruns opened 1 year ago

mlruns commented 1 year ago

Describe the issue

[ onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running FusedMatMul node. Name:'MatMul_With_Transpose_token_14_FusedMatMulAndScale' Status Message: bad allocation unknown file: error: C++ exception with description "Non-zero status code returned while running FusedMatMul node. Name:'MatMul_With_Transpose_token_14_FusedMatMulAndScale' Status Message: bad allocation" thrown in the test body.

To reproduce

encoder_session->Run(run_options, inputNames, &inputTensor, 1, outputNames, &outputTensorPre, 1);

encoder_session->Run is crashing and returning that error.

Urgency

No response

Platform

Windows

OS Version

10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.10.0

ONNX Runtime API

Python

Architecture

Other / Unknown

Execution Provider

Default CPU

Execution Provider Library Version

CUDA 11.7

yuslepukhin commented 1 year ago

Please ensure correct usage of the ORT API: when you create tensors, make sure you pass the buffer length either in number of elements or in bytes, as documented for the overload you call (a common mistake).

It does not look like it is crashing; my understanding is that Run() returns an error. You are running out of memory. If you are running on CPU, you may want to disable the memory arena; you only need it for GPU.
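For illustration, a minimal sketch of both points using only the public C++ API (the shape, buffer, and names below are placeholders, not taken from the reporter's model): the typed Ort::Value::CreateTensor<T> overload takes the buffer length in elements, not bytes, and the arena can be turned off through Ort::SessionOptions::DisableCpuMemArena().

    // Sketch only: shape and buffer are placeholders.
    Ort::SessionOptions session_options;
    session_options.DisableCpuMemArena();                 // CPU-only run: skip the arena allocator

    std::vector<int64_t> shape{ 1, 3, 1024, 1024 };
    std::vector<float> data(1 * 3 * 1024 * 1024);         // 3,145,728 elements

    Ort::MemoryInfo mem_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    // The typed overload expects the length in ELEMENTS, and it should match the product of the shape dims.
    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        mem_info, data.data(), data.size(),               // data.size(), not data.size() * sizeof(float)
        shape.data(), shape.size());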

mlruns commented 1 year ago

Thanks for your response. As I am running on CPU, I tried disabling the memory arena in the session options. I got this error:

Message:

2023-06-11 23:53:14.1898754 [ onnxruntime::SequentialExecutor::Execute] Non-zero status code returned while running DynamicQuantizeMatMul node. Name:'/image_encoder/2/attn/MatMul_quant' Status Message: bad allocation unknown file: error: C++ exception with description "Non-zero status code returned while running DynamicQuantizeMatMul node. Name:'/image_encoder/2/attn/MatMul_quant' Status Message: bad allocation" thrown in the test body.

Stack Trace: sequential_executor.cc:368


std::vector<SEGMENT_RESULT> run_SAM_ONNX_model_on_image(
    const SharedClasses::CLynxImage& image, const std::string& encoder_ONNX_filename,
    const std::string& decoder_ONNX_filename, int model_in_x, int model_in_y, RECT bbox, int cls)
{

    //std::vector<SEGMENT_RESULT> output;

    //cv::setNumThreads(0);

    // We use ORT_API_MANUAL_INIT to allow for delay-loading the OnnxRuntime dll.
    // It's unclear whether it's safe to just blindly call InitApi() every time it might be required;
    // for now, test the (private) global api_ pointer to make sure.
    if (!Ort::Global<void>::api_)
    {
        Ort::InitApi();
    }

    SharedClasses::CLynxImage working_copy;
    image.copyTo(working_copy);
    cv::Mat cv_image;
    link_lynx_to_CV_mat(working_copy, cv_image);
    cv::cvtColor(cv_image, cv_image, cv::COLOR_GRAY2RGB);
    int EncoderInputSize = 1024;
    cv::Mat resized_image = ResizeLongestSide_apply_image(cv_image, EncoderInputSize);

    int pad_h = EncoderInputSize - resized_image.rows;
    int pad_w = EncoderInputSize - resized_image.cols;

    cv::Mat padded_image;
    cv::copyMakeBorder(resized_image, padded_image, 0, pad_h, 0, pad_w, cv::BorderTypes::BORDER_CONSTANT, cv::Scalar(0, 0, 0));

    std::vector<SEGMENT_RESULT> output;

    // setting up onnxruntime env
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "example-model-explorer");

    //std::vector<int64_t> EncoderOutputShape, EncoderInputShape;
    Ort::AllocatorWithDefaultOptions allocator;
    Ort::MemoryInfo memory_info_handler = Ort::MemoryInfo::CreateCpu(
        OrtArenaAllocator, OrtMemTypeDefault
    );

#ifdef ORTCHAR_T
    std::basic_string<ORTCHAR_T> encoder_model_file = std::basic_string<ORTCHAR_T>(encoder_ONNX_filename.begin(), encoder_ONNX_filename.end());
    std::basic_string<ORTCHAR_T> decoder_model_file = std::basic_string<ORTCHAR_T>(decoder_ONNX_filename.begin(), decoder_ONNX_filename.end());

#else
    auto& encoder_model_file = encoder_ONNX_filename;
    auto& decoder_model_file = decoder_ONNX_filename;
#endif
    Ort::SessionOptions session_options;
    session_options.SetInterOpNumThreads(1).SetIntraOpNumThreads(1);
    session_options.DisableCpuMemArena();
    //session_options.SetInterOpNumThreads(std::thread::hardware_concurrency());
    //session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    std::unique_ptr <Ort::Session> encoder_session = std::make_unique <Ort::Session>(env, encoder_model_file.c_str(), session_options);
    if (encoder_session->GetInputCount() != 1 || encoder_session->GetOutputCount() != 1) {
        return output;  // unexpected number of encoder inputs/outputs
    }
    auto EncoderOutputShape = encoder_session->GetOutputTypeInfo(0).GetTensorTypeAndShapeInfo().GetShape();
    auto EncoderInputShape = encoder_session->GetInputTypeInfo(0).GetTensorTypeAndShapeInfo().GetShape();
    //EncoderInputShape = std::vector<int64_t>{1,3, 1024, 1024 };
    //auto EncoderInputShape = encoder_session.GetInputTypeInfo(0).GetTensorTypeAndShapeInfo().GetShape();
    // resize before blob for python/c++ reproducibility

    Ort::Session decoder_session = Ort::Session(env, decoder_model_file.c_str(), session_options);

    //std::vector<uint8_t> inputTensorValues(EncoderInputShape[0] * EncoderInputShape[1] * EncoderInputShape[2] *
    //    EncoderInputShape[3]);

    if (padded_image.size() != cv::Size(EncoderInputShape[3], EncoderInputShape[2])) {
        //std::cerr << "Image size not match" << std::endl;
        //std::cout << "Image width : " << Image.cols << " Image height : " << Image.rows << std::endl;

        //return output;
    }
    if (padded_image.channels() != 3) {
        //std::cerr << "Input image is not a 3-channel image" << std::endl;
        //return output;
    }

    auto blob = cv::dnn::blobFromImage(padded_image, 1.0, cv::Size(EncoderInputShape[3], EncoderInputShape[2]));
    std::vector<uint8_t> inputTensorValues(blob.total());
    inputTensorValues.assign((uint8_t*)blob.data, (uint8_t*)blob.data + blob.total() * blob.channels());

    std::vector<Ort::Value> inputTensor;
    Ort::MemoryInfo memoryInfo = Ort::MemoryInfo::CreateCpu(
        OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault);
    inputTensor.push_back(Ort::Value::CreateTensor<uint8_t>(memoryInfo, inputTensorValues.data(), inputTensorValues.size(), EncoderInputShape.data(), EncoderInputShape.size()));
    std::vector<float> image_embedding(EncoderOutputShape[0] * EncoderOutputShape[1] * EncoderOutputShape[2] * EncoderOutputShape[3]);

    auto outputTensorPre = Ort::Value::CreateTensor<float>(
        memory_info_handler, image_embedding.data(), image_embedding.size(),
        EncoderOutputShape.data(), EncoderOutputShape.size());
    assert(outputTensorPre.IsTensor() && outputTensorPre.HasValue());

    //const char* inputNamesPre[] = { ""}, * outputNamesPre[] = {"output"};

    auto* inputName = encoder_session->GetInputName(0, allocator);
    auto* outputName = encoder_session->GetOutputName(0, allocator);

    const char* inputNames[] = { inputName };
    const char* outputNames[] = { outputName };

    Ort::RunOptions run_options;
    run_options.SetRunLogVerbosityLevel(1);
    //encoder_session->Run(run_options, inputNames, &inputTensor, 1, outputNames, &outputTensorPre,
      //  1);
    auto output_tensors = encoder_session->Run(Ort::RunOptions{ nullptr }, inputNames, inputTensor.data(), inputTensor.size(), outputNames, 1);

    //running decoder session

    const char* DecoderInputNames[6]{ "image_embeddings", "point_coords",   "point_labels",
                             "mask_input", "has_mask_input", "orig_im_size" },
        * DecoderOutputNames[3]{ "masks", "iou_predictions", "low_res_masks" };

    // point_coords has shape {1, 3, 2} and point_labels {1, 3}, so the buffers must hold
    // 6 and 3 floats; the extra slots are padding (SAM treats label -1 as a padding point).
    float inputPointsValues[] = { (float)bbox.left, (float)bbox.right,
                                  (float)bbox.top, (float)bbox.bottom, 0.0f, 0.0f };
    float inputLabelsValues[] = { (float)cls, -1.0f, -1.0f };

    const size_t maskInputSize = 256 * 256;

    float maskInputValues[maskInputSize], hasMaskValues[] = { 0 },
        orig_im_size_values[] = { (float)cv_image.rows, (float)cv_image.cols };

    memset(maskInputValues, 0, sizeof(maskInputValues));

    std::vector<int64_t> inputPointShape = { 1, 3, 2 }, pointLabelsShape = { 1, 3 },
        maskInputShape = { 1, 1, 256, 256 }, hasMaskInputShape = { 1 },
        origImSizeShape = { 2 };

    std::vector<Ort::Value> inputTensorsSam;
    inputTensorsSam.push_back(Ort::Value::CreateTensor<float>(
        memory_info_handler, (float*)image_embedding.data(), image_embedding.size(),
        EncoderOutputShape.data(), EncoderOutputShape.size()));
    inputTensorsSam.push_back(Ort::Value::CreateTensor<float>(
        memory_info_handler, inputPointsValues, 2 * 3, inputPointShape.data(), inputPointShape.size()));
    inputTensorsSam.push_back(Ort::Value::CreateTensor<float>(
        memory_info_handler, inputLabelsValues, 1 * 3, pointLabelsShape.data(), pointLabelsShape.size()));

    inputTensorsSam.push_back(Ort::Value::CreateTensor<float>(
        memory_info_handler, maskInputValues, maskInputSize, maskInputShape.data(), maskInputShape.size()));
    inputTensorsSam.push_back(Ort::Value::CreateTensor<float>(
        memory_info_handler, hasMaskValues, 1, hasMaskInputShape.data(), hasMaskInputShape.size()));
    inputTensorsSam.push_back(Ort::Value::CreateTensor<float>(
        memory_info_handler, orig_im_size_values, 2, origImSizeShape.data(), origImSizeShape.size()));

    Ort::RunOptions runOptionsSam;

    auto DecoderOutputTensors = decoder_session.Run(runOptionsSam, DecoderInputNames, inputTensorsSam.data(),
        inputTensorsSam.size(), DecoderOutputNames, 3);

    auto masks = DecoderOutputTensors[0].GetTensorMutableData<float>();
    auto iou_predictions = DecoderOutputTensors[1].GetTensorMutableData<float>();
    auto low_res_masks = DecoderOutputTensors[2].GetTensorMutableData<float>();

    Ort::Value& masks_ = DecoderOutputTensors[0];
    Ort::Value& iou_predictions_ = DecoderOutputTensors[1];
    Ort::Value& low_res_masks_ = DecoderOutputTensors[2];

    auto mask_dims = masks_.GetTypeInfo().GetTensorTypeAndShapeInfo().GetShape();
    auto iou_pred_dims = iou_predictions_.GetTypeInfo().GetTensorTypeAndShapeInfo().GetShape();
    auto low_res_dims = low_res_masks_.GetTypeInfo().GetTensorTypeAndShapeInfo().GetShape();

    const unsigned int Resizemasks_batch = mask_dims.at(0);
    const unsigned int Resizemasks_nums = mask_dims.at(1);
    const unsigned int Resizemasks_width = mask_dims.at(2);
    const unsigned int Resizemasks_height = mask_dims.at(3);

    //std::vector<SEGMENT_RESULT> output;
    for (unsigned int index = 0; index < Resizemasks_nums; index++)
    {
        //cv::Mat mask(cv_image.rows, cv_image.cols, CV_8UC1);
        std::vector<std::vector<unsigned char>> mask(cv_image.rows, std::vector<unsigned char>(cv_image.cols));

        for (unsigned int i = 0; i < cv_image.rows; i++)
        {
            for (unsigned int j = 0; j < cv_image.cols; j++)
            {

                mask[i][j] = masks[i * cv_image.cols + j + index * cv_image.rows * cv_image.cols] > 0 ? 255 : 0;
            }
        }
        SEGMENT_RESULT mat_info;
        mat_info.mask = mask;
        mat_info.iou_pred = *(iou_predictions++);
        output.emplace_back(mat_info);
    }
    return output;
}
yuslepukhin commented 1 year ago

ORT had a size-calculation overflow bug in one of the scenarios; you may want to try your code with 1.15.1, which was just released.

yuslepukhin commented 1 year ago

From the usage perspective, I am curious to know what makes you allocate Ort::Session on the heap while everything else is on the stack.

guschin225 commented 1 year ago

ORT had a size-calculation overflow bug in one of the scenarios; you may want to try your code with 1.15.1, which was just released.

May I ask which commit (with the fix for the bug you describe) you are referring to here, or at least which version of ONNX Runtime the fix was in? I also have an issue that seems related (I have not submitted it just yet; the gist is that SafeInt overflows, very rarely, when computing the shape of a tensor; I am on ONNX Runtime 1.13.1). Knowing the commit with the fix would let me know whether the issue I am looking at has already been resolved. Thanks a lot in advance.
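For context, a rough sketch of what that overflow looks like (illustrative only, not ORT source; SafeInt.hpp is the header-only library ORT uses for checked size arithmetic): the element-count multiplication throws instead of silently wrapping around.

    // Illustrative sketch of overflow-checked element-count math of the kind ORT performs internally.
    #include "SafeInt.hpp"
    #include <cstdint>
    #include <vector>

    size_t checked_element_count(const std::vector<int64_t>& dims)
    {
        SafeInt<size_t> total(1);
        for (int64_t d : dims)
            total *= static_cast<size_t>(d);   // throws SafeIntException when the product overflows size_t
        return total;                          // implicit conversion back to size_t
    }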

guschin225 commented 1 year ago

@yuslepukhin One more bit of data related to the issue I am having, which might be the same issue the original reporter is having: once the SafeInt overflow has happened during one run of the model, it seems to poison subsequent runs. I am guessing this might be related to the enable_mem_reuse flag, which is on by default.

yuslepukhin commented 1 year ago

@yuslepukhin One more bit of data related to the issue I am having, which might be the same issue the original reporter is having: once the SafeInt overflow has happened during one run of the model, it seems to poison subsequent runs. I am guessing this might be related to the enable_mem_reuse flag, which is on by default.

SafeInt throws an exception. It is possible that some code is not written to provide strong exception-safety guarantees.

Either way, I would need the actual model to look into it, if that is available. Also, generating some random input in the sample to avoid deps would be helpful.
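In case it helps, a rough sketch of the kind of dependency-free repro being asked for (model path, input name, shape, and output name are placeholders; only onnxruntime and the standard library are required, no OpenCV):

    #include <onnxruntime_cxx_api.h>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main()
    {
        Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro");
        Ort::SessionOptions opts;
        opts.DisableCpuMemArena();
        Ort::Session session(env, ORT_TSTR("encoder.onnx"), opts);   // placeholder model path

        std::vector<int64_t> shape{ 1, 3, 1024, 1024 };               // placeholder input shape
        size_t count = 1;
        for (int64_t d : shape) count *= static_cast<size_t>(d);

        // Random input instead of a real image, so the repro carries no image-processing deps.
        std::vector<float> input(count);
        std::mt19937 gen(42);
        std::uniform_real_distribution<float> dist(0.f, 1.f);
        for (auto& v : input) v = dist(gen);

        Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
        Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
            mem, input.data(), input.size(), shape.data(), shape.size());

        const char* input_names[] = { "input" };                      // placeholder I/O names
        const char* output_names[] = { "output" };
        try {
            auto outputs = session.Run(Ort::RunOptions{ nullptr }, input_names, &input_tensor, 1,
                                       output_names, 1);
        } catch (const Ort::Exception& e) {
            std::fprintf(stderr, "Run failed: %s\n", e.what());       // the "bad allocation" status message lands here
        }
        return 0;
    }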

Any fixes would be made in the next release, so I strongly recommend trying the most recently released version. 1.16 will be out soon.

yuslepukhin commented 1 year ago

I think this might be related: https://github.com/microsoft/onnxruntime/commit/c424e42594d92daba54f264c1c7409e53529d933