microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

CreateSessionFromArray doesn't work #21946

Open woaixiaoxiao opened 2 months ago

woaixiaoxiao commented 2 months ago

Describe the issue

I want multiple threads to load the same model and run inference in a data-parallel manner. To reduce memory usage, I want to avoid having each session read the ONNX file from disk into its own copy in memory. My current approach is to read the ONNX file into memory once and then create the sessions with CreateSessionFromArray, following this issue: https://github.com/microsoft/onnxruntime/issues/8328. However, it does not work as expected: CreateSessionFromArray does not reduce memory usage.

To reproduce

You can use the Python script below to export the ONNX file, and then run the C++ code that follows it.

import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_size, output_size, out_channels, num_layers, device):
        super(LSTM, self).__init__()
        self.device = device
        self.input_size = input_size
        self.hidden_size = input_size
        self.num_layers = num_layers
        self.output_size = output_size

        self.lstm = nn.LSTM(input_size=self.input_size,
                            hidden_size=self.hidden_size,
                            num_layers=self.num_layers,
                            batch_first=True)

        self.out_channels = out_channels

        self.fc = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, x):
        out, _ = self.lstm(x)

        if self.out_channels == 1:
            out = out[:, -1, :]
            out = self.fc(out)
            return out

        return out

batch_size = 1
input_size = 20
seq_len = 5
output_size = 10
num_layers = 1000
out_channels = 1

model = LSTM(input_size, output_size, out_channels, num_layers, "cpu")
model.eval() 

input_names = ["input"]    
output_names  = ["output"]  

x = torch.randn((batch_size, seq_len, input_size))
y = model(x)

torch.onnx.export(model, x, 'lstm.onnx', verbose=True, input_names=input_names, output_names=output_names,
  dynamic_axes={'input':[0], 'output':[0]} )

#include "onnxruntime_c_api.h"
#include "onnxruntime_session_options_config_keys.h"
#include <chrono>
#include <cstddef>
#include <fstream>
#include <ios>
#include <iostream>
#include <memory>
#include <onnxruntime_cxx_api.h>
#include <unistd.h>
#include <vector>

std::vector<char> loadModel(const char *model_path) {
  std::ifstream model_file(model_path, std::ios::binary | std::ios::ate);
  if (!model_file.is_open()) {
    throw std::runtime_error("failed to open model file");
  }

  std::streamsize size = model_file.tellg();
  model_file.seekg(0, std::ios::beg);

  std::vector<char> buffer(size);
  if (!model_file.read(buffer.data(), size)) {
    throw std::runtime_error("failed to read model file");
  }

  return buffer;
}

inline size_t getCurrentRSS() {
  std::ifstream stat_stream("/proc/self/stat", std::ios_base::in);
  std::string pid, comm, state, ppid, pgrp, session, tty_nr;
  std::string tpgid, flags, minflt, cminflt, majflt, cmajflt;
  std::string utime, stime, cutime, cstime, priority, nice;
  std::string O, itrealvalue, starttime;
  unsigned long vsize;
  long rss;
  stat_stream >> pid >> comm >> state >> ppid >> pgrp >> session >> tty_nr >>
      tpgid >> flags >> minflt >> cminflt >> majflt >> cmajflt >> utime >>
      stime >> cutime >> cstime >> priority >> nice >> O >> itrealvalue >>
      starttime >> vsize >> rss;
  stat_stream.close();
  return rss * sysconf(_SC_PAGE_SIZE);
}

inline size_t checkMemoryUsage(const std::string &point) {
  size_t memory = getCurrentRSS();
  std::cout << "memory usage at " << point << ": " << memory / (1024.0 * 1024.0)
            << " MB" << std::endl;
  return memory;
}

// ref_func does not use CreateSessionFromArray
std::vector<float> ref_func(int thread_num) {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "Default");
  Ort::SessionOptions session_options;
  session_options.SetIntraOpNumThreads(1);
  session_options.SetGraphOptimizationLevel(
      GraphOptimizationLevel::ORT_ENABLE_ALL);

  const char *model_path = "../5_rnn/lstm.onnx";

  size_t before = checkMemoryUsage("before create session");
  // create a thread pool and multiple sessions
  std::vector<std::unique_ptr<Ort::Session>> sessions;
  for (int i = 0; i < thread_num; ++i) {
    sessions.push_back(
        std::make_unique<Ort::Session>(env, model_path, session_options));
  }
  size_t after = checkMemoryUsage("after create session");
  std::cout << "create session memory usage: "
            << (after - before) / (1024.0 * 1024.0) << " MB" << std::endl;
  return std::vector<float>();
}

std::vector<float> test_func(int thread_num) {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "Default");

  Ort::SessionOptions session_options;
  session_options.SetIntraOpNumThreads(1);
  session_options.SetGraphOptimizationLevel(
      GraphOptimizationLevel::ORT_ENABLE_ALL);
  session_options.AddConfigEntry(
      kOrtSessionOptionsConfigUseORTModelBytesDirectly, "1");

  const char *model_path = "../5_rnn/lstm.onnx";
  std::vector<char> model_data = loadModel(model_path);

  std::cout << "model size: " << model_data.size() / (1024.0 * 1024.0) << " MB"
            << std::endl;

  auto before_create_session = checkMemoryUsage("before create session");

  // create a thread pool and multiple sessions
  std::vector<std::unique_ptr<Ort::Session>> sessions;
  for (int i = 0; i < thread_num; ++i) {

    sessions.push_back(std::make_unique<Ort::Session>(
        env, model_data.data(), model_data.size(), session_options));
  }

  auto after_create_session = checkMemoryUsage("after create session");
  std::cout << "create session memory usage: "
            << (after_create_session - before_create_session) /
                   (1024.0 * 1024.0)
            << " MB" << std::endl;

  return std::vector<float>();
}

int main() {
  //   std::cout << "========= does not use CreateSessionFromArray ========= "
  //             << std::endl;
  //   ref_func(8);
  std::cout << "========= use CreateSessionFromArray ========= " << std::endl;
  test_func(8);
  return 0;
}

Urgency

No response

Platform

Linux

OS Version

centos8

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-linux-x64-gpu-1.19.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

woaixiaoxiao commented 2 months ago

One interesting thing is that only one of the functions can be tested per run; otherwise the results may be inaccurate, because the memory used by the previous function is not released in time.

(screenshots of the test output showing memory usage)

skottmckay commented 2 months ago

InferenceSession::Run is stateless and can be called concurrently. Given that, do you need multiple sessions with the same model?
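
For reference, a minimal sketch of that approach, assuming the lstm.onnx exported by the script above (input "input" of shape [1, 5, 20], output "output"): one Ort::Session is created once, and each thread calls Run on it concurrently, so only one copy of the model is held in memory.

#include <onnxruntime_cxx_api.h>
#include <array>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared");
  Ort::SessionOptions opts;
  opts.SetIntraOpNumThreads(1);

  // One session holds a single copy of the model in memory.
  Ort::Session session(env, "lstm.onnx", opts);

  auto worker = [&session]() {
    Ort::MemoryInfo mem =
        Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::array<int64_t, 3> shape{1, 5, 20};  // batch, seq_len, input_size
    std::vector<float> input(1 * 5 * 20, 0.0f);
    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        mem, input.data(), input.size(), shape.data(), shape.size());
    const char *in_names[] = {"input"};
    const char *out_names[] = {"output"};
    // Run is safe to call concurrently on the same session object.
    auto outputs = session.Run(Ort::RunOptions{nullptr}, in_names, &tensor, 1,
                               out_names, 1);
  };

  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i) threads.emplace_back(worker);
  for (auto &t : threads) t.join();
  return 0;
}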

The settings to use bytes directly require an ORT format model. See https://onnxruntime.ai/docs/performance/model-optimizations/ort-format-models.html#convert-onnx-models-to-ort-format
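
For reference, the linked page describes converting an ONNX model to ORT format with the onnxruntime Python package, roughly `python -m onnxruntime.tools.convert_onnx_models_to_ort lstm.onnx`; the resulting .ort file is what the bytes-directly option is intended to be used with.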

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.