triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Why do I get a huge matrix and no detection results when using the C API to run "densenet_onnx" model inference like simple.cc? #5859

Closed wangweilovelife closed 1 year ago

wangweilovelife commented 1 year ago

Here is my code:

```cpp
// inference and the input tensors.
TRITONSERVER_InferenceRequest* irequest = nullptr;
FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestNew(
        &irequest, server.get(), model_name.c_str(), 1 /* model_version */),
    "creating inference request");

FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
    "setting ID for the request");

FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestSetReleaseCallback(
        irequest, InferRequestComplete, nullptr /* request_release_userp */),
    "setting request release callback");

// Load image and preprocess
cv::Mat image = cv::imread("/home/jingyao.ww/code/data/smoke_a388.jpg");
cv::resize(image, image, cv::Size(224, 224), cv::INTER_LINEAR);
cv::cvtColor(image, image, cv::COLOR_BGR2RGB);
image.convertTo(image, CV_32F, 1.0f / 255.0f);
std::vector<float> input_data;
if (image.isContinuous()) {
  input_data.assign(
      image.data, image.data + image.total() * image.channels());
} else {
  for (int i = 0; i < image.rows; ++i) {
    input_data.insert(
        input_data.end(), image.ptr(i),
        image.ptr(i) + image.cols * image.channels());
  }
}

auto input = "input";

size_t input_size = input_data.size();
const TRITONSERVER_DataType input_datatype = TRITONSERVER_TYPE_FP32;
std::vector<int64_t> input_shape({image.channels(), image.rows, image.cols});

FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestAddInput(
        irequest, input, input_datatype, &input_shape[0], input_size),
    "setting input 0 meta-data for the request");
```
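For reference, as I read the declarations in tritonserver.h (and the way simple.cc uses them), the last argument of TRITONSERVER_InferenceRequestAddInput is the number of dimensions in the shape, and TRITONSERVER_InferenceRequestAppendInputData takes the input size in bytes. Below is only a minimal sketch of wiring those values up for an FP32 CHW input; the tensor name "data_0" and the 3x224x224 shape are assumptions taken from the densenet_onnx example and should be checked against the model's config.pbtxt, and FAIL_IF_ERR is the macro from common.h used above:

```cpp
// Sketch only: illustrates which unit each Triton C API argument expects.
// Assumes a 3x224x224 FP32 input named "data_0" (verify against config.pbtxt).
const std::vector<int64_t> input_shape{3, 224, 224};
std::vector<float> input_data(3 * 224 * 224, 0.0f);  // filled by preprocessing

FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestAddInput(
        irequest, "data_0", TRITONSERVER_TYPE_FP32, input_shape.data(),
        input_shape.size() /* dim_count: 3 dimensions */),
    "adding input metadata to the request");

const size_t input_byte_size =
    input_data.size() * sizeof(float);  // size in bytes, not element count
FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestAppendInputData(
        irequest, "data_0", input_data.data(), input_byte_size,
        TRITONSERVER_MEMORY_CPU, 0 /* memory_type_id */),
    "assigning input data");
```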

wangweilovelife commented 1 year ago

and these are the results I get: [screenshot of the output values]

dyastremsky commented 1 year ago

Please follow the bug report template. We need this to be fully reproducible on our side to be able to investigate the issue. Please provide clear steps. Also, let us know what the issue actually is, because it looks like you are getting an output.

--

Description: A clear and concise description of what the bug is.

Triton Information: What version of Triton are you using?

Are you using the Triton container or did you build it yourself?

To Reproduce: Steps to reproduce the behavior.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Expected behavior: A clear and concise description of what you expected to happen.

wangweilovelife commented 1 year ago

Thank you for your reply.

Description: Unable to obtain correct results when deploying a detection model using the C API.

Triton Information: I followed the "Build Triton Inference Server from source" guide, taking the "Building With Docker" path with ./build.py -v --enable-all. Then I used the tritonserver_buildbase Docker image to build my own container.

To Reproduce: I modified the code of simple.cc to run the densenet_onnx model, using the model from the official example model repository at server/docs/examples/model_repository/densenet_onnx. The code:

```cpp
#include <rapidjson/document.h>
#include <rapidjson/error/en.h>

// (a number of additional #include directives for standard/system headers
//  appeared here; their names were not preserved in the issue text)

#include "common.h"
#include "triton/core/tritonserver.h"

// (two more #include directives whose names were not preserved)

#ifdef TRITON_ENABLE_GPU
// (#include for the CUDA runtime header)
#endif  // TRITON_ENABLE_GPU

#include <opencv2/core/version.hpp>
#if CV_MAJOR_VERSION == 2
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#elif CV_MAJOR_VERSION >= 3
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/imgproc.hpp>
#endif

#if CV_MAJOR_VERSION == 4
#define GET_TRANSFORMATION_CODE(x) cv::COLOR_##x
#else
#define GET_TRANSFORMATION_CODE(x) CV_##x
#endif

namespace ni = triton::server;

namespace {

bool enforce_memory_type = false;
TRITONSERVER_MemoryType requested_memory_type;

#ifdef TRITON_ENABLE_GPU
static auto cuda_data_deleter = [](void* data) {
  if (data != nullptr) {
    cudaPointerAttributes attr;
    auto cuerr = cudaPointerGetAttributes(&attr, data);
    if (cuerr != cudaSuccess) {
      std::cerr << "error: failed to get CUDA pointer attribute of " << data
                << ": " << cudaGetErrorString(cuerr) << std::endl;
    }
    if (attr.type == cudaMemoryTypeDevice) {
      cuerr = cudaFree(data);
    } else if (attr.type == cudaMemoryTypeHost) {
      cuerr = cudaFreeHost(data);
    }
    if (cuerr != cudaSuccess) {
      std::cerr << "error: failed to release CUDA pointer " << data << ": "
                << cudaGetErrorString(cuerr) << std::endl;
    }
  }
};
#endif  // TRITON_ENABLE_GPU

void
Usage(char** argv, const std::string& msg = std::string())
{
  if (!msg.empty()) {
    std::cerr << msg << std::endl;
  }

  std::cerr << "Usage: " << argv[0] << " [options]" << std::endl;
  std::cerr << "\t-m <\"system\"|\"pinned\"|gpu>"
            << " Enforce the memory type for input and output tensors."
            << " If not specified, inputs will be in system memory and outputs"
            << " will be based on the model's preferred type." << std::endl;
  std::cerr << "\t-v Enable verbose logging" << std::endl;
  std::cerr << "\t-r [model repository absolute path]" << std::endl;

  exit(1);
}

TRITONSERVER_Error*
ResponseAlloc(
    TRITONSERVER_ResponseAllocator* allocator, const char* tensor_name,
    size_t byte_size, TRITONSERVER_MemoryType preferred_memory_type,
    int64_t preferred_memory_type_id, void* userp, void** buffer,
    void** buffer_userp, TRITONSERVER_MemoryType* actual_memory_type,
    int64_t* actual_memory_type_id)
{
  // Initially attempt to make the actual memory type and id that we
  // allocate be the same as preferred memory type
  *actual_memory_type = preferred_memory_type;
  *actual_memory_type_id = preferred_memory_type_id;

  // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
  // need to do any other book-keeping.
  if (byte_size == 0) {
    *buffer = nullptr;
    *buffer_userp = nullptr;
    std::cout << "allocated " << byte_size << " bytes for result tensor "
              << tensor_name << std::endl;
  } else {
    void* allocated_ptr = nullptr;
    if (enforce_memory_type) {
      *actual_memory_type = requested_memory_type;
    }

    switch (*actual_memory_type) {
#ifdef TRITON_ENABLE_GPU

  case TRITONSERVER_MEMORY_CPU_PINNED: {
    auto err = cudaSetDevice(*actual_memory_type_id);
    if ((err != cudaSuccess) && (err != cudaErrorNoDevice) &&
        (err != cudaErrorInsufficientDriver)) {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_INTERNAL,
          std::string(
              "unable to recover current CUDA device: " +
              std::string(cudaGetErrorString(err)))
              .c_str());
    }

    err = cudaHostAlloc(&allocated_ptr, byte_size, cudaHostAllocPortable);
    if (err != cudaSuccess) {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_INTERNAL,
          std::string(
              "cudaHostAlloc failed: " +
              std::string(cudaGetErrorString(err)))
              .c_str());
    }
    break;
  }

  case TRITONSERVER_MEMORY_GPU: {
    auto err = cudaSetDevice(*actual_memory_type_id);
    if ((err != cudaSuccess) && (err != cudaErrorNoDevice) &&
        (err != cudaErrorInsufficientDriver)) {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_INTERNAL,
          std::string(
              "unable to recover current CUDA device: " +
              std::string(cudaGetErrorString(err)))
              .c_str());
    }

    err = cudaMalloc(&allocated_ptr, byte_size);
    if (err != cudaSuccess) {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_INTERNAL,
          std::string(
              "cudaMalloc failed: " + std::string(cudaGetErrorString(err)))
              .c_str());
    }
    break;
  }

#endif  // TRITON_ENABLE_GPU

  // Use CPU memory if the requested memory type is unknown
  // (default case).
  case TRITONSERVER_MEMORY_CPU:
  default: {
    *actual_memory_type = TRITONSERVER_MEMORY_CPU;
    allocated_ptr = malloc(byte_size);
    break;
  }
}

// Pass the tensor name with buffer_userp so we can show it when
// releasing the buffer.
if (allocated_ptr != nullptr) {
  *buffer = allocated_ptr;
  *buffer_userp = new std::string(tensor_name);
  std::cout << "allocated " << byte_size << " bytes in "
            << TRITONSERVER_MemoryTypeString(*actual_memory_type)
            << " for result tensor " << tensor_name << std::endl;
}

}

  return nullptr;  // Success
}

TRITONSERVER_Error*
ResponseRelease(
    TRITONSERVER_ResponseAllocator* allocator, void* buffer,
    void* buffer_userp, size_t byte_size, TRITONSERVER_MemoryType memory_type,
    int64_t memory_type_id)
{
  std::string* name = nullptr;
  if (buffer_userp != nullptr) {
    name = reinterpret_cast<std::string*>(buffer_userp);
  } else {
    name = new std::string("");
  }

  std::cout << "Releasing buffer " << buffer << " of size " << byte_size
            << " in " << TRITONSERVER_MemoryTypeString(memory_type)
            << " for result '" << *name << "'" << std::endl;
  switch (memory_type) {
    case TRITONSERVER_MEMORY_CPU:
      free(buffer);
      break;
#ifdef TRITON_ENABLE_GPU

case TRITONSERVER_MEMORY_CPU_PINNED: {
  auto err = cudaSetDevice(memory_type_id);
  if (err == cudaSuccess) {
    err = cudaFreeHost(buffer);
  }
  if (err != cudaSuccess) {
    std::cerr << "error: failed to cudaFree " << buffer << ": "
              << cudaGetErrorString(err) << std::endl;
  }
  break;
}
case TRITONSERVER_MEMORY_GPU: {
  auto err = cudaSetDevice(memory_type_id);
  if (err == cudaSuccess) {
    err = cudaFree(buffer);
  }
  if (err != cudaSuccess) {
    std::cerr << "error: failed to cudaFree " << buffer << ": "
              << cudaGetErrorString(err) << std::endl;
  }
  break;
}

#endif  // TRITON_ENABLE_GPU

default:
  std::cerr << "error: unexpected buffer allocated in CUDA managed memory"
            << std::endl;
  break;

}

delete name;

  return nullptr;  // Success
}

void
InferRequestComplete(
    TRITONSERVER_InferenceRequest* request, const uint32_t flags, void* userp)
{
  // We reuse the request so we don't delete it here.
}

void
InferResponseComplete(
    TRITONSERVER_InferenceResponse* response, const uint32_t flags, void* userp)
{
  if (response != nullptr) {
    // Send 'response' to the future.
    std::promise<TRITONSERVER_InferenceResponse*>* p =
        reinterpret_cast<std::promise<TRITONSERVER_InferenceResponse*>*>(userp);
    p->set_value(response);
    delete p;
  }
}

TRITONSERVER_Error*
ParseModelMetadata(
    const rapidjson::Document& model_metadata, bool* is_int,
    bool* is_torch_model)
{
  std::string seen_data_type;
  for (const auto& input : model_metadata["inputs"].GetArray()) {
    if (strcmp(input["datatype"].GetString(), "INT32") &&
        strcmp(input["datatype"].GetString(), "FP32")) {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_UNSUPPORTED,
          "simple lib example only supports model with data type INT32 or "
          "FP32");
    }
    if (seen_data_type.empty()) {
      seen_data_type = input["datatype"].GetString();
    } else if (strcmp(seen_data_type.c_str(), input["datatype"].GetString())) {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_INVALID_ARG,
          "the inputs and outputs of 'simple' model must have the data type");
    }
  }
  for (const auto& output : model_metadata["outputs"].GetArray()) {
    if (strcmp(output["datatype"].GetString(), "INT32") &&
        strcmp(output["datatype"].GetString(), "FP32")) {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_UNSUPPORTED,
          "simple lib example only supports model with data type INT32 or "
          "FP32");
    } else if (strcmp(seen_data_type.c_str(), output["datatype"].GetString())) {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_INVALID_ARG,
          "the inputs and outputs of 'simple' model must have the data type");
    }
  }

  *is_int = (strcmp(seen_data_type.c_str(), "INT32") == 0);
  *is_torch_model =
      (strcmp(model_metadata["platform"].GetString(), "pytorch_libtorch") == 0);
  return nullptr;
}

template <typename T>
void
GenerateInputData(
    std::vector<char>* input0_data, std::vector<char>* input1_data)
{
  input0_data->resize(16 * sizeof(T));
  input1_data->resize(16 * sizeof(T));
  for (size_t i = 0; i < 16; ++i) {
    ((T*)input0_data->data())[i] = i;
    ((T*)input1_data->data())[i] = 1;
  }
}

template <typename T>
void
CompareResult(
    const std::string& output0_name, const std::string& output1_name,
    const void* input0, const void* input1, const char* output0,
    const char* output1)
{
  for (size_t i = 0; i < 16; ++i) {
    std::cout << ((T*)input0)[i] << " + " << ((T*)input1)[i] << " = "
              << ((T*)output0)[i] << std::endl;
    std::cout << ((T*)input0)[i] << " - " << ((T*)input1)[i] << " = "
              << ((T*)output1)[i] << std::endl;

if ((((T*)input0)[i] + ((T*)input1)[i]) != ((T*)output0)[i]) {
  FAIL("incorrect sum in " + output0_name);
}
if ((((T*)input0)[i] - ((T*)input1)[i]) != ((T*)output1)[i]) {
  FAIL("incorrect difference in " + output1_name);
}

  }
}

void
Check(
    TRITONSERVER_InferenceResponse* response,
    const std::vector<char>& input0_data, const std::vector<char>& input1_data,
    const std::string& output0, const std::string& output1,
    const size_t expected_byte_size,
    const TRITONSERVER_DataType expected_datatype, const bool is_int)
{
  std::unordered_map<std::string, std::vector<char>> output_data;

  uint32_t output_count;
  FAIL_IF_ERR(
      TRITONSERVER_InferenceResponseOutputCount(response, &output_count),
      "getting number of response outputs");
  if (output_count != 2) {
    FAIL("expecting 2 response outputs, got " + std::to_string(output_count));
  }

  for (uint32_t idx = 0; idx < output_count; ++idx) {
    const char* cname;
    TRITONSERVER_DataType datatype;
    const int64_t* shape;
    uint64_t dim_count;
    const void* base;
    size_t byte_size;
    TRITONSERVER_MemoryType memory_type;
    int64_t memory_type_id;
    void* userp;

FAIL_IF_ERR(
    TRITONSERVER_InferenceResponseOutput(
        response, idx, &cname, &datatype, &shape, &dim_count, &base,
        &byte_size, &memory_type, &memory_type_id, &userp),
    "getting output info");

if (cname == nullptr) {
  FAIL("unable to get output name");
}

std::string name(cname);
if ((name != output0) && (name != output1)) {
  FAIL("unexpected output '" + name + "'");
}

if ((dim_count != 2) || (shape[0] != 1) || (shape[1] != 16)) {
  FAIL("unexpected shape for '" + name + "'");
}

if (datatype != expected_datatype) {
  FAIL(
      "unexpected datatype '" +
      std::string(TRITONSERVER_DataTypeString(datatype)) + "' for '" +
      name + "'");
}

if (byte_size != expected_byte_size) {
  FAIL(
      "unexpected byte-size, expected " +
      std::to_string(expected_byte_size) + ", got " +
      std::to_string(byte_size) + " for " + name);
}

if (enforce_memory_type && (memory_type != requested_memory_type)) {
  FAIL(
      "unexpected memory type, expected to be allocated in " +
      std::string(TRITONSERVER_MemoryTypeString(requested_memory_type)) +
      ", got " + std::string(TRITONSERVER_MemoryTypeString(memory_type)) +
      ", id " + std::to_string(memory_type_id) + " for " + name);
}

// We make a copy of the data here... which we could avoid for
// performance reasons but ok for this simple example.
std::vector<char>& odata = output_data[name];
switch (memory_type) {
  case TRITONSERVER_MEMORY_CPU: {
    std::cout << name << " is stored in system memory" << std::endl;
    const char* cbase = reinterpret_cast<const char*>(base);
    odata.assign(cbase, cbase + byte_size);
    break;
  }

  case TRITONSERVER_MEMORY_CPU_PINNED: {
    std::cout << name << " is stored in pinned memory" << std::endl;
    const char* cbase = reinterpret_cast<const char*>(base);
    odata.assign(cbase, cbase + byte_size);
    break;
  }

#ifdef TRITON_ENABLE_GPU

  case TRITONSERVER_MEMORY_GPU: {
    std::cout << name << " is stored in GPU memory" << std::endl;
    odata.reserve(byte_size);
    FAIL_IF_CUDA_ERR(
        cudaMemcpy(&odata[0], base, byte_size, cudaMemcpyDeviceToHost),
        "getting " + name + " data from GPU memory");
    break;
  }

#endif

  default:
    FAIL("unexpected memory type");
}

}

  if (is_int) {
    CompareResult<int32_t>(
        output0, output1, &input0_data[0], &input1_data[0],
        output_data[output0].data(), output_data[output1].data());
  } else {
    CompareResult<float>(
        output0, output1, &input0_data[0], &input1_data[0],
        output_data[output0].data(), output_data[output1].data());
  }
}

} // namespace

int
main(int argc, char** argv)
{
  std::string model_repository_path;
  int verbose_level = 0;

  // Parse commandline...
  int opt;
  while ((opt = getopt(argc, argv, "vm:r:")) != -1) {
    switch (opt) {
      case 'm': {
        enforce_memory_type = true;
        if (!strcmp(optarg, "system")) {
          requested_memory_type = TRITONSERVER_MEMORY_CPU;
        } else if (!strcmp(optarg, "pinned")) {
          requested_memory_type = TRITONSERVER_MEMORY_CPU_PINNED;
        } else if (!strcmp(optarg, "gpu")) {
          requested_memory_type = TRITONSERVER_MEMORY_GPU;
        } else {
          Usage(
              argv,
              "-m must be used to specify one of the following types:"
              " <\"system\"|\"pinned\"|gpu>");
        }
        break;
      }
      case 'r':
        model_repository_path = optarg;
        break;
      case 'v':
        verbose_level = 1;
        break;
      case '?':
        Usage(argv);
        break;
    }
  }

  if (model_repository_path.empty()) {
    Usage(argv, "-r must be used to specify model repository path");
  }

#ifndef TRITON_ENABLE_GPU
  if (enforce_memory_type && requested_memory_type != TRITONSERVER_MEMORY_CPU) {
    Usage(argv, "-m can only be set to \"system\" without enabling GPU");
  }
#endif  // TRITON_ENABLE_GPU

  // Check API version. This compares the API version of the
  // triton-server library linked into this application against the
  // API version of the header file used when compiling this
  // application. The API version of the shared library must be >= the
  // API version used when compiling this application.
  uint32_t api_version_major, api_version_minor;
  FAIL_IF_ERR(
      TRITONSERVER_ApiVersion(&api_version_major, &api_version_minor),
      "getting Triton API version");
  if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major) ||
      (TRITONSERVER_API_VERSION_MINOR > api_version_minor)) {
    FAIL("triton server API version mismatch");
  }

  // Create the option setting to use when creating the inference
  // server object.
  TRITONSERVER_ServerOptions* server_options = nullptr;
  FAIL_IF_ERR(
      TRITONSERVER_ServerOptionsNew(&server_options),
      "creating server options");
  FAIL_IF_ERR(
      TRITONSERVER_ServerOptionsSetModelRepositoryPath(
          server_options, model_repository_path.c_str()),
      "setting model repository path");
  FAIL_IF_ERR(
      TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
      "setting verbose logging level");
  FAIL_IF_ERR(
      TRITONSERVER_ServerOptionsSetBackendDirectory(
          server_options,
          "/home/jingyao.ww/code/server/build/install/backends"),
      "setting backend directory");
  FAIL_IF_ERR(
      TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
          server_options,
          "/home/jingyao.ww/code/server/build/install/repoagents"),
      "setting repository agent directory");
  FAIL_IF_ERR(
      TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
      "setting strict model configuration");

#ifdef TRITON_ENABLE_GPU
  double min_compute_capability = TRITON_MIN_COMPUTE_CAPABILITY;
#else
  double min_compute_capability = 0;
#endif  // TRITON_ENABLE_GPU

  FAIL_IF_ERR(
      TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability(
          server_options, min_compute_capability),
      "setting minimum supported CUDA compute capability");

  // Create the server object using the option settings. The server
  // object encapsulates all the functionality of the Triton server
  // and allows access to the Triton server API. Typically only a
  // single server object is needed by an application, but it is
  // allowed to create multiple server objects within a single
  // application. After the server object is created the server
  // options can be deleted.
  TRITONSERVER_Server* server_ptr = nullptr;
  FAIL_IF_ERR(
      TRITONSERVER_ServerNew(&server_ptr, server_options),
      "creating server object");
  FAIL_IF_ERR(
      TRITONSERVER_ServerOptionsDelete(server_options),
      "deleting server options");

  // Use a shared_ptr to manage the lifetime of the server object.
  std::shared_ptr<TRITONSERVER_Server> server(
      server_ptr, TRITONSERVER_ServerDelete);

  // Wait until the server is both live and ready. The server will not
  // appear "ready" until all models are loaded and ready to receive
  // inference requests.
  size_t health_iters = 0;
  while (true) {
    bool live, ready;
    FAIL_IF_ERR(
        TRITONSERVER_ServerIsLive(server.get(), &live),
        "unable to get server liveness");
    FAIL_IF_ERR(
        TRITONSERVER_ServerIsReady(server.get(), &ready),
        "unable to get server readiness");
    std::cout << "Server Health: live " << live << ", ready " << ready
              << std::endl;
    if (live && ready) {
      break;
    }

if (++health_iters >= 10) {
  FAIL("failed to find healthy inference server");
}

std::this_thread::sleep_for(std::chrono::milliseconds(500));

}

  // Server metadata can be accessed using the server object. The
  // metadata is returned as an abstract TRITONSERVER_Message that can
  // be converted to JSON for further processing.
  {
    TRITONSERVER_Message* server_metadata_message;
    FAIL_IF_ERR(
        TRITONSERVER_ServerMetadata(server.get(), &server_metadata_message),
        "unable to get server metadata message");
    const char* buffer;
    size_t byte_size;
    FAIL_IF_ERR(
        TRITONSERVER_MessageSerializeToJson(
            server_metadata_message, &buffer, &byte_size),
        "unable to serialize server metadata message");

std::cout << "Server Metadata:" << std::endl;
std::cout << std::string(buffer, byte_size) << std::endl;

FAIL_IF_ERR(
    TRITONSERVER_MessageDelete(server_metadata_message),
    "deleting server metadata message");

}

const std::string model_name("densenet_onnx");

  // We already waited for the server to be ready, above, so we know
  // that all models are also ready. But as an example we also wait
  // for a specific model to become available.
  bool is_torch_model = false;
  bool is_int = true;
  bool is_ready = false;
  health_iters = 0;
  while (!is_ready) {
    FAIL_IF_ERR(
        TRITONSERVER_ServerModelIsReady(
            server.get(), model_name.c_str(), 1 /* model_version */,
            &is_ready),
        "unable to get model readiness");
    if (!is_ready) {
      if (++health_iters >= 10) {
        FAIL("model failed to be ready in 10 iterations");
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(500));
      continue;
    }

TRITONSERVER_Message* model_metadata_message;
FAIL_IF_ERR(
    TRITONSERVER_ServerModelMetadata(
        server.get(), model_name.c_str(), 1, &model_metadata_message),
    "unable to get model metadata message");
const char* buffer;
size_t byte_size;
FAIL_IF_ERR(
    TRITONSERVER_MessageSerializeToJson(
        model_metadata_message, &buffer, &byte_size),
    "unable to serialize model metadata");

// Parse the JSON string that represents the model metadata into a
// JSON document. We use rapidjson for this parsing but any JSON
// parser can be used.
rapidjson::Document model_metadata;
model_metadata.Parse(buffer, byte_size);
if (model_metadata.HasParseError()) {
  FAIL(
      "error: failed to parse model metadata from JSON: " +
      std::string(GetParseError_En(model_metadata.GetParseError())) +
      " at " + std::to_string(model_metadata.GetErrorOffset()));
}

FAIL_IF_ERR(
    TRITONSERVER_MessageDelete(model_metadata_message),
    "deleting model metadata message");

// Now that we have a document representation of the model
// metadata, we can query it to extract some information about the
// model.
if (strcmp(model_metadata["name"].GetString(), model_name.c_str())) {
  FAIL("unable to find metadata for model");
}

bool found_version = false;
if (model_metadata.HasMember("versions")) {
  for (const auto& version : model_metadata["versions"].GetArray()) {
    if (strcmp(version.GetString(), "1") == 0) {
      found_version = true;
      break;
    }
  }
}
if (!found_version) {
  FAIL("unable to find version 1 status for model");
}

}

  // When triton needs a buffer to hold an output tensor, it will ask
  // us to provide the buffer. In this way we can have any buffer
  // management and sharing strategy that we want. To communicate to
  // triton the functions that we want it to call to perform the
  // allocations, we create a "response allocator" object. We pass
  // this response allocator object to triton when requesting
  // inference. We can reuse this response allocator object for any
  // number of inference requests.
  TRITONSERVER_ResponseAllocator* allocator = nullptr;
  FAIL_IF_ERR(
      TRITONSERVER_ResponseAllocatorNew(
          &allocator, ResponseAlloc, ResponseRelease, nullptr /* start_fn */),
      "creating response allocator");

  // Create an inference request object. The inference request object
  // is where we set the name of the model we want to use for
  // inference and the input tensors.
  TRITONSERVER_InferenceRequest* irequest = nullptr;
  FAIL_IF_ERR(
      TRITONSERVER_InferenceRequestNew(
          &irequest, server.get(), model_name.c_str(), 1 /* model_version */),
      "creating inference request");

  FAIL_IF_ERR(
      TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
      "setting ID for the request");

  FAIL_IF_ERR(
      TRITONSERVER_InferenceRequestSetReleaseCallback(
          irequest, InferRequestComplete, nullptr /* request_release_userp */),
      "setting request release callback");

  // Load image and preprocess
  cv::Mat image = cv::imread("/home/jingyao.ww/code/data/smoke_a388.jpg");
  cv::resize(image, image, cv::Size(224, 224), cv::INTER_LINEAR);
  cv::cvtColor(image, image, cv::COLOR_BGR2RGB);
  image.convertTo(image, CV_32F, 1.0f / 255.0f);
  std::vector<float> input_data;
  if (image.isContinuous()) {
    input_data.assign(
        image.data, image.data + image.total() * image.channels());
  } else {
    for (int i = 0; i < image.rows; ++i) {
      input_data.insert(
          input_data.end(), image.ptr(i),
          image.ptr(i) + image.cols * image.channels());
    }
  }

auto input = "data_0";

size_t input_size = input_data.size();
const TRITONSERVER_DataType input_datatype = TRITONSERVER_TYPE_FP32;
std::vector<int64_t> input_shape({image.channels(), image.rows, image.cols});

FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestAddInput(
        irequest, input, input_datatype, &input_shape[0], input_size),
    "setting input 0 meta-data for the request");

// #ifdef TRITON_ENABLE_GPU
//   std::unique_ptr<void, decltype(cuda_data_deleter)> input0_gpu(
//       nullptr, cuda_data_deleter);
//   bool use_cuda_memory =
//       (enforce_memory_type &&
//        (requested_memory_type != TRITONSERVER_MEMORY_CPU));
//   if (use_cuda_memory) {
//     FAIL_IF_CUDA_ERR(cudaSetDevice(0), "setting CUDA device to device 0");
//     if (requested_memory_type != TRITONSERVER_MEMORY_CPU_PINNED) {
//       void* dst;
//       FAIL_IF_CUDA_ERR(
//           cudaMalloc(&dst, input_size),
//           "allocating GPU memory for INPUT0 data");
//       input0_gpu.reset(dst);
//       FAIL_IF_CUDA_ERR(
//           cudaMemcpy(dst, &input_data[0], input_size, cudaMemcpyHostToDevice),
//           "setting INPUT0 data in GPU memory");
//     } else {
//       void* dst;
//       FAIL_IF_CUDA_ERR(
//           cudaHostAlloc(&dst, input_size, cudaHostAllocPortable),
//           "allocating pinned memory for INPUT0 data");
//       input0_gpu.reset(dst);
//       FAIL_IF_CUDA_ERR(
//           cudaMemcpy(dst, &input_data[0], input_size, cudaMemcpyHostToHost),
//           "setting INPUT0 data in pinned memory");
//     }
//   }

//   input0_base = use_cuda_memory ? input0_gpu.get() : &input_data[0];
// #endif  // TRITON_ENABLE_GPU
const void* input0_base = &input_data[0];

FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestAppendInputData(
        irequest, input, input0_base, input_size, requested_memory_type,
        0 /* memory_type_id */),
    "assigning INPUT0 data");

// Perform inference by calling TRITONSERVER_ServerInferAsync. This
// call is asynchronous and therefore returns immediately. The
// completion of the inference and delivery of the response is done
// by triton by calling the "response complete" callback functions
// (InferResponseComplete in this case).
{
  auto p = new std::promise<TRITONSERVER_InferenceResponse*>();
  std::future<TRITONSERVER_InferenceResponse*> completed = p->get_future();

FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestSetResponseCallback(
    irequest, allocator, nullptr /* response_allocator_userp */,
    InferResponseComplete, reinterpret_cast<void*>(p)),
    "setting response callback");

FAIL_IF_ERR(TRITONSERVER_ServerInferAsync(server.get(), irequest, nullptr /* trace */),"running inference");

TRITONSERVER_InferenceResponse* completed_response = completed.get();

const char* cname;
TRITONSERVER_DataType datatype;
const int64_t* shape;
uint64_t dim_count;
const void* base;
size_t byte_size;
TRITONSERVER_MemoryType memory_type;
int64_t memory_type_id;
void* userp;
const char* buffer;

FAIL_IF_ERR(
    TRITONSERVER_InferenceResponseOutput(
        completed_response, 0, &cname, &datatype, &shape, &dim_count, &base,
        &byte_size, &memory_type, &memory_type_id, &userp),
    "getting output info");

size_t num_elements = byte_size / sizeof(float);
std::vector<float> output_data_ptr(dim_count);
std::memcpy(output_data_ptr.data(), base, byte_size);

// Open output file
std::ofstream output_file("/home/jingyao.ww/code/output.txt");
if (!output_file.is_open()) {
    printf("Failed to open output file.\n");
    return 1;
}

for (size_t i = 0; i < num_elements; i++) {
    output_file << output_data_ptr[i] << " ";  
}
output_file.close();

FAIL_IF_ERR(TRITONSERVER_InferenceResponseError(completed_response),"response status"); 

}

FAIL_IF_ERR(
    TRITONSERVER_InferenceRequestDelete(irequest),
    "deleting inference request");

FAIL_IF_ERR(
    TRITONSERVER_ResponseAllocatorDelete(allocator),
    "deleting response allocator");

  return 0;
}

```

Then I compile Triton as below:

```shell
cd ${TRITON_SERVER_ROOT}
mkdir builddir
cd builddir
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_INSTALL_PREFIX:PATH=builddir/tritonserver/install \
      -DTRITON_COMMON_REPO_TAG:STRING=main \
      -DTRITON_CORE_REPO_TAG:STRING=main \
      -DTRITON_BACKEND_REPO_TAG:STRING=main \
      -DTRITON_THIRD_PARTY_REPO_TAG:STRING=main \
      -DTRITON_ENABLE_LOGGING:BOOL=ON \
      -DTRITON_ENABLE_STATS:BOOL=ON \
      -DTRITON_ENABLE_METRICS:BOOL=ON \
      -DTRITON_ENABLE_METRICS_GPU:BOOL=ON \
      -DTRITON_ENABLE_TRACING:BOOL=ON \
      -DTRITON_ENABLE_GPU:BOOL=ON \
      -DTRITON_MIN_COMPUTE_CAPABILITY=6.0 \
      -DTRITON_ENABLE_GRPC:BOOL=ON \
      -DTRITON_ENABLE_HTTP:BOOL=ON \
      -DTRITON_ENABLE_GCS:BOOL=OFF \
      -DTRITON_ENABLE_S3:BOOL=OFF \
      -DTRITON_ENABLE_AZURE_STORAGE:BOOL=OFF \
      -DTRITON_ENABLE_TENSORFLOW=OFF \
      -DTRITON_ENABLE_TENSORRT=OFF \
      -DTRITON_ENABLE_CUSTOM=ON \
      -DTRITON_ENABLE_ENSEMBLE=OFF \
      -DTRITON_ENABLE_ONNXRUNTIME=OFF \
      -DTRITON_ENABLE_PYTHON=OFF \
      -DTRITON_ENABLE_DALI=OFF \
      -DTRITON_ENABLE_PYTORCH=OFF \
      -DTRITON_EXTRA_LIB_PATHS=/opt/tritonserver/lib \
      ..
make -j32
```

At last, I run:

```shell
./simple -r /home/jingyao.ww/code/server/docs/examples/model_repository -m gpu
```

Expected behavior: I expected to get detection results. Instead, the actual result is a huge matrix of values, so I cannot get detection results when using the C API to run "densenet_onnx" model inference like simple.cc.
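For what it's worth: if the model is the unmodified densenet_onnx example from the docs model repository, it is an ImageNet classifier rather than a detector, so the values written to output.txt would be a flat vector of roughly 1000 class scores instead of detection boxes; that is an assumption to verify against the model's config.pbtxt and labels file. Below is only a minimal sketch of post-processing such a buffer into a top-5 prediction, where base and byte_size are assumed to be the values returned by TRITONSERVER_InferenceResponseOutput for the FP32 output:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Sketch: turn a flat FP32 score buffer into the top-k (index, score) pairs.
// 'base'/'byte_size' are assumed to come from TRITONSERVER_InferenceResponseOutput.
void PrintTopK(const void* base, size_t byte_size, size_t k = 5)
{
  const float* scores = reinterpret_cast<const float*>(base);
  const size_t num_classes = byte_size / sizeof(float);
  const size_t top = std::min(k, num_classes);

  // Sort class indices by descending score, but only the first 'top' of them.
  std::vector<size_t> indices(num_classes);
  for (size_t i = 0; i < num_classes; ++i) indices[i] = i;
  std::partial_sort(
      indices.begin(), indices.begin() + top, indices.end(),
      [&](size_t a, size_t b) { return scores[a] > scores[b]; });

  for (size_t i = 0; i < top; ++i) {
    std::cout << "class " << indices[i] << " score " << scores[indices[i]]
              << std::endl;
  }
}
```

If the example's densenet_labels.txt is present, the printed class indices can be mapped to its lines to get human-readable labels.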

wangweilovelife commented 1 year ago


Hi, I have identified the call that causes the issue:

```cpp
FAIL_IF_ERR(
    TRITONSERVER_ServerInferAsync(
        server.get(), irequest, nullptr /* trace */),
    "running inference");
```

What causes this error?
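For reference, one way to see what Triton itself reports for a request like this is to check the response error as soon as the future resolves and before reading any outputs. The helper below is only an illustrative sketch (it is not part of simple.cc), reusing the TRITONSERVER error API:

```cpp
#include <iostream>

#include "triton/core/tritonserver.h"

// Sketch: report any error carried by a completed response before reading
// outputs. Returns true when the response is usable.
bool ResponseIsOk(TRITONSERVER_InferenceResponse* completed_response)
{
  TRITONSERVER_Error* err =
      TRITONSERVER_InferenceResponseError(completed_response);
  if (err != nullptr) {
    std::cerr << "inference failed: " << TRITONSERVER_ErrorCodeString(err)
              << " - " << TRITONSERVER_ErrorMessage(err) << std::endl;
    TRITONSERVER_ErrorDelete(err);
    return false;
  }
  return true;
}
```

Called right after completed.get() in the code above, and only reading outputs with TRITONSERVER_InferenceResponseOutput when it returns true, this should surface the backend's actual error message (for example a shape or datatype mismatch) instead of whatever happens to be in the allocated output buffer.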

wangweilovelife commented 1 year ago

I have fixed it.