Manutea opened this issue 1 year ago
Is your benchmark code just adding a loop to this test - https://github.com/microsoft/onnxruntime/blob/870b0bc305e163801b47dc4989927a2274cf1e07/onnxruntime/test/shared_lib/test_inference.cc#L1759 ?
The block of code whose duration is measured contains an allocation (GetAllocation()) and two device synchronizations (binding.SynchronizeInputs() and binding.SynchronizeOutputs()). Even if the allocation is for the same number of bytes each time and no real allocation happens on every iteration (because of an underlying memory pool in the allocator), I would move it out of the timed block. In any case, the device synchronizations you have there to ensure that the copy on the default stream has completed (cudaMemcpy(input.get(), image.data(), sizeof(float) * image.size(), cudaMemcpyHostToDevice);) may be contributing to the variance if the device was doing something else at that time. I think SynchronizeInputs() is superfluous because cudaMemcpy() is a blocking call anyway, so the data has already been copied into the CUDA buffer by the time it returns. SynchronizeOutputs() isn't really needed either, since Run() should do a stream sync before returning.
Do you see such variance when the input is already on the right device and no IOBinding is used, i.e. you supply OrtValues backed by CUDA memory via a regular Run() call (no IOBinding)?
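Roughly, what I have in mind is something like the sketch below (not a drop-in replacement: the session must already have the CUDA EP registered, the allocator setup mirrors the linked test, and the tensor names/shape are placeholders that need to match your model). The allocation, the host-to-device copy and any synchronization happen once, outside the timed region, and only Run() is measured:

// Minimal sketch, not the original benchmark: input already resides in CUDA
// memory, no IOBinding, and only session.Run() sits inside the timed block.
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <array>
#include <chrono>
#include <iostream>
#include <vector>

void run_with_cuda_input(Ort::Session &session, int device_id)
{
    // Assumed tensor names and shape; adjust to the model being benchmarked.
    std::array<int64_t, 4> dims{1, 3, 224, 224};
    std::vector<float> host_input(1 * 3 * 224 * 224, 150.0f);
    const char *input_names[] = {"input"};
    const char *output_names[] = {"output"};

    // CUDA memory info and an allocator registered with this session's CUDA EP,
    // following the pattern of the linked test.
    Ort::MemoryInfo cuda_mem_info("Cuda", OrtArenaAllocator, device_id, OrtMemTypeDefault);
    Ort::Allocator cuda_allocator(session, cuda_mem_info);

    // One-time device allocation and host-to-device copy, outside the timed loop.
    auto allocation = cuda_allocator.GetAllocation(host_input.size() * sizeof(float));
    cudaMemcpy(allocation.get(), host_input.data(),
               host_input.size() * sizeof(float), cudaMemcpyHostToDevice);

    // OrtValue wrapping the CUDA buffer; no IOBinding involved.
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        cuda_mem_info, static_cast<float *>(allocation.get()), host_input.size(),
        dims.data(), dims.size());

    for (int i = 0; i < 500; ++i)
    {
        auto start = std::chrono::high_resolution_clock::now();
        auto outputs = session.Run(Ort::RunOptions{nullptr}, input_names,
                                   &input_tensor, 1, output_names, 1);
        auto end = std::chrono::high_resolution_clock::now();
        std::cout << i << " -- "
                  << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count()
                  << " ns\n";
    }
}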
Hello, thank you for replying.
I tried without using IOBinding.
#include <onnxruntime_cxx_api.h>

#include <array>
#include <chrono>
#include <fstream>
#include <string>
#include <vector>

void onnx_benchmark_GPU(std::string &modelPath, int deviceId)
{
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ModelInference");

    Ort::SessionOptions options;
    options.EnableProfiling("gpu_profile_file");
    // Register the CUDA execution provider; throw if registration fails.
    Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_CUDA(options, deviceId));

    Ort::Session session(env, modelPath.c_str(), options);

    // Constant dummy input on the host; ORT copies it to the device inside Run().
    std::array<float, 1 * 3 * 224 * 224> input_data;
    input_data.fill(150.0f);
    std::vector<int64_t> input_dims = {1, 3, 224, 224};

    auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        memory_info, input_data.data(), input_data.size(),
        input_dims.data(), input_dims.size());

    const char* inputNames[] = {"input"};
    const char* outputNames[] = {"output"};

    std::ofstream f("mesureonnx.txt");
    for (int i = 0; i < 500; ++i)
    {
        // Only session.Run() is inside the timed block.
        auto startGPU = std::chrono::high_resolution_clock::now();
        auto output_tensors = session.Run(Ort::RunOptions{nullptr}, inputNames,
                                          &input_tensor, 1, outputNames, 1);
        auto endGPU = std::chrono::high_resolution_clock::now();

        auto durationGPU = std::chrono::duration_cast<std::chrono::nanoseconds>(endGPU - startGPU);
        f << i << " -- GPU inference duration : " << durationGPU.count() << " ns" << std::endl;
    }
}
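One aside: the very first Run() on the CUDA provider also pays a one-time initialization cost (CUDA context creation, cuDNN setup), which is separate from the periodic stalls but widens the spread. A hypothetical warm-up helper, reusing the same Run() call as above and invoked once before the timed loop, could look like:

// Hypothetical helper (not in the original post): issue a few untimed Run()
// calls so one-time CUDA/cuDNN initialization stays out of the measurements.
static void warm_up(Ort::Session &session, const Ort::Value &input_tensor,
                    const char *const *inputNames, const char *const *outputNames,
                    int iterations = 10)
{
    for (int i = 0; i < iterations; ++i)
        session.Run(Ort::RunOptions{nullptr}, inputNames, &input_tensor, 1, outputNames, 1);
}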
Looking at the traces with Perfetto and nvvp, I've also noticed that some inferences are much slower than the rest, and they occur at the same task IDs in every test.
I've also just tried the CPU provider, and there too the CPU appears to be waiting for something.
void onnx_benchmark_CPU(std::string &filePath, std::string &modelPath,
                        std::string &inputTensorName, std::string &outputTensorName, int batch)
{
    // Constant dummy input and the expected classifier output shape.
    std::vector<float> image(batch * 3 * 224 * 224, 150.0f);
    std::vector<int64_t> inputDims = {batch, 3, 224, 224};
    std::vector<int64_t> outputDims = {batch, 1000};

    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "InferenceCPU");

    Ort::SessionOptions sessionOptions;
    sessionOptions.EnableProfiling("cpu_profile_file");
    sessionOptions.SetIntraOpNumThreads(1);

    Ort::Session session(env, modelPath.c_str(), sessionOptions);

    auto memoryInfo = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value inputTensor = Ort::Value::CreateTensor<float>(
        memoryInfo, image.data(), image.size(), inputDims.data(), inputDims.size());

    const char* inputNames[] = {inputTensorName.c_str()};
    const char* outputNames[] = {outputTensorName.c_str()};

    std::ofstream file(filePath, std::ios::app);
    int nbIterations = 800;
    for (int i = 0; i < nbIterations; i++)
    {
        // Only session.Run() is inside the timed block.
        auto startCPU = std::chrono::high_resolution_clock::now();
        auto outputTensors = session.Run(Ort::RunOptions{nullptr}, inputNames,
                                         &inputTensor, 1, outputNames, 1);
        auto endCPU = std::chrono::high_resolution_clock::now();

        auto durationCPU = std::chrono::duration_cast<std::chrono::nanoseconds>(endCPU - startCPU);
        std::cout << "CPU inference duration : " << durationCPU.count() << " ns" << std::endl;

        // Skip the first (warm-up) iteration when recording results.
        if (i > 0)
            file << batch << " " << durationCPU.count() << "\n";
    }
    file.close();
}
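To quantify how often these outliers occur, the recorded durations can be summarized offline. A minimal standalone sketch (it assumes the two-column "batch duration_ns" format written by the loop above; the file name passed on the command line is whatever was given as filePath):

// Hedged sketch: summarize per-iteration durations written by the benchmark
// above (two columns per line: batch size, duration in nanoseconds).
#include <algorithm>
#include <fstream>
#include <iostream>
#include <vector>

int main(int argc, char **argv)
{
    std::ifstream in(argc > 1 ? argv[1] : "mesure.txt");  // default name is an assumption
    long long batch = 0, ns = 0;
    std::vector<long long> durations;
    while (in >> batch >> ns)
        durations.push_back(ns);
    if (durations.empty())
        return 1;

    std::sort(durations.begin(), durations.end());
    auto pct = [&](double p) { return durations[static_cast<size_t>(p * (durations.size() - 1))]; };
    std::cout << "n=" << durations.size()
              << " median=" << pct(0.5)
              << " p95=" << pct(0.95)
              << " p99=" << pct(0.99)
              << " max=" << durations.back() << " ns\n";
    return 0;
}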
Describe the issue
Some inferences (around task IDs 250, 450, 800 and 1700) cost much more than others. During those iterations the GPU appears to do nothing and sits idle. I have the same problem on P100 and RTX 8000 GPUs, and with both the AlexNet and GoogleNet models.
Perhaps this is related to this discussion: #14023?
I also see these idle periods with onnxruntime_perf_test, using the command:
./onnxruntime_perf_test -I -S 1 -e cuda -r 2048 -p profile.json -s /data/model/googlenet/dynamic_batch_googlenet_opt.onnx
To reproduce
Urgency
No response
Platform
Linux
OS Version
CentOS Linux release 7.6.1810 (Core)
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
ONNX Runtime 1.15.0
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.7
Model File
No response
Is this a quantized model?
Unknown