google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0
27.36k stars 5.14k forks

The Tensor::ReadBackGpuToCpu function is slow #5694

Open zaykl opened 6 days ago

zaykl commented 6 days ago

Hi

When I run the holistic profiling test on a Jetson Orin, I found that this function (which syncs OpenGL GPU memory to CPU memory) executes very slowly. Calculators such as TensorsToFloatsCalculator and TensorsToClassificationCalculator all need to run ReadBackGpuToCpu.

https://github.com/google-ai-edge/mediapipe/blob/214f44113e46505bb0bffbc29f01a76bb107a146/mediapipe/framework/formats/tensor.cc#L595
#if MEDIAPIPE_OPENGL_ES_VERSION >= MEDIAPIPE_OPENGL_ES_30
#if MEDIAPIPE_OPENGL_ES_VERSION >= MEDIAPIPE_OPENGL_ES_31
  // TODO: we cannot just grab the GL context's lock while holding
  // the view mutex here.
  if (valid_ & kValidOpenGlBuffer) {
    gl_context_->Run([this]() {
      glBindBuffer(GL_SHADER_STORAGE_BUFFER, opengl_buffer_);
      const void* ptr = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, bytes(),
                                         GL_MAP_READ_BIT);
      std::memcpy(cpu_buffer_, ptr, bytes());
      glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
    });
    return absl::OkStatus();
  }
#endif  // MEDIAPIPE_OPENGL_ES_VERSION >= MEDIAPIPE_OPENGL_ES_31

So I found that handling one frame takes 40 ms, and this function alone takes 30 ms of that. Is there any way to make this function run faster?

kuaashish commented 5 days ago

Hi @zaykl,

Currently, Jetson is not officially supported. As outlined in our documentation here, the only edge device we support is the Raspberry Pi 64. Therefore, we are unable to offer support for Jetson at this time, and GPU support is limited to the Pip package on macOS and standard Ubuntu.

However, you might find the community plugin (https://github.com/anion0278/mediapipe-jetson), which supports Jetson and GPU, to be useful. Please note that this plugin is based on legacy solutions.

Thank you!!

zaykl commented 5 days ago


Hi @kuaashish

Yes, I have tested this community plugin. That project only covers building TensorFlow with CUDA (https://github.com/google-ai-edge/mediapipe/blob/v0.10.14/docs/getting_started/gpu_support.md). As far as I know, the holistic graph only supports TFLite models, and I can't convert a TFLite model to a TensorFlow .pb file. But anyway, thanks for your suggestion.

kuaashish commented 4 days ago

Hi @zaykl,

Our team has reviewed the request, and unfortunately, it is not currently possible to convert the TFLite model to a TensorFlow .pb file. This issue has already been raised, as seen here: https://github.com/google-ai-edge/mediapipe/issues/5630. Regrettably, we are unable to assist further with this matter.

Thank you!!

tyrmullen commented 3 days ago

While we cannot assist you, one piece of advice I can offer is that you are likely profiling in the wrong place.

Synchronously reading back from GPU to CPU usually means that the CPU thread waits until all previously enqueued GPU work has been finished. So if you're using CPU-based profiling methods, it will look like this one "read back" call takes a very long time, when in fact what's really happening is that it's actually doing nothing and just waiting for a long time until the GPU is finished. Therefore, you'll probably need to profile the GPU work to figure out what's actually taking a long time and work towards speeding things up.

TL;DR: I would suspect that the actual reading back from GPU to CPU is very fast, but it looks like it's taking forever because you're also timing how long it just waits (doing nothing) for GPU to finish, so you should instead be profiling GPU, not CPU.

Hope that helps!

kuaashish commented 3 days ago

Hi @tyrmullen,

Thank you for adding more pointers. @zaykl, could you please review the information above and let us know whether it helps in your situation?

Thank you!!

zaykl commented 2 days ago

Yes, it's helpful for me. I tried running the Jetson Orin CPU in performance mode, and the GPU-to-CPU memory sync time dropped to 15 ms.

https://docs.nvidia.com/jetson/archives/r35.4.1/DeveloperGuide/text/SD/PlatformPowerAndPerformance/JetsonOrinNanoSeriesJetsonOrinNxSeriesAndJetsonAgxOrinSeries.html#supported-modes-and-power-efficiency https://forums.developer.nvidia.com/t/technical-problems-with-the-jetson-orin-nx-16gb/272462
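For reference, the mode switch described above is typically done with nvpmodel and jetson_clocks; a possible sequence is sketched below (the exact mode IDs vary by board and L4T release, so verify against the table in the NVIDIA documentation linked above before assuming mode 0 is the maximum-performance one):

```shell
# List the current power mode (available mode IDs are board-specific).
sudo nvpmodel -q

# Select a maximum-performance mode (mode 0 on many Orin configurations;
# check the supported-modes table for your board first).
sudo nvpmodel -m 0

# Pin CPU/GPU/EMC clocks to their maximums for the selected mode.
sudo jetson_clocks
```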