iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0
2.6k stars 583 forks source link

DeferredReleases adds 50ms to the latency reported by iree-benchmark-module on Android #3939

Closed bjacob closed 3 years ago

bjacob commented 3 years ago

Found this accidentally while staring at Tracy traces (so count this as possibly the first use of Tracy on Android!)

In hal_module.cc, in ExSubmitAndWait, the zone labelled HALModuleState::DeferredReleases is taking 50ms, adding 50ms to the latency reported by iree-benchmark-module on the perf burndown workload. This diff,

--- a/iree/modules/hal/hal_module.cc
+++ b/iree/modules/hal/hal_module.cc
@@ -158,7 +158,7 @@ class HALModuleState final {

     {
       IREE_TRACE_SCOPE0("HALModuleState::DeferredReleases");
-      for (auto& ref : deferred_releases_) {
+      if (0) for (auto& ref : deferred_releases_) {
         iree_vm_ref_release(&ref);
       }
       deferred_releases_.clear();

improves the reported latency (MobileBert on Pixel4 core 7) from 920 ms down to 870 ms.

The tracy trace shows what's happening: each of the iterations of the for loop above is performing some free(), which you would think should be cheap, but there are many of them, and on Android, free calls madvise to hint the OS to release memory immediately.

What would be an appropriate way to avoid performing so many free calls during shutdown?

bjacob commented 3 years ago

@benvanik would it make sense, maybe as part of the ongoing work on memory allocation, to make it so that these small allocations are into a single arena so that on shutdown we just need to deallocate that arena?

bjacob commented 3 years ago

It looks like this is taken care of already by @asaadaldien 's https://github.com/google/iree/tree/ataei-pre_allocation:

Baseline: 920 ms With my patch above: 870 ms With Ahmed's pre_allocation: 820 ms With both my patch above and Ahmed's pre_allocation: still 820 ms

Confirm there's indeed nothing more to this / close this issue when pre_allocation gets merged?

asaadaldien commented 3 years ago

Make sense! if free is releasing host buffers here, it will have 0 cost in pre_allocation branch

bjacob commented 3 years ago

indeed, that's what free was doing here. please mark your own branch as fixing this issue.

benvanik commented 3 years ago

This is expected: it's the other half of the allocations (gotta free your memory sometime) and what I'll be fixing next week by not allocating the memory in the first place :)

Dupe of #1888.