mlcommons / inference_results_v2.0

This repository contains the results and code for the MLPerf™ Inference v2.0 benchmark.
https://mlcommons.org/en/inference-datacenter-20/
Apache License 2.0

GPU warmup phase based on number of kernels #17

Closed mahmoodn closed 1 year ago

mahmoodn commented 1 year ago

I see that the warmup phase in the Nvidia code is based on seconds, e.g. in the 3d-unet code:

    // Perform a brief warmup
    std::cout << "Starting warmup. Running for a minimum of " << FLAGS_warmup_duration
              << " seconds." << std::endl;
    auto tStart = std::chrono::high_resolution_clock::now();
    sut->Warmup(FLAGS_warmup_duration);
    double elapsed =
        std::chrono::duration<float>(std::chrono::high_resolution_clock::now() - tStart).count();
    std::cout << "Finished warmup. Ran for " << elapsed << "s." << std::endl;

So in the output I see

Starting warmup. Running for a minimum of 5 seconds.

This creates some problems when I want to do detailed analysis. For example, with Nsight Compute, I see:

Starting warmup. Running for a minimum of 5 seconds.
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 0 (1/113183): 0%....50%....100% - 1 pass
...
==PROF== Profiling "conv3d_1x1x1_k4" - 119 (120/113183): 0%....50%....100% - 1 pass
Finished warmup. Ran for 6.16464s.
Starting running actual test.

That means 120 kernels were profiled with the two metrics I supplied (note the single profiler pass per kernel). However, if I supply more metrics, which slows down profiling, the warmup finishes at a different kernel number. In the output below, note that each profiled kernel needed 5 passes.

Starting warmup. Running for a minimum of 5 seconds.
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 0 (1/113183): 0%....50%....100% - 5 passes
...
==PROF== Profiling "conv3d_1x1x1_k4" - 59 (60/113183): 0%....50%....100% - 5 passes
Finished warmup. Ran for 77.537s.
Starting running actual test.

Is there a way to make the warmup phase run for a fixed number of kernels instead? The current behavior makes detailed analysis inconsistent, since the set of kernels covered by warmup changes with profiling overhead.