tensorflow / profiler

A profiling and performance analysis tool for TensorFlow
Apache License 2.0

Increasing granularity of trace #29

Open mossjacob opened 4 years ago

mossjacob commented 4 years ago

[image: trace] [image: trace1]

Here I have part of the trace of my model, which contains some custom MCMC samplers along with a No-U-Turn Sampler from TFP. I'm trying to diagnose why running on the GPU takes so much longer than on the CPU, and I'm wondering if there's a way of getting more precise information about what is being processed. In the image, the longest blocks don't give any information about what specifically is going on that makes them take so long.

Here is the full log: 20200527-152907.zip (https://github.com/tensorflow/profiler/files/4691555/20200527-152907.zip)

Furthermore, when I run the same code on just the CPU, the trace no longer shows the mcmc_sample_chain blocks: [image: trace2]

Why is there such a big difference between the GPU and CPU trace? Can I get more specific information for the GPU trace?
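As background for readers trying to reproduce this: the amount of detail in a trace can be influenced when capturing the profile programmatically. A minimal sketch, assuming TF 2.x and using `tf.profiler.experimental` (the `"logdir"` path is a placeholder; the original log was captured some other way):

```python
import tensorflow as tf

# Higher tracer levels record more detail in the trace
# (see tf.profiler.experimental.ProfilerOptions).
options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=3,   # most verbose host-side (CPU) events
    python_tracer_level=1, # include Python call-stack events
    device_tracer_level=1, # enable GPU kernel events
)

tf.profiler.experimental.start("logdir", options=options)
# ... run the sampling code to be profiled here ...
tf.profiler.experimental.stop()
```

The resulting `logdir` can then be opened in TensorBoard's Profile tab to view the trace.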

ckluk commented 4 years ago

Hi, I looked at your profile log. From the Overview Page result (see below), ~86% of your step time is spent on "Kernel Launch" and ~8% is spent on "All Other" time. This probably means that the ops you execute on the GPU are so tiny that the overhead of launching them dominates. You may need to increase the execution granularity on the GPU by "fusing" multiple kernels together. Hope this helps, -ck

[image: image.png]
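One common way to get the "fusing" ckluk describes is XLA compilation. This is a sketch under that assumption (the function names are illustrative, not from the thread); in recent TF versions `tf.function` accepts `jit_compile=True`:

```python
import tensorflow as tf

# Without XLA, each small elementwise op below can launch
# its own GPU kernel, paying launch overhead per op.
@tf.function
def unfused_step(x):
    return tf.sin(x) + tf.cos(x) * tf.exp(-x)

# With jit_compile=True, XLA can fuse these elementwise ops
# into a single kernel, amortizing the launch overhead.
@tf.function(jit_compile=True)
def fused_step(x):
    return tf.sin(x) + tf.cos(x) * tf.exp(-x)

x = tf.random.normal([1024])
```

Both versions compute the same result; the difference shows up in the trace as fewer, larger kernels on the GPU timeline.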


mossjacob commented 4 years ago

Thanks @ckluk, but is there a way to see more information for the blocks after around 25 seconds? Beyond that point the trace doesn't give any information about what's going on.

ckluk commented 4 years ago

The long spans that you see after 25 secs are simply scopes that group a bunch of related activities. Usually you will see the individual activities spread over many threads (i.e., rows in the trace viewer).


ydennisy commented 3 years ago

@ckluk how does one "fuse" multiple kernels together, especially when using something like Keras?
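(For later Keras versions, one possible approach, not stated in this thread: `Model.compile` accepts a `jit_compile` argument in newer TF releases, roughly TF 2.5+, which asks Keras to compile its train/eval/predict functions with XLA. A minimal sketch, with a hypothetical toy model:)

```python
import tensorflow as tf

# Toy model; the architecture here is purely illustrative.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1),
])

# jit_compile=True enables XLA compilation of the Keras step
# functions, which can fuse many small kernels into larger ones
# (assumes a TF version where compile() accepts jit_compile).
model.compile(optimizer="adam", loss="mse", jit_compile=True)
```

Whether this actually helps depends on the model; it is worth re-profiling after enabling it to confirm the "Kernel Launch" fraction drops.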