starstream opened 2 months ago
We need to write some scripts for that manually. I will put our plotting scripts in this repo later. If you are trying to draw this figure for your own project, you will need to write your own scripts, since they rely on the NVTX events in the code. Here are all the details.
So what we want here is to find out all the CUDA kernels for each F/B/W/Optimizer so that we can get the GPU start/end time for it.
The main workflow is:

1. Run `nsys profile` to do the profiling and get the `.nsys-rep` files. The CUDA and NVTX events will be recorded in these files.
2. Export the `.nsys-rep` files to sqlite files and process them with SQL plus scripts.

The 2nd step is the most tricky part. I will break it down piece by piece.
Technically, there are 3 kinds of events involved: NVTX events pushed from our code, CUDA API calls on the CPU, and CUDA kernels on the GPU.
In our code, when profiling of Megatron is turned on, `torch.cuda.nvtx.range_push`/`range_pop` are used to mark the start/end time of F, B, W, and Optimizer on the CPU. Through each NVTX event's duration, we can find all the related CUDA calls on the CPU, and then the CUDA kernels on the GPU. With this NVTX → kernels mapping, we know the start/end time of each F/B/W/Optimizer function on the GPU.
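A sketch of how that marking works: a hypothetical `nvtx_range` helper around `torch.cuda.nvtx.range_push`/`range_pop` (Megatron's actual integration differs in detail; the fallback just lets the sketch run without PyTorch or NVTX):

```python
import contextlib

def _noop(*args):
    pass

try:
    import torch
    _push, _pop = torch.cuda.nvtx.range_push, torch.cuda.nvtx.range_pop
except ImportError:  # no PyTorch installed: fall back to no-ops
    _push = _pop = _noop

@contextlib.contextmanager
def nvtx_range(name):
    """Mark a named range on the CPU timeline; nsys records its start/end."""
    try:
        _push(name)
    except Exception:  # e.g. CPU-only torch build without NVTX support
        pass
    try:
        yield
    finally:
        try:
            _pop()
        except Exception:
            pass

# Usage: the range text encodes phase and microbatch, e.g. "F0", "B0", "W0".
with nvtx_range("F0"):
    pass  # forward pass for microbatch 0 launches its CUDA kernels here
```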
So first things first, launch Megatron to do the training and collect the profiling data with the `nsys` command:

```
nsys profile -s none -t nvtx,cuda <command to run Megatron, e.g. torchrun ...>
```
You may want to add more options to nsys like the following depending on your usage:
```
--capture-range=cudaProfilerApi --capture-range-end=stop
```
Now you have the `.nsys-rep` files. Export them to sqlite files using `nsys` or Nsight Systems:

```
nsys export --type sqlite --output <sqlite file> <nsys-rep file>
```
Now the heavy lifting part.
There are 3 tables corresponding to the 3 kinds of events: `NVTX_EVENTS`, `CUPTI_ACTIVITY_KIND_RUNTIME` (CPU-side CUDA API calls), and `CUPTI_ACTIVITY_KIND_KERNEL` (GPU kernels). You can use the `.schema` command in sqlite3 to check these tables out.
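If you'd rather inspect the export from Python, the stdlib `sqlite3` module can pull the same schema info; `pp_profing.sqlite` below is just a placeholder file name:

```python
import sqlite3

TABLES = ("NVTX_EVENTS", "CUPTI_ACTIVITY_KIND_RUNTIME", "CUPTI_ACTIVITY_KIND_KERNEL")

def dump_schema(sqlite_path):
    """Return (name, CREATE statement) for the three event tables we care about."""
    conn = sqlite3.connect(sqlite_path)
    try:
        placeholders = ",".join("?" * len(TABLES))
        return conn.execute(
            "SELECT name, sql FROM sqlite_master "
            f"WHERE type = 'table' AND name IN ({placeholders})", TABLES
        ).fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for name, sql in dump_schema("pp_profing.sqlite"):
        print(name, sql, sep="\n")
```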
What we are basically going to do is to use SQL and script codes to reconstruct the mapping between:
NVTX events → CPU CUDA calls → GPU kernels
This is pretty straightforward: there's a `correlationId` column in both `CUPTI_ACTIVITY_KIND_KERNEL` and `CUPTI_ACTIVITY_KIND_RUNTIME`, so we can simply join these 2 tables. But there are still some details to take note of.
```sql
SELECT runtime.correlationId, kernel.start, kernel.end, runtime.start, runtime.end,
       kernel.deviceId, kernel.shortName, runtime.nameId, runtime.globalTid
FROM CUPTI_ACTIVITY_KIND_KERNEL AS kernel, CUPTI_ACTIVITY_KIND_RUNTIME AS runtime
WHERE runtime.correlationId = kernel.correlationId
  AND kernel.globalPid / 0x1000000 % 0x1000000 = runtime.globalTid / 0x1000000 % 0x1000000
  AND runtime.globalTid IN (
    SELECT DISTINCT globalTid FROM NVTX_EVENTS
    WHERE text LIKE "F%" OR text LIKE "B%" OR text LIKE "W%" OR text = "Optimizer"
  );
```
Firstly, different processes may generate the same `correlationId` value, so we also need to take the PID into account when joining the records. The PID is `globalTid / 0x1000000 % 0x1000000` and `globalPid / 0x1000000 % 0x1000000` according to the NVIDIA docs.
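In other words, both ids pack the PID into the same 24-bit field, so extracting it is plain integer arithmetic, e.g.:

```python
# Per the NVIDIA docs, globalTid packs <hardware id><PID><TID> and globalPid
# packs <hardware id><PID><zeros>, each field 24 bits wide, so the PID sits
# in bits 24..47 of both values.
def pid_of(global_id):
    return (global_id >> 24) & 0xFFFFFF  # same as // 0x1000000 % 0x1000000

# A runtime row and a kernel row from the same process must agree
# (the packed values below are made up for illustration):
global_tid = (7 << 48) | (1234 << 24) | 42
global_pid = (7 << 48) | (1234 << 24)
assert pid_of(global_tid) == pid_of(global_pid) == 1234
```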
Secondly, we are only interested in the threads where our NVTX events happen, so we need to filter them through `globalTid`.
Thirdly, `kernel.shortName` and `runtime.nameId` are just ids, not strings. We need to look the strings up in the `StringIds` table by these ids. I prefer to do this in the script, not in SQL.
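A sketch of that lookup done in the script, on an in-memory stand-in for the export (the demo rows are made up; `StringIds` in the export maps an integer `id` to a `value` string):

```python
import sqlite3

def load_string_ids(conn):
    """Build {id: value} from StringIds once, then resolve names in Python."""
    return dict(conn.execute("SELECT id, value FROM StringIds"))

# Demo on an in-memory database shaped like the export.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO StringIds VALUES (?, ?)",
                 [(1, "ampere_fp16_gemm"), (2, "cudaLaunchKernel")])
strings = load_string_ids(conn)
print(strings[1])  # e.g. a kernel.shortName id resolved to its string
```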
Now comes the hard part: there's no direct connection between NVTX events and CUDA calls. The only thing we can rely on is the start/end time of the NVTX events and CUDA calls. A CUDA call belongs to an NVTX event only if the call happens inside the event in the same thread. More precisely, since NVTX events can be stacked, we normally only care about the stack-top NVTX event.
For example, if we have 3 ranges A, B, and C in the same thread:

```
AAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBB     CCC
   ^ CUDA call
Time --->
```

Range B is likely the F/B/W NVTX event we're looking for.
The SQL to pull out all the related NVTX events would be:

```sql
SELECT start, end, globalTid, text FROM NVTX_EVENTS
WHERE (eventType = 35 OR eventType = 36)
  AND (text LIKE "F%" OR text LIKE "B%" OR text LIKE "W%" OR text = "Optimizer");
```

(Note the parentheses around the eventType test: `AND` binds tighter than `OR`, so without them the text filter would not apply to eventType 35.) The event types 35 and 36 here are NvtxPushPopRange and NvtxStartEndRange, which can be found in the table ENUM_NSYS_EVENT_TYPE. I'm not sure whether these values will always stay constant across nsys versions.
Then we need to write a script to match the NVTX events and CUDA API calls by checking the time ranges and thread ids. Since the data can be large, you may need to optimize the matching a bit using binary search or similar.
Now you know, for each NVTX event (F/B/W/Optimizer in our project), which CUDA kernels belong to it. The GPU start/end time of the event is simply the start time of its first kernel and the end time of its last kernel. We can also get the `deviceId` from the table CUPTI_ACTIVITY_KIND_KERNEL.
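A minimal sketch of that matching plus the first/last-kernel aggregation, under two assumptions I'm adding: the filtered ranges don't overlap within a thread (they are the stack-top F/B/W/Optimizer ranges), and rows are plain tuples rather than whatever your script actually loads:

```python
import bisect
from collections import defaultdict

def match_calls_to_ranges(nvtx_ranges, cuda_calls):
    """nvtx_ranges: (start, end, tid, text) rows; cuda_calls: (start, end, tid,
    correlationId) rows.  Assumes the filtered ranges do not overlap within a
    thread, so a binary search over their start times suffices."""
    by_tid = defaultdict(list)
    for rng in nvtx_ranges:
        by_tid[rng[2]].append(rng)
    starts = {}
    for tid, ranges in by_tid.items():
        ranges.sort()
        starts[tid] = [r[0] for r in ranges]
    mapping = defaultdict(list)  # (tid, text, range start) -> correlationIds
    for c_start, c_end, tid, corr in cuda_calls:
        ranges = by_tid.get(tid)
        if not ranges:
            continue
        i = bisect.bisect_right(starts[tid], c_start) - 1  # last range starting <= call
        if i >= 0 and c_end <= ranges[i][1]:  # call lies fully inside that range
            mapping[(tid, ranges[i][3], ranges[i][0])].append(corr)
    return mapping

def gpu_span(kernel_rows):
    """GPU start/end of one NVTX event from its kernels' (start, end) pairs."""
    return min(k[0] for k in kernel_rows), max(k[1] for k in kernel_rows)
```

On real data you would feed in the NVTX rows from the SQL above and the runtime rows from the kernel/runtime join, then resolve each correlationId back to its kernel rows to compute the GPU span.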
I would like to thank Jay and Holly from Nvidia Support for helping us figure all this out.
We use Python and the `drawsvg` library for plotting. You can refer to the code in our playground.
If you are running on multiple servers, you will need to combine the sqlite files from each server.
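One way to sketch that combining step: process each server's export separately and tag every event with the file it came from so the per-server timelines stay distinguishable (the naming scheme and the `load_events` callback are assumptions, not the repo's actual interface):

```python
import glob
import os

def gather_events(sqlite_dir, load_events):
    """load_events(path) -> list of event dicts for one export; we tag each
    event with its source file before merging everything into one list."""
    merged = []
    for path in sorted(glob.glob(os.path.join(sqlite_dir, "*.sqlite"))):
        for ev in load_events(path):
            ev["server"] = os.path.basename(path)  # hypothetical naming scheme
            merged.append(ev)
    return merged
```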
I ran it with the following commands:

```
nsys profile -s none -t nvtx,cuda -o pp_profing.nsys-rep -f true --capture-range=cudaProfilerApi --capture-range-end=stop \
    torchrun $DISTRIBUTED_ARGS ../pretrain_gpt.py \
    ...
    --fp16 \
    --use-flash-attn \
    --use-distributed-optimizer \
    --profile

nsys export --type sqlite --output pp_profing.sqlite pp_profing_nofa.nsys-rep
python load_nsys_events.py ./
```
However, the following error occurs:

```
no nvtx event found for kernel DeviceCompactInitKernel
no nvtx event found for kernel DeviceSelectSweepKernel
.....
no nvtx event found for kernel DeviceCompactInitKernel
no nvtx event found for kernel DeviceCompactInitKernel
no nvtx event found for kernel DeviceSelectSweepKernel
Traceback (most recent call last):
  File "/nsys/load_nsys_events.py", line 431, in <module>
    main(sqlite_dir, sqlite_dir_other, True, True)
  File "/nsys/load_nsys_events.py", line 397, in main
    nvtx_kernels_map = create_nvtx_kernels_map(sqlite_files, ignore_comm, 0)
  File "/nsys/load_nsys_events.py", line 363, in create_nvtx_kernels_map
    server_events.append((server_sort_key(m), m))
  File "/nsys/load_nsys_events.py", line 230, in server_sort_key
    f_num = first_dev_f_num(nvtx_event_map)
  File "/nsys/load_nsys_events.py", line 271, in first_dev_f_num
    assert dev_id is not None
AssertionError
```
Is there a problem with my execution?
It seems like the events that the script depends on are missing. This script is only supported when zero-bubble or zero-bubble-v is enabled.
Your question
How can I profile bubble time in pipeline parallelism?