starstream opened 2 months ago
We need to write some scripts for that manually. I will put our plotting scripts in this repo later. If you are trying to draw this figure for your own project, you will need to write your own scripts, since they rely on the NVTX events in the code. Here are all the details.
So what we want here is to find out all the CUDA kernels for each F/B/W/Optimizer so that we can get the GPU start/end time for it.
The main workflow is:

1. Run `nsys profile` to do the profiling and get the `.nsys-rep` files. The CUDA and NVTX events will be recorded in these files.
2. Export the `.nsys-rep` files to sqlite files and process them with SQL plus scripts.

The 2nd step is the most tricky part. I will break it down piece by piece.
Technically, there are 3 kinds of events involved: NVTX events pushed from our code, CUDA API calls on the CPU, and CUDA kernels on the GPU.
In our code, when profiling of Megatron is turned on, `torch.cuda.nvtx.range_push`/`range_pop` are used to mark the start/end time of F, B, W, and Optimizer on the CPU. Through each NVTX event's duration, we can find all the related CUDA calls on the CPU, and then the CUDA kernels on the GPU. With this NVTX → kernels mapping, we know the start/end time of each F/B/W/Optimizer function on the GPU.
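A sketch of how that marking works: a hypothetical `nvtx_range` helper around `torch.cuda.nvtx.range_push`/`range_pop` (Megatron's actual integration differs in detail; the fallback just lets the sketch run without PyTorch or NVTX):

```python
import contextlib

def _noop(*args):
    pass

try:
    import torch
    _push, _pop = torch.cuda.nvtx.range_push, torch.cuda.nvtx.range_pop
except ImportError:  # no PyTorch installed: fall back to no-ops
    _push = _pop = _noop

@contextlib.contextmanager
def nvtx_range(name):
    """Mark a named range on the CPU timeline; nsys records its start/end."""
    try:
        _push(name)
    except Exception:  # e.g. CPU-only torch build without NVTX support
        pass
    try:
        yield
    finally:
        try:
            _pop()
        except Exception:
            pass

# Usage: the range text encodes phase and microbatch, e.g. "F0", "B0", "W0".
with nvtx_range("F0"):
    pass  # forward pass for microbatch 0 launches its CUDA kernels here
```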
So first things first, launch Megatron to do the training and collect the profiling data with the `nsys` command:

```
nsys profile -s none -t nvtx,cuda <command to run Megatron, e.g. torchrun ...>
```
You may want to add more options to nsys like the following depending on your usage:
```
--capture-range=cudaProfilerApi --capture-range-end=stop
```
Now you have the `.nsys-rep` files. Export them to sqlite files using `nsys` or Nsight Systems:

```
nsys export --type sqlite --output <sqlite file> <nsys-rep file>
```
Now the heavy lifting part.
There are 3 tables corresponding to the 3 kinds of events: `NVTX_EVENTS`, `CUPTI_ACTIVITY_KIND_RUNTIME` (CPU-side CUDA API calls), and `CUPTI_ACTIVITY_KIND_KERNEL` (GPU kernels). You can use the `.schema` command in sqlite3 to check these tables out.
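If you'd rather inspect the export from Python, the stdlib `sqlite3` module can pull the same schema info; `pp_profing.sqlite` below is just a placeholder file name:

```python
import sqlite3

TABLES = ("NVTX_EVENTS", "CUPTI_ACTIVITY_KIND_RUNTIME", "CUPTI_ACTIVITY_KIND_KERNEL")

def dump_schema(sqlite_path):
    """Return (name, CREATE statement) for the three event tables we care about."""
    conn = sqlite3.connect(sqlite_path)
    try:
        placeholders = ",".join("?" * len(TABLES))
        return conn.execute(
            "SELECT name, sql FROM sqlite_master "
            f"WHERE type = 'table' AND name IN ({placeholders})", TABLES
        ).fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for name, sql in dump_schema("pp_profing.sqlite"):
        print(name, sql, sep="\n")
```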
What we are basically going to do is to use SQL and script codes to reconstruct the mapping between:
NVTX events → CPU CUDA calls → GPU kernels
This is pretty straightforward: there's a `correlationId` column in both `CUPTI_ACTIVITY_KIND_KERNEL` and `CUPTI_ACTIVITY_KIND_RUNTIME`, so we can simply join these 2 tables. But there are still some details to take note of.
```sql
SELECT runtime.correlationId, kernel.start, kernel.end, runtime.start, runtime.end,
       kernel.deviceId, kernel.shortName, runtime.nameId, runtime.globalTid
FROM CUPTI_ACTIVITY_KIND_KERNEL AS kernel, CUPTI_ACTIVITY_KIND_RUNTIME AS runtime
WHERE runtime.correlationId = kernel.correlationId
  AND kernel.globalPid / 0x1000000 % 0x1000000 = runtime.globalTid / 0x1000000 % 0x1000000
  AND runtime.globalTid IN (
    SELECT DISTINCT globalTid FROM NVTX_EVENTS
    WHERE text LIKE "F%" OR text LIKE "B%" OR text LIKE "W%" OR text = "Optimizer"
  );
```
Firstly, different processes may generate the same `correlationId` value, so we also need to take the PID into account when joining the records. The PID is `globalTid / 0x1000000 % 0x1000000` and `globalPid / 0x1000000 % 0x1000000` according to the NVIDIA docs.
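In other words, both ids pack the PID into the same 24-bit field, so extracting it is plain integer arithmetic, e.g.:

```python
# Per the NVIDIA docs, globalTid packs <hardware id><PID><TID> and globalPid
# packs <hardware id><PID><zeros>, each field 24 bits wide, so the PID sits
# in bits 24..47 of both values.
def pid_of(global_id):
    return (global_id >> 24) & 0xFFFFFF  # same as // 0x1000000 % 0x1000000

# A runtime row and a kernel row from the same process must agree
# (the packed values below are made up for illustration):
global_tid = (7 << 48) | (1234 << 24) | 42
global_pid = (7 << 48) | (1234 << 24)
assert pid_of(global_tid) == pid_of(global_pid) == 1234
```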
Secondly, we are only interested in the threads where our NVTX events happen, so we need to filter them through `globalTid`.
Thirdly, `kernel.shortName` and `runtime.nameId` are just ids, not strings. We need to look the strings up in the `StringIds` table by these ids. I prefer to do this in the script, not in SQL.
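A sketch of that lookup done in the script, on an in-memory stand-in for the export (the demo rows are made up; `StringIds` in the export maps an integer `id` to a `value` string):

```python
import sqlite3

def load_string_ids(conn):
    """Build {id: value} from StringIds once, then resolve names in Python."""
    return dict(conn.execute("SELECT id, value FROM StringIds"))

# Demo on an in-memory database shaped like the export.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO StringIds VALUES (?, ?)",
                 [(1, "ampere_fp16_gemm"), (2, "cudaLaunchKernel")])
strings = load_string_ids(conn)
print(strings[1])  # e.g. a kernel.shortName id resolved to its string
```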
Now comes the hard part: there's no direct connection between NVTX events and CUDA calls. The only thing we can rely on is the start/end time of the NVTX events and CUDA calls. A CUDA call belongs to an NVTX event only if the call happens inside the event in the same thread. More precisely, since NVTX events can be stacked, we normally only care about the stack-top NVTX event.
For example, if we have 3 ranges A, B, and C in the same thread:

```
AAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBB     CCC
   ^ CUDA call
Time --->
```

Range B is likely the F/B/W NVTX event we're looking for.
The SQL to pull out all the related NVTX events would be:

```sql
SELECT start, end, globalTid, text FROM NVTX_EVENTS
WHERE (eventType = 35 OR eventType = 36)
  AND (text LIKE "F%" OR text LIKE "B%" OR text LIKE "W%" OR text = "Optimizer");
```

(Note the parentheses around the eventType test: `AND` binds tighter than `OR`, so without them the text filter would not apply to eventType 35.) The event types 35 and 36 here are NvtxPushPopRange and NvtxStartEndRange, which can be found in the table ENUM_NSYS_EVENT_TYPE. I'm not sure whether these values will always stay constant across nsys versions.
Then we need to write a script to match the NVTX events and CUDA API calls by checking the time ranges and thread ids. Since the data can be large, you may need to optimize the matching a bit using binary search or similar.
Now you know, for each NVTX event (F/B/W/Optimizer in our project), which CUDA kernels belong to it. The GPU start/end time of the event is simply the start time of its first kernel and the end time of its last kernel. We can also get the `deviceId` from the table CUPTI_ACTIVITY_KIND_KERNEL.
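A minimal sketch of that matching plus the first/last-kernel aggregation, under two assumptions I'm adding: the filtered ranges don't overlap within a thread (they are the stack-top F/B/W/Optimizer ranges), and rows are plain tuples rather than whatever your script actually loads:

```python
import bisect
from collections import defaultdict

def match_calls_to_ranges(nvtx_ranges, cuda_calls):
    """nvtx_ranges: (start, end, tid, text) rows; cuda_calls: (start, end, tid,
    correlationId) rows.  Assumes the filtered ranges do not overlap within a
    thread, so a binary search over their start times suffices."""
    by_tid = defaultdict(list)
    for rng in nvtx_ranges:
        by_tid[rng[2]].append(rng)
    starts = {}
    for tid, ranges in by_tid.items():
        ranges.sort()
        starts[tid] = [r[0] for r in ranges]
    mapping = defaultdict(list)  # (tid, text, range start) -> correlationIds
    for c_start, c_end, tid, corr in cuda_calls:
        ranges = by_tid.get(tid)
        if not ranges:
            continue
        i = bisect.bisect_right(starts[tid], c_start) - 1  # last range starting <= call
        if i >= 0 and c_end <= ranges[i][1]:  # call lies fully inside that range
            mapping[(tid, ranges[i][3], ranges[i][0])].append(corr)
    return mapping

def gpu_span(kernel_rows):
    """GPU start/end of one NVTX event from its kernels' (start, end) pairs."""
    return min(k[0] for k in kernel_rows), max(k[1] for k in kernel_rows)
```

On real data you would feed in the NVTX rows from the SQL above and the runtime rows from the kernel/runtime join, then resolve each correlationId back to its kernel rows to compute the GPU span.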
I would like to thank Jay and Holly from Nvidia Support for helping us figure all this out.
We use Python and the `drawsvg` library for plotting. You can refer to the code in our playground.
If you are running on multiple servers, you will need to combine the sqlite files from each server.
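One way to sketch that combining step: process each server's export separately and tag every event with the file it came from so the per-server timelines stay distinguishable (the naming scheme and the `load_events` callback are assumptions, not the repo's actual interface):

```python
import glob
import os

def gather_events(sqlite_dir, load_events):
    """load_events(path) -> list of event dicts for one export; we tag each
    event with its source file before merging everything into one list."""
    merged = []
    for path in sorted(glob.glob(os.path.join(sqlite_dir, "*.sqlite"))):
        for ev in load_events(path):
            ev["server"] = os.path.basename(path)  # hypothetical naming scheme
            merged.append(ev)
    return merged
```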
I ran it with the following commands:

```
nsys profile -s none -t nvtx,cuda -o pp_profing.nsys-rep -f true --capture-range=cudaProfilerApi --capture-range-end=stop \
    torchrun $DISTRIBUTED_ARGS ../pretrain_gpt.py \
    ...
    --fp16 \
    --use-flash-attn \
    --use-distributed-optimizer \
    --profile

nsys export --type sqlite --output pp_profing.sqlite pp_profing_nofa.nsys-rep
python load_nsys_events.py ./
```
However, the following error occurs:

```
no nvtx event found for kernel DeviceCompactInitKernel
no nvtx event found for kernel DeviceSelectSweepKernel
.....
no nvtx event found for kernel DeviceCompactInitKernel
no nvtx event found for kernel DeviceCompactInitKernel
no nvtx event found for kernel DeviceSelectSweepKernel
Traceback (most recent call last):
  File "/nsys/load_nsys_events.py", line 431, in <module>
    main(sqlite_dir, sqlite_dir_other, True, True)
  File "/nsys/load_nsys_events.py", line 397, in main
    nvtx_kernels_map = create_nvtx_kernels_map(sqlite_files, ignore_comm, 0)
  File "/nsys/load_nsys_events.py", line 363, in create_nvtx_kernels_map
    server_events.append((server_sort_key(m), m))
  File "/nsys/load_nsys_events.py", line 230, in server_sort_key
    f_num = first_dev_f_num(nvtx_event_map)
  File "/nsys/load_nsys_events.py", line 271, in first_dev_f_num
    assert dev_id is not None
AssertionError
```
Is there a problem with my execution?
It seems like the events that the script depends on are missing. This script is only supported when zero-bubble or zero-bubble-v is enabled.
Your question
How can I profile bubble time in pipeline parallelism?