triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/

The relationship between operator running time and number of runs #3934

Open Ppaddington opened 4 months ago

Ppaddington commented 4 months ago

I ran the same function on the same data five times, and the five running times were: 1282.27764 ms, 0.35153 ms, 0.15597 ms, 0.1487 ms, 0.14346 ms. The relative difference between 0.35153 ms and 0.14346 ms is quite large!

I am curious why the third, fourth, and fifth runs were faster than the second. Could anyone help me?

    inp = torch.rand([1, 4096], dtype=torch.bfloat16).to(device)  # NVIDIA GeForce RTX 4090
    fc1 = torch.nn.Linear(4096, 11008, bias=True, dtype=torch.bfloat16).to(device)

Measured times: 1282.27764 ms, 0.35153 ms, 0.15597 ms, 0.1487 ms, 0.14346 ms

    for num in range(5):
        t1_start_time = time.process_time()
        res1 = my_func(inp, fc1.weight, fc1.bias)
        t1_end_time = time.process_time()
        t1_duration = t1_end_time - t1_start_time
        print(f"1Evaluation triton costs {num}: {t1_duration*1000: .8f} ms.")
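For reference, a wall-clock variant of the same loop, sketched under the assumption that the same inp, fc1, and my_func are in scope: time.process_time() only counts CPU time, and CUDA kernel launches are asynchronous, so the GPU has to be synchronized before reading the clock.

    import time
    import torch

    for num in range(5):
        torch.cuda.synchronize()       # make sure earlier GPU work has finished
        t_start = time.perf_counter()  # wall-clock time, not CPU time
        res1 = my_func(inp, fc1.weight, fc1.bias)
        torch.cuda.synchronize()       # wait for the kernel to complete
        t_end = time.perf_counter()
        print(f"wall-clock run {num}: {(t_end - t_start) * 1000:.5f} ms")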

yyinter commented 4 months ago

When you run a function for the first time, the GPU needs to initialize and load the necessary computational resources, which can result in a much longer execution time. Subsequent runs of the same function may benefit from cached resources, leading to faster execution. You can also record the start and end time points with event.record() and then calculate the time difference with event.elapsed_time(); this allows a more accurate measurement of GPU runtime.
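As an aside, Triton also ships a benchmarking helper, triton.testing.do_bench, which performs warm-up and repeated runs before reporting a runtime in milliseconds, so the first-call compilation cost is excluded. A minimal sketch, assuming the inp, fc1, and my_func from the question:

    import torch
    import triton

    # do_bench warms the function up and then times repeated runs,
    # returning the runtime in milliseconds.
    ms = triton.testing.do_bench(lambda: my_func(inp, fc1.weight, fc1.bias))
    print(f"steady-state runtime: {ms:.5f} ms")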

Ppaddington commented 4 months ago


Thanks a lot!

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for num in range(5):
        start.record()
        res1 = my_func(inp, fc1.weight, fc1.bias)
        end.record()
        torch.cuda.synchronize()
        elapsed_time = start.elapsed_time(end)
        print(f'elapsed time {num}: ', elapsed_time)

Output:

    elapsed time 0: 1194.6536865234375
    elapsed time 1: 0.1934719979763031
    elapsed time 2: 0.1515520066022873
    elapsed time 3: 0.1443839967250824
    elapsed time 4: 0.14035199582576752

The GPU runtime and wall-clock time conclusions appear to be consistent.

Actually, I care about wall-clock time because I am trying to implement an efficient operator for LLM inference.

For example, during the decoding stage of Llama-2-7B there are two MLP feed-forward operations (matrix multiplications) in each transformer layer. After Triton compilation completes (the 1282.27764 ms first call), I get 0.35153 ms of wall-clock time for the MLP forward operation, but I would like to get 0.14346 ms. That would accelerate the end-to-end inference time!

Could you provide some advice on this?
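One common mitigation, sketched here as an assumption rather than a confirmed fix from this thread: run the operator a few times with representative shapes at model-load time, so that Triton compilation and any first-call caching happen before latency matters.

    # Hypothetical warm-up at startup (assumes the inp, fc1, and my_func above):
    # run the operator a few times with representative shapes so that compilation
    # and CUDA context setup happen outside the latency-critical serving path.
    for _ in range(3):
        my_func(inp, fc1.weight, fc1.bias)
    torch.cuda.synchronize()  # wait until the warm-up work has actually finished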