tkestack / vcuda-controller

P4 GPU cannot get enough GPU utilization #12

Closed zwpaper closed 3 years ago

zwpaper commented 3 years ago

I created a pod with 33 vcuda-core on a P4 GPU, but the GPU utilization stays below 3.

After setting LOGGER_LEVEL to 6, I can see these logs:

...
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 17190
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 17189
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 17188
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 1
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 1
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4, curr core: 19275
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4, curr core: 19271
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 2, curr core: 19267
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 19265
...
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 11371
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 6763
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 2155
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: -2453
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 1
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 1
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 14667
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 10059
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 5451
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 843
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: -3765
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 1
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 1
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 14667
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 10059
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 5451
...
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 677
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 676
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: -3932
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 5, curr core: 20497
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 20492
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 2, curr core: 20491

curr core keeps falling from 20497 down to negative values, and utilization never goes above 2.
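Note that between consecutive launches, curr core drops by exactly the logged kernel size (e.g. 4608), which matches a token-bucket pattern where the hijacked launch path subtracts the kernel's grid count from a shared pool. A minimal sketch of that behavior, assuming C11 atomics; g_cur_cuda_cores is the name from src/hijack_call.c, everything else here is assumed:

#include <stdatomic.h>
#include <time.h>

/* Shared token pool, refilled periodically by the utilization watcher.
 * g_cur_cuda_cores is the name used in src/hijack_call.c; the rest of
 * this sketch is assumed. */
static atomic_int g_cur_cuda_cores;
static const struct timespec g_wait = {0, 10 * 1000 * 1000};

/* Runs before every hijacked kernel launch; grids is the grid count
 * ("launch kernel 4608" in the log above). */
static void rate_limiter_sketch(int grids) {
  for (;;) {
    int before = atomic_load(&g_cur_cuda_cores);
    if (before < 0) {              /* pool exhausted: stall the launch */
      nanosleep(&g_wait, NULL);
      continue;                    /* retry after the watcher refills  */
    }
    int after = before - grids;    /* may dip below zero once, as logged */
    if (atomic_compare_exchange_weak(&g_cur_cuda_cores, &before, after))
      return;                      /* tokens taken; the kernel may run */
  }
}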

After checking the code, I also found that the P4 has 20 SMs, and the share increment is calculated like this:

(screenshot of the increment calculation in src/hijack_call.c)
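The screenshot itself is not preserved here, but the increments in the logs further down pin the formula down: with g_sm_num = 20 and g_max_thread_per_sm = 2048, the delta() calculation at the permalinks below reproduces 18089, 19275, and 20498 exactly. A reconstruction of that calculation, where presumably the red square is the base increment and the green square is the acceleration branch:

#include <stdlib.h>

/* Reconstruction of the calculation in the screenshot (delta() in
 * src/hijack_call.c). For a P4 (20 SMs, 2048 threads per SM) it
 * reproduces the increments logged below: diff 31 -> 18089,
 * diff 32 -> 19275, diff 33 -> 20498. Clamping of the share to
 * [0, g_total_cuda_cores] is omitted. */
static int g_sm_num = 20;
static int g_max_thread_per_sm = 2048;

static int delta_sketch(int up_limit, int user_current, int share) {
  int diff = abs(up_limit - user_current);
  if (diff < 5) diff = 5;
  /* presumably the "red square": base increment, quadratic in SM count */
  int increment =
      g_sm_num * g_sm_num * g_max_thread_per_sm * diff / 2560;
  /* presumably the "green square": enlarge the increment when current
   * utilization is far from the target */
  if (diff > up_limit / 2)
    increment = increment * diff * 2 / (up_limit + 1);
  return user_current <= up_limit ? share + increment : share - increment;
}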

Here are my questions:

  1. Why is the red-square part used for the increment calculation?
  2. Have you run into this before? I noticed the green-square part enlarges the increment.
  3. How can we fix this, so that the P4 can reach enough GPU utilization?
mYmNeo commented 3 years ago

Need more information: the increase share to and decrease share to logs.

zwpaper commented 3 years ago

@mYmNeo

/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 40996, increment: 20498
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 38587, increment: 18089
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 38587, increment: 18089
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 38587, increment: 18089
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 40996, increment: 20498
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 40996, increment: 20498
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 40996, increment: 20498
zwpaper commented 3 years ago

There are no decrease logs.

mYmNeo commented 3 years ago

There are no decrease logs.

What kind of application is it? Training or inference?

zwpaper commented 3 years ago

Training

As can be seen, curr core always goes below 0, so there are not enough cores for the rate_limiter to use.

And because curr core runs out, GPU usage cannot go any higher, which means

if (top_result.sys_process_num == 1 &&
          top_result.user_current < up_limit / 10) {

would always be true, so curr core always gets reset to an init number that is too small to push GPU usage up, and this becomes a loop.
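To put numbers to the loop, using values from the logs above: a refill of roughly 20498 tokens is drained by training kernels of grid size 4608 within a handful of launches, so the sampled utilization stays near 1 and the watcher keeps handing out the same small share. A small simulation of one check period:

#include <stdio.h>

/* One check period of the loop described above, with values from the
 * logs: a refill of ~20498 tokens and kernels of grid size 4608. */
int main(void) {
  int core = 20498;     /* share handed out by the watcher  */
  int grids = 4608;     /* grid count per kernel launch     */
  int launches = 0;

  while (core >= 0) {   /* the rate limiter stalls once < 0 */
    core -= grids;
    launches++;
  }
  printf("pool drained after %d launches (curr core: %d)\n",
         launches, core);
  /* -> 5 launches, curr core -2542. The GPU then idles until the next
   * refill, so sampled utilization stays ~1, the condition above holds,
   * and the watcher hands out a small share again. */
  return 0;
}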

zwpaper commented 3 years ago

May I ask why and how this init value is calculated:

https://github.com/tkestack/vcuda-controller/blob/9b7b1e675d42b15df8de23e4f8950a1d6af1086a/src/hijack_call.c#L231-L232

https://github.com/tkestack/vcuda-controller/blob/9b7b1e675d42b15df8de23e4f8950a1d6af1086a/src/hijack_call.c#L183-L191

And why is this kernel size used for the rate limit?

https://github.com/tkestack/vcuda-controller/blob/9b7b1e675d42b15df8de23e4f8950a1d6af1086a/src/hijack_call.c#L178

mYmNeo commented 3 years ago

The calculation formula is an empirical design; on a different GPU architecture it sometimes may not work well, as in your situation. In your case, the CUDA computation is too fast, and the update period or the initial core count is too small. So you can decrease the check period or modify the init core size.
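Neither knob is configurable at the linked commit, so this means patching the source and rebuilding. A hypothetical sketch of the two adjustments; the names and values here are illustrative assumptions, not upstream options:

#include <time.h>

/* Hypothetical tuning for the two fixes suggested above; the upstream
 * code hard-codes both values, so these names are illustrative. */

/* 1. Shorter check period: refill the token pool more often, so a
 *    drained pool stalls launches for less time per period. */
static const struct timespec g_check_period = {0, 5 * 1000 * 1000};

/* 2. Larger initial share: when a single, nearly idle process is
 *    detected, hand out enough tokens that fast kernels cannot drain
 *    the pool before the next utilization sample. */
static int initial_share(int total_cuda_cores, int up_limit) {
  return (long)total_cuda_cores * up_limit / 100;  /* proportional, not fixed */
}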

zwpaper commented 3 years ago

OK, got it. Thanks.

WulixuanS commented 2 years ago

I had the same problem. Could you tell me how you solved it, @zwpaper?

zwpaper commented 2 years ago

We manually updated the threshold and the algorithm; it is not that easy to explain...

jin-zhengnan commented 1 year ago

We manually updated the threshold and the algorithm; it is not that easy to explain...

Does every GPU type need its own threshold and algorithm modifications? Is there a general formula?

zwpaper commented 1 year ago

@jin-zhengnan we modified them, and the result fits all of our needs across all cards.

jin-zhengnan commented 1 year ago

@zwpaper Can you show your new algorithm for the share calculation? Thanks!

zwpaper commented 1 year ago

@jin-zhengnan Sorry, I can't; the company owns it.

seanchen022 commented 1 year ago

@jin-zhengnan Sorry, I can't; the company owns it.

Hi, could you please share your ideas? I have modified the adjustment strategy for g_cur_cuda_cores in the change token method, and also reduced the value of g_total_cuda_cores. However, the results have been mediocre, and the computational power is not effectively constrained.
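For context, the change token path mentioned above applies the watcher's signed delta to the token pool, and g_total_cuda_cores only caps it, so lowering the cap alone does not change how quickly the pool refills while it stays under the cap. A sketch, reconstructed from src/hijack_call.c (details may differ from upstream):

#include <stdatomic.h>

/* Sketch of the change_token path, reconstructed from src/hijack_call.c
 * (clamping details may differ from upstream). The watcher computes a
 * signed delta and applies it to the pool; g_total_cuda_cores is only
 * an upper cap on the result. */
static atomic_int g_cur_cuda_cores;
static int g_total_cuda_cores;

static void change_token_sketch(int delta) {
  int before, after;
  do {
    before = atomic_load(&g_cur_cuda_cores);
    after = before + delta;
    if (after > g_total_cuda_cores)
      after = g_total_cuda_cores;
  } while (!atomic_compare_exchange_weak(&g_cur_cuda_cores, &before, after));
}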