Closed zwpaper closed 3 years ago
Need more information about the `increase share to` and `decrease share to` logs.
@mYmNeo
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 40996, increment: 20498
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 38587, increment: 18089
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 38587, increment: 18089
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 38587, increment: 18089
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 39773, increment: 19275
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 40996, increment: 20498
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 40996, increment: 20498
/tmp/cuda-control/src/hijack_call.c:225 increase share to: 40996, increment: 20498
There are no `decrease share to` logs.
What kind of application is it? Training or inference?
Training
It can be seen that the curr core count always goes below 0, so there are not enough curr cores for `rate_limiter` to use. And because of this lack of curr cores, the GPU usage cannot go higher.
if (top_result.sys_process_num == 1 &&
top_result.user_current < up_limit / 10) {
would always be true, so the curr core count always gets reset to the init number, which is too small to drive GPU usage high; this becomes a loop.
May I ask why and how this init value is calculated, and why this kernel count is used for the rate limit?
The calculation formula is an empirical design; on a different GPU architecture it sometimes does not work well, as in your situation. In your case, the CUDA computation is too fast and the update period or init core size is too small. So you can decrease the check period or modify the init core size.
OK, got it. Thanks.
I had the same problem. Could you tell me how you solved it? @zwpaper
manually update the threshold and algorithm, it is not that easy to explain...
Does every GPU type need its own modified threshold and algorithm, or is there a general formula?
@jin-zhengnan we modified them, and the result fits all of our needs across all cards.
@zwpaper Can you share your new algorithm? Thanks.
@jin-zhengnan sorry, I can't, the company owns it.
Hi, could you please share your ideas? I have modified the adjustment strategy for `g_cur_cuda_cores` in the `change_token` method, and also reduced the value of `g_total_cuda_cores`. However, the results have been mediocre, and the computational power is not effectively constrained.
I created a pod using 33 vcuda-core with a P4 GPU, but the GPU utilization keeps staying under 3. After setting LOGGER_LEVEL to 6, I can see these logs: curr core keeps going from 20497 down to negative, and utilization stays no larger than 2. After checking the code, I also found that the P4 has 20 SMs and it is calculated:
Here are my questions: