Kubuxu closed this issue 8 years ago.
Here is profiled result: https://gist.github.com/Kubuxu/da34b3d00e3f7f9a4a18b7117631d583
Looks like the problem is a very slow call to clEnqueueReadBuffer at main.c:785, although I have no idea why.
I have tried waiting on events before reads, in reads, and non-blocking reads. I don't know what else could help; so far it looks like a bug in the Nvidia OpenCL driver for Linux. They might be doing some busy polling.
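A minimal sketch of the event-based variant described above, with placeholder names (queue, buf, host_buf, size) rather than silentarmy's actual code:

#include <CL/cl.h>

/* Sketch only: issue a non-blocking read, then wait on its event instead of
   doing a blocking read. On the Nvidia driver both the blocking variant and
   clWaitForEvents() spin on the CPU, so this did not help. */
static cl_int read_with_event(cl_command_queue queue, cl_mem buf,
                              void *host_buf, size_t size)
{
    cl_event read_done;
    cl_int err = clEnqueueReadBuffer(queue, buf, CL_FALSE /* non-blocking */,
                                     0, size, host_buf, 0, NULL, &read_done);
    if (err != CL_SUCCESS)
        return err;
    err = clWaitForEvents(1, &read_done);   /* still busy-polls on Nvidia */
    clReleaseEvent(read_done);
    return err;
}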
Did you install CUDA 8.0? It might help.
Yes, I am using CUDA 8.0.44 from the Arch repository.
I tried using clEnqueueMapBuffer in read mode, hoping to mitigate this; unfortunately, the core still spins at 100%.
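For reference, a minimal sketch of that mapping variant, again with placeholder names rather than the actual silentarmy code:

#include <CL/cl.h>

/* Sketch only: map the buffer for reading instead of copying it out with
   clEnqueueReadBuffer(). The map call still has to synchronize with the
   device, so the driver's busy-wait shows up here as well. */
static void read_via_map(cl_command_queue queue, cl_mem buf, size_t size)
{
    cl_int err;
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE /* blocking map */,
                                 CL_MAP_READ, 0, size, 0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return;
    /* ... read the results through p ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}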
100% CPU usage on Nvidia is due to busy waiting in their OpenCL implementation. I am going to ship a workaround, based on this solution: https://bitcointalk.org/index.php?topic=181328.0
For those who really can't wait for this Nvidia CPU usage fix, see these steps to implement the workaround: https://bitcointalk.org/index.php?topic=1666489.msg16819293#msg16819293
I don't know if you've seen my #60, but it works quite well and is a lot less hacky than overriding an arbitrary library function (it is the function the debugger most commonly breaks at).
That said, yours might be the better solution; I don't really have time to evaluate them both.
/* Temporary fix for silentarmy - nvidia
   The MIT License (MIT) Copyright (c) 2016 krnlx, kernelx at me.com */
#include <dlfcn.h>
#include <assert.h>
#include <time.h>
#include <unistd.h>

int inited = 0;
void *libc = NULL;
int (*_libc_clock_gettime)(clockid_t clk_id, struct timespec *tp) = NULL;

/* Resolve the real clock_gettime() from libc once, at load time. */
static void __attribute__((constructor)) lib_init(void) {
    if (inited) return;
    libc = dlopen("libc.so.6", RTLD_LAZY);
    assert(libc);
    _libc_clock_gettime = (int (*)(clockid_t, struct timespec *)) dlsym(libc, "clock_gettime");
    assert(_libc_clock_gettime);
    inited++;
}

useconds_t sleep_time = 100;  /* alternatives tried in the original: nanosleep() with a fixed interval, sched_yield() */

/* Interposed clock_gettime(): sleep briefly so the driver's busy-wait loop
   stops hogging a full core, then forward to the real libc implementation. */
int clock_gettime(clockid_t clk_id, struct timespec *tp) {
    lib_init();
    usleep(sleep_time);
    return (*_libc_clock_gettime)(clk_id, tp);
}
gcc -O2 -fPIC -shared -Wl,-soname,libtime.so -o libtime.so libtime.c
os.environ["LD_PRELOAD"]="./libtime.so"
in Python before launching.
@krnlx: Don't you think that Kubuxu's solution in #60 might be a cleaner/simpler approach? Have you tested yours and his? Any difference in performance?
Testing now. It works. It seems https://github.com/mbevand/silentarmy/pull/60 achieves better performance; I need more time to test.
Kubuxu's solution gives +1-2% performance, but loads the CPU more:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2954 krnl      20   0 29.249g 108996  90400 S   7.6  2.7   0:45.65 sa-solver
 2958 krnl      20   0 29.249g 109644  90116 R   7.6  2.7   0:45.49 sa-solver
 2953 krnl      20   0 29.249g 106936  90172 S   7.3  2.7   0:45.15 sa-solver
 2955 krnl      20   0 29.249g 108996  90400 S   7.3  2.7   0:45.53 sa-solver
 2957 krnl      20   0 29.249g 108808  90216 S   7.3  2.7   0:45.66 sa-solver
 2956 krnl      20   0 29.249g 108764  90172 S   7.0  2.7   0:45.66 sa-solver
My solution:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3514 krnl      20   0 29.251g 108152  90180 S   4.0  2.7   0:01.42 sa-solver
 3512 krnl      20   0 29.251g 108132  90152 S   3.7  2.7   0:01.39 sa-solver
 3513 krnl      20   0 29.251g 108272  90296 R   3.7  2.7   0:01.44 sa-solver
 3515 krnl      20   0 29.251g 106384  90448 S   3.7  2.7   0:01.42 sa-solver
 3516 krnl      20   0 29.251g 106152  90220 R   3.7  2.7   0:01.42 sa-solver
 3517 krnl      20   0 29.251g 108148  90176 S   3.7  2.7   0:01.42 sa-solver
Ok, so a 2x CPU increase with Kubuxu's solution... But still reasonable at 7.5% per core, per process. What model is your CPU?
On my Intel® Core™ i5-4200U CPU @ 1.60GHz × 4, the CPU load is roughly the same with both solutions, but the video card lags less when I am scrolling the web with Kubuxu's solution. Honestly, I don't know what command krnlx is using to measure.
I am also getting 7.5% to 8% on the i7-6800k. Check out my suggestion in https://github.com/mbevand/silentarmy/pull/60#issuecomment-260143755.
@montvid krnlx used top
@Kubuxu Ok. I guess the % of CPU time used is tweakable by adjusting how long we sleep. I like that you measure the average running time. I'll probably merge your fix, unless krnlx has more feedback or ideas.
Yes, it can be tweaked, but possibly at the cost of reduced performance.
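For readers who have not opened #60, a rough illustration of the idea discussed here (measure the average running time, then sleep for most of it before the blocking call). All names and constants below are placeholders, not the actual patch:

#include <time.h>
#include <unistd.h>
#include <CL/cl.h>

/* Illustrative only: keep a running average of how long one wait takes and
   usleep() for most of that time before issuing the blocking read, so the
   driver's spin-wait only runs for the short remainder. */
static double avg_us;   /* running average of one wait, in microseconds */

static void read_results(cl_command_queue queue, cl_mem buf,
                         void *host_buf, size_t size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (avg_us > 0)
        usleep((useconds_t)(avg_us * 0.9));   /* sleep ~90% of the expected wait */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_buf, 0, NULL, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    avg_us = avg_us > 0 ? 0.95 * avg_us + 0.05 * us : us;  /* exponential moving average */
}

Sleeping longer cuts CPU use further but risks overshooting the GPU round and costing hashrate, which is the trade-off mentioned above.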
This should be fixed by https://github.com/mbevand/silentarmy/commit/a6c3517fc189a934edfa89549664f95b51b965d8
@mbevand
I've read that this bug could possibly be worked around by using clWaitForEvents(), but I'm not a programmer, so if I'm talking nonsense I apologize in advance.
@birdie-github clWaitForEvents() also busy waits.
@Kubuxu
You're right: https://github.com/pandegroup/openmm/issues/1541
It's appalling that NVIDIA does nothing to resolve this bug. It looks like they really care only about CUDA.
How do they care about CUDA? Tromp has a CUDA solver and it has the same problem. Do you know of any code that would fix this in CUDA?
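For what it's worth, the CUDA runtime does document a way to request blocking synchronization instead of spinning; whether it actually removes the CPU load for a given solver is unclear. A minimal sketch:

#include <cuda_runtime.h>

/* Sketch only: ask the CUDA runtime to block the host thread on
   synchronization instead of spin-waiting. Must be set before the CUDA
   context for the device is created. */
int main(void)
{
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);
    /* ... enqueue kernels ... */
    cudaDeviceSynchronize();   /* blocks rather than busy-polls */
    return 0;
}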
Running latest master, using LD_PRELOAD to load the ICD. Mining with the command:
LD_PRELOAD="/usr/lib/libOpenCL.so.1" ./silentarmy --use 2 -c stratum+tcp://xxxxxxxxx:3333 -u txxxxxxxxxxxd.nvidia --instances 1
causes one whole core to spin at 100% in sa-solver.