Kubuxu closed this issue 8 years ago.
Here is profiled result: https://gist.github.com/Kubuxu/da34b3d00e3f7f9a4a18b7117631d583
Looks like the problem is a very slow call to clEnqueueReadBuffer at main.c:785, although I have no idea why.
I have tried waiting on events before reads, in reads, and non-blocking reads. I don't know what else could help; so far it looks like a bug in the Nvidia OpenCL driver for Linux. They might be doing some busy polling.
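A minimal sketch of the event-based variant described above, with placeholder names (queue, buf, host_buf, size) rather than silentarmy's actual code:

#include <CL/cl.h>

/* Sketch only: issue a non-blocking read, then wait on its event instead of
   doing a blocking read. On the Nvidia driver both the blocking variant and
   clWaitForEvents() spin on the CPU, so this did not help. */
static cl_int read_with_event(cl_command_queue queue, cl_mem buf,
                              void *host_buf, size_t size)
{
    cl_event read_done;
    cl_int err = clEnqueueReadBuffer(queue, buf, CL_FALSE /* non-blocking */,
                                     0, size, host_buf, 0, NULL, &read_done);
    if (err != CL_SUCCESS)
        return err;
    err = clWaitForEvents(1, &read_done);   /* still busy-polls on Nvidia */
    clReleaseEvent(read_done);
    return err;
}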
Did you install CUDA 8.0? It might help.
Yes, I am using CUDA 8.0.44 from the Arch repository.
I tried using clEnqueueMapBuffer in read mode, hoping to mitigate this; unfortunately, the core still spins at 100%.
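For reference, a minimal sketch of that mapping variant, again with placeholder names rather than the actual silentarmy code:

#include <CL/cl.h>

/* Sketch only: map the buffer for reading instead of copying it out with
   clEnqueueReadBuffer(). The map call still has to synchronize with the
   device, so the driver's busy-wait shows up here as well. */
static void read_via_map(cl_command_queue queue, cl_mem buf, size_t size)
{
    cl_int err;
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE /* blocking map */,
                                 CL_MAP_READ, 0, size, 0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return;
    /* ... read the results through p ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}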
100% CPU usage on Nvidia is due to busy waiting in their OpenCL implementation. I am going to ship a workaround, based on this solution: https://bitcointalk.org/index.php?topic=181328.0
For those who really can't wait for this Nvidia CPU usage fix, see these steps to implement the workaround: https://bitcointalk.org/index.php?topic=1666489.msg16819293#msg16819293
I don't know if you've seen my #60, but it works quite well and is a lot less hacky than overriding an arbitrary library function (it is the function the debugger most commonly breaks at).
That said, yours might be the better solution; I don't really have time to evaluate them both.
/* Temporary fix for silentarmy - nvidia
   The MIT License (MIT) Copyright (c) 2016 krnlx, kernelx at me.com */
#include <dlfcn.h>
#include <assert.h>
#include <time.h>
#include <unistd.h>

int inited = 0;
void *libc = NULL;
int (*_libc_clock_gettime)(clockid_t clk_id, struct timespec *tp) = NULL;

/* Resolve the real clock_gettime() from libc once, at load time. */
static void __attribute__((constructor)) lib_init(void) {
    if (inited) return;
    libc = dlopen("libc.so.6", RTLD_LAZY);
    assert(libc);
    _libc_clock_gettime = (int (*)(clockid_t, struct timespec *)) dlsym(libc, "clock_gettime");
    assert(_libc_clock_gettime);
    inited++;
}

useconds_t sleep_time = 100;  /* alternatives tried in the original: nanosleep() with a fixed interval, sched_yield() */

/* Interposed clock_gettime(): sleep briefly so the driver's busy-wait loop
   stops hogging a full core, then forward to the real libc implementation. */
int clock_gettime(clockid_t clk_id, struct timespec *tp) {
    lib_init();
    usleep(sleep_time);
    return (*_libc_clock_gettime)(clk_id, tp);
}
gcc -O2 -fPIC -shared -Wl,-soname,libtime.so -o libtime.so libtime.c
os.environ["LD_PRELOAD"]="./libtime.so"
in Python before launching.
@krnlx: Don't you think that Kubuxu's solution in #60 might be a cleaner/simpler approach? Have you tested yours and his? Any difference in performance?
Testing now. It works. It seems https://github.com/mbevand/silentarmy/pull/60 achieves better performance; I need more time to test.
Kubuxu's solution gives +1-2% performance, but loads the CPU more:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2954 krnl      20   0 29.249g 108996  90400 S   7.6  2.7   0:45.65 sa-solver
 2958 krnl      20   0 29.249g 109644  90116 R   7.6  2.7   0:45.49 sa-solver
 2953 krnl      20   0 29.249g 106936  90172 S   7.3  2.7   0:45.15 sa-solver
 2955 krnl      20   0 29.249g 108996  90400 S   7.3  2.7   0:45.53 sa-solver
 2957 krnl      20   0 29.249g 108808  90216 S   7.3  2.7   0:45.66 sa-solver
 2956 krnl      20   0 29.249g 108764  90172 S   7.0  2.7   0:45.66 sa-solver
My solution:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3514 krnl      20   0 29.251g 108152  90180 S   4.0  2.7   0:01.42 sa-solver
 3512 krnl      20   0 29.251g 108132  90152 S   3.7  2.7   0:01.39 sa-solver
 3513 krnl      20   0 29.251g 108272  90296 R   3.7  2.7   0:01.44 sa-solver
 3515 krnl      20   0 29.251g 106384  90448 S   3.7  2.7   0:01.42 sa-solver
 3516 krnl      20   0 29.251g 106152  90220 R   3.7  2.7   0:01.42 sa-solver
 3517 krnl      20   0 29.251g 108148  90176 S   3.7  2.7   0:01.42 sa-solver
Ok, so a 2x CPU increase with Kubuxu's solution... But still reasonable at 7.5% per core, per process. What model is your CPU?
On my Intel® Core™ i5-4200U CPU @ 1.60GHz × 4, the CPU load is roughly the same with both solutions, but the video card lags less when I am scrolling the web with Kubuxu's solution. Honestly, I don't know what command krnlx is using to measure.
I am also getting 7.5% to 8% on the i7-6800k. Check out my suggestion in https://github.com/mbevand/silentarmy/pull/60#issuecomment-260143755.
@montvid krnlx used top
@Kubuxu Ok. I guess the % of CPU time used is tweakable by adjusting how long we sleep. I like that you measure the average running time. I'll probably merge your fix, unless krnlx has more feedback or ideas.
Yes, it can be tweaked, but possibly at the cost of reduced performance.
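For readers who have not opened #60, a rough illustration of the idea discussed here (measure the average running time, then sleep for most of it before the blocking call). All names and constants below are placeholders, not the actual patch:

#include <time.h>
#include <unistd.h>
#include <CL/cl.h>

/* Illustrative only: keep a running average of how long one wait takes and
   usleep() for most of that time before issuing the blocking read, so the
   driver's spin-wait only runs for the short remainder. */
static double avg_us;   /* running average of one wait, in microseconds */

static void read_results(cl_command_queue queue, cl_mem buf,
                         void *host_buf, size_t size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (avg_us > 0)
        usleep((useconds_t)(avg_us * 0.9));   /* sleep ~90% of the expected wait */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_buf, 0, NULL, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    avg_us = avg_us > 0 ? 0.95 * avg_us + 0.05 * us : us;  /* exponential moving average */
}

Sleeping longer cuts CPU use further but risks overshooting the GPU round and costing hashrate, which is the trade-off mentioned above.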
This should be fixed by https://github.com/mbevand/silentarmy/commit/a6c3517fc189a934edfa89549664f95b51b965d8
@mbevand
I've read that this bug could possibly be worked around by using clWaitForEvents(), but I'm not a programmer, so if I'm talking nonsense I apologize in advance.
@birdie-github clWaitForEvents() also busy waits.
@Kubuxu
You're right: https://github.com/pandegroup/openmm/issues/1541
It's appalling that NVIDIA does nothing to resolve this bug. It looks like they really care only about CUDA.
How do they care about CUDA? Tromp has a CUDA solver and it has the same problem. Do you know of any code that would fix this in CUDA?
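For what it's worth, the CUDA runtime does document a way to request blocking synchronization instead of spinning; whether it actually removes the CPU load for a given solver is unclear. A minimal sketch:

#include <cuda_runtime.h>

/* Sketch only: ask the CUDA runtime to block the host thread on
   synchronization instead of spin-waiting. Must be set before the CUDA
   context for the device is created. */
int main(void)
{
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);
    /* ... enqueue kernels ... */
    cudaDeviceSynchronize();   /* blocks rather than busy-polls */
    return 0;
}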
Running latest master, using LD_PRELOAD to load the ICD. Mining with the command:
LD_PRELOAD="/usr/lib/libOpenCL.so.1" ./silentarmy --use 2 -c stratum+tcp://xxxxxxxxx:3333 -u txxxxxxxxxxxd.nvidia --instances 1
causes one whole core to spin at 100% in sa-solver.