Benchmarking is currently invalid. Use division of work, not work duplication.

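The benchmark computes a Fibonacci sequence, and every thread repeats that same computation. Duplicating work across many cores, even at 100% efficiency, will never yield a speedup: N threads do N times the work and, at best, finish in the same wall-clock time as one thread.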

Work has to be divided among threads to see a speedup. I propose a simple, valid benchmark:

Multiply together many elements of a list: divide the list into chunks and give one chunk to each thread.
Alternatively, use a loop that multiplies simple numbers many times (>10^9 iterations), and divide the loop iterations among the threads.
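
To make the first proposal concrete, here is a minimal sketch of what I have in mind. It is not code from the repository: the names `chunked`, `parallel_product`, `benchmark` and `MODULUS` are made up for illustration, and reducing modulo a prime is my own assumption, added only so the operands stay small and the timing is not dominated by big-integer growth.

```python
import threading
import time

# Assumption for illustration: reduce modulo a prime so operands stay small.
MODULUS = 1_000_000_007


def chunked(items, n_chunks):
    """Split items into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_product(items, n_threads):
    """Multiply all items modulo MODULUS, one chunk of the list per thread."""
    partials = [1] * n_threads

    def worker(index, chunk):
        result = 1
        for x in chunk:
            result = (result * x) % MODULUS
        partials[index] = result  # each thread writes only its own slot

    threads = [
        threading.Thread(target=worker, args=(i, chunk))
        for i, chunk in enumerate(chunked(items, n_threads))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Combine the per-thread partial products on the main thread.
    total = 1
    for p in partials:
        total = (total * p) % MODULUS
    return total


def benchmark(n_items, n_threads):
    """Return wall-clock seconds for one divided run over n_items elements."""
    items = list(range(1, n_items + 1))
    start = time.perf_counter()
    parallel_product(items, n_threads)
    return time.perf_counter() - start


if __name__ == "__main__":
    for n_threads in (1, 2, 4):
        elapsed = benchmark(200_000, n_threads)
        print(f"{n_threads} thread(s): {elapsed:.3f}s")
```

The total amount of work is the same for every thread count, so the 1-thread time is the baseline and any drop in wall-clock time with more threads is a real speedup. On a stock CPython build with the GIL I would not expect this to get faster with more threads; the comparison is only interesting on the GIL-free build.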

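The good news is that if the work-duplication benchmark currently runs at a similar speed to the single-threaded code, then division of work will already be faster: each thread is getting through a full copy of the work in about the time one thread needs, so giving each thread only a share of it should finish sooner.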

larryhastings / gilectomy