Optimizations of OpenCL GPU implementation

A bit later than promised but here is a pull request from StreamHPC.

This PR contains multiple optimizations of the OpenCL GPU implementation. I'll try to describe them briefly:

Main optimizations

1. Do not use program.kernel_name to call kernels. We need multiple kernel launches every iteration. Each program.kernel_name access makes PyOpenCL call clCreateKernel, and for some reason this operation is very slow on AMD. Kernels are now cached.
2. Do not recompile the program. I changed types and kernels so they no longer depend on the lengths of targets and queries or on other variables that change often, if not every iteration. PyOpenCL does cache programs, but that doesn't help when a large set of targets and queries with a wide variety of lengths is used.
3. Do not recreate buffers every iteration. Buffer allocation is quite expensive. Buffers are now recreated only when they are too small to hold the new data.
4. Do not run traceback if there are no potential starting points (i.e. the scores of all alignments are lower than desired). calculateScore returns a special flag to signal whether traceback is required. In most cases only a few percent of all iterations run traceback.
5. Do not load the directions matrix to the host if traceback finds no starting points.
6. Optimize kernels. Multiple optimizations: better calculation of the max score (reduction), fewer operations inside the main loop, better local memory usage, etc.
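
The kernel-caching and buffer-reuse points above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class names `KernelCache` and `GrowOnlyBuffer` are hypothetical, and the factory callables are assumptions used to keep the sketch independent of a real OpenCL context. In real code they would wrap `pyopencl.Kernel(program, name)` and `pyopencl.Buffer(ctx, flags, size=n)`.

```python
class KernelCache:
    """Create each kernel object once and reuse it on later launches."""

    def __init__(self, create_kernel):
        # create_kernel: callable mapping a kernel name to a kernel object,
        # e.g. lambda name: pyopencl.Kernel(program, name)  (assumption)
        self._create_kernel = create_kernel
        self._kernels = {}

    def get(self, name):
        kernel = self._kernels.get(name)
        if kernel is None:
            # Only the first request for a name pays the creation cost.
            kernel = self._create_kernel(name)
            self._kernels[name] = kernel
        return kernel


class GrowOnlyBuffer:
    """Keep one device buffer and reallocate only when it is too small."""

    def __init__(self, allocate):
        # allocate: callable mapping a size in bytes to a buffer object,
        # e.g. lambda n: pyopencl.Buffer(ctx, flags, size=n)  (assumption)
        self._allocate = allocate
        self._buffer = None
        self.capacity = 0

    def ensure(self, size):
        # Reuse the existing buffer whenever the new data fits into it.
        if self._buffer is None or size > self.capacity:
            self._buffer = self._allocate(size)
            self.capacity = size
        return self._buffer
```

With this shape, an iteration would call `cache.get("calculateScore")` and `buffer.ensure(required_bytes)` instead of creating new objects, so the expensive calls happen only on first use or when the data outgrows the current allocation.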

Miscellaneous changes

Improved logging of the current progress. As you can see, it now contains the current and average performance, the total time and the projected time of completion (ETA).
```
INFO - Duration: 0.052 | Total: 0:00:37 | Performance:  3.65 GCUPS | Avg:  2.42 GCUPS | Progress: 14.254% | ETA: 0:03:42
```
Sorting of alignments in the result file. This is especially useful during development, to check that the latest code change did not alter the results (using diff, for example).
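
For reference, the quantities in the progress line above could be derived along these lines. This is a sketch under assumptions, not the logging code from this PR: the function names are hypothetical, GCUPS is taken in its usual sense of 10^9 matrix cell updates per second, and the ETA is a simple linear extrapolation from the elapsed time and the fraction of work done. The sample line is consistent with that extrapolation (37 s elapsed at 14.254% gives roughly 222 s remaining).

```python
from datetime import timedelta


def gcups(cells, seconds):
    """Giga cell updates per second for `cells` matrix cells."""
    return cells / seconds / 1e9


def eta_seconds(elapsed_seconds, progress):
    """Linear extrapolation of the remaining time; progress is in (0, 1]."""
    if not 0.0 < progress <= 1.0:
        raise ValueError("progress must be in (0, 1]")
    return elapsed_seconds * (1.0 - progress) / progress


def format_progress(step_cells, step_seconds, done_cells, elapsed_seconds, progress):
    """Build a progress line shaped like the one in the log sample."""
    total = timedelta(seconds=int(elapsed_seconds))
    eta = timedelta(seconds=int(eta_seconds(elapsed_seconds, progress)))
    return (
        "Duration: {:.3f} | Total: {} | Performance: {:5.2f} GCUPS | "
        "Avg: {:5.2f} GCUPS | Progress: {:.3f}% | ETA: {}"
    ).format(
        step_seconds,
        total,
        gcups(step_cells, step_seconds),      # performance of this iteration
        gcups(done_cells, elapsed_seconds),   # average since the start
        100.0 * progress,
        eta,
    )
```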

Could you check how it works on your data (correctness and performance), especially on large data sets?

Most of these optimizations can also be applied to the OpenCL CPU and CUDA implementations. The current OpenCL GPU implementation runs several times faster than before. On our GPUs it is also several times faster than the CUDA implementation.

For example, aligning 340 vs 4284 sequences (canisLupusFamiliaris.faa_4284.zip, 468 hits) takes

without gap extension:
- 17:39 on CUDA GeForce 980
- 2:46 on OpenCL GeForce 980
- 2:17 on OpenCL FirePro S9300 (1 GPU)
with gap extension (-g -1 --filter_factor=0.3):
- ??? on CUDA GeForce 980
- 4:13 on OpenCL GeForce 980
- 3:38 on OpenCL FirePro S9300 (1 GPU)
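
To turn the timings without gap extension into speedup factors, a small helper like the following can be used. The function names are hypothetical; the input values are the ones listed above, and the CUDA run with gap extension is left out because no timing is available for it.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline`."""
    return to_seconds(baseline) / to_seconds(candidate)


# Timings without gap extension, relative to CUDA on the GeForce 980.
cuda_980 = "17:39"
for name, timing in [("OpenCL GeForce 980", "2:46"),
                     ("OpenCL FirePro S9300 (1 GPU)", "2:17")]:
    print("{}: {:.2f}x".format(name, speedup(cuda_980, timing)))
```

This gives roughly 6.4x for OpenCL on the GeForce 980 and roughly 7.7x for the FirePro S9300.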

Does it show the same improvement for you?

I use --filter_factor=0.3 here because the default value (0.2) looks too low: it generates too many "false positive" alignments (i.e. they are returned by traceback but very rarely meet all requirements). In that case optimizations 4 and 5 don't improve performance. Perhaps this parameter should have a different default when gap extension is enabled?
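
To illustrate why false positives cancel out optimizations 4 and 5, here is a rough sketch of the per-iteration control flow. It is not the actual pyPaSWAS code: `process_iteration` and its three callables are hypothetical stand-ins for the kernel launches and the device-to-host copy. The two early exits are where the time is saved, so if nearly every iteration reports candidates, they are rarely taken.

```python
def process_iteration(calculate_score, run_traceback, read_directions):
    """Run one iteration and skip work that cannot produce alignments.

    calculate_score: runs the scoring kernels and returns True when at
        least one score is high enough to require traceback.
    run_traceback: runs the traceback kernel and returns the list of
        starting points it found (possibly empty).
    read_directions: copies the directions matrix from device to host.
    All three are hypothetical callables used only for this sketch.
    """
    if not calculate_score():
        # Optimization 4: no potential starting points, skip traceback.
        return None

    starting_points = run_traceback()
    if not starting_points:
        # Optimization 5: nothing to reconstruct, skip the host transfer.
        return None

    return starting_points, read_directions()
```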

Feel free to ask if something in the description is not clear.

swarris / pyPaSWAS

Optimizations of OpenCL GPU implementation #4
