Closed ex-rzr closed 6 years ago
Btw, I think that `calcCellUpdates.py` calculates CUPS incorrectly (the values are way too high): I suppose it counts `canisLupusAnkyrinPRED.fasta` (`sys.argv[1]`) and `canisLupusAnkyrinPRED.fasta_*.fasta` (`sys.argv[1] + "_{}.fasta".format(index)`), i.e. when called like this:
`python data/desktop/calcCellUpdates.py data/canisLupusAnkyrinPRED.fasta`
But the first file must be `canisLupusAnkyrin.fasta`, because that is what `runPerformanceTests.sh` uses.
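For reference, CUPS (cell updates per second) for Smith-Waterman is the total number of matrix cells divided by elapsed time; a minimal sketch of the counting part, with an illustrative `parse_fasta` helper (not the script's actual code):

```python
def parse_fasta(text):
    """Return the sequences in a FASTA-formatted string (illustrative helper)."""
    seqs, cur = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            if cur:
                seqs.append("".join(cur))
            cur = []
        elif line.strip():
            cur.append(line.strip())
    if cur:
        seqs.append("".join(cur))
    return seqs

def cell_updates(queries, targets):
    """Total alignment-matrix cells: sum of len(q) * len(t) over all pairs."""
    return sum(len(q) * len(t) for q in queries for t in targets)

# CUPS = cell_updates(queries, targets) / elapsed_seconds
```

If the query file pattern is wrong, `cell_updates` is computed over the wrong pairs, which would explain the inflated CUPS values.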
A bit later than promised, but here is a pull request from StreamHPC.
Multiple optimizations of the GPU implementation; I'll try to describe them briefly:
Main optimizations

1. Do not use `program.kernel_name` to call kernels. We need multiple kernel launches every iteration, and PyOpenCL calls `clCreateKernel` each time; for some reason this operation is very slow on AMD. So kernels are cached now.
2. Do not recompile the program. I changed types and kernels so they do not depend on the lengths of targets and queries and other variables that change, if not every iteration, then very often. Of course PyOpenCL caches programs, but this doesn't help when a big set of targets and queries with a high variety of lengths is used.
3. Do not recreate buffers every iteration. Buffer allocation is quite expensive. Now buffers are recreated only when they can't store the new data.
4. Do not run `traceback` if there are no potential starting points (i.e. the scores of all alignments are lower than desired). `calculateScore` returns a special flag to signal whether `traceback` is required. In most cases only a few percent of all iterations run `traceback`.
5. Do not load the directions matrix to the host if `traceback` found no starting points.
6. Optimize kernels. Multiple optimizations: better calculation of the max score (reduction), fewer operations inside the main loop, better local memory usage, etc.
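The kernel-caching optimization can be sketched as a small cache keyed by kernel name; `program` here stands for any object that creates a kernel on each attribute access, as PyOpenCL's `Program` does (the class and names are illustrative, not the PR's actual code):

```python
class KernelCache:
    """Fetch each kernel once and reuse it, avoiding a clCreateKernel per launch."""

    def __init__(self, program):
        self._program = program
        self._kernels = {}

    def __getitem__(self, name):
        kernel = self._kernels.get(name)
        if kernel is None:
            # Expensive path (clCreateKernel): hit only once per kernel name.
            kernel = getattr(self._program, name)
            self._kernels[name] = kernel
        return kernel
```

In the main loop one would then write `kernels["calculateScore"]` instead of `program.calculateScore`.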
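The buffer-reuse optimization amounts to a grow-only allocation policy; a sketch, with a hypothetical `allocate` callback standing in for `pyopencl.Buffer` creation:

```python
def ensure_capacity(buf, needed_bytes, allocate):
    """Reuse `buf` if it is large enough; otherwise allocate a bigger one.

    `buf` is None or an object with a `.size` attribute in bytes (as
    pyopencl.Buffer has); `allocate(nbytes)` creates a new buffer
    (hypothetical stand-in for the real allocation call).
    """
    if buf is None or buf.size < needed_bytes:
        return allocate(needed_bytes)
    return buf
```

With this policy the number of allocations quickly levels off once the largest batch in the data set has been seen.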
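The traceback-skipping logic boils down to a host-side early exit; a sketch, assuming the host can read back the per-alignment maximum scores (the actual flag in the PR is produced device-side by `calculateScore`):

```python
def should_run_traceback(max_scores, min_score):
    """Launch the traceback kernel only if some alignment reached the threshold."""
    return any(score >= min_score for score in max_scores)
```

Since in most cases only a few percent of iterations have a qualifying score, the traceback launch and the directions-matrix transfer are skipped almost every iteration.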
Miscellaneous changes

1. Improved logging of the current process. As you can see, it now contains information about current and average performance, total time, and a projected "time of accomplishment".
2. Sorting of alignments in the result file. This is especially useful during development to check that results are not changed by the latest code change (using `diff`, for example).

Could you check how it works on your data (correctness and performance), especially large data sets?
Most of these optimizations can be applied to the OpenCL CPU and CUDA implementations as well. The current OpenCL GPU implementation works a few times faster than before. On our GPUs it also works a few times faster than the CUDA implementation.
For example, alignment of 340 vs 4284 sequences from canisLupusFamiliaris.faa_4284.zip (468 hits) takes:

without gap extension:

with gap extension (`-g -1 --filter_factor=0.3`):

Does it show the same improvement for you?
I use `--filter_factor=0.3` here because it looks like the default value (0.2) is too low: it generates too many "false positive" alignments (i.e. they are returned by `traceback` but very rarely meet all requirements). In that case optimizations 4 and 5 don't improve performance. Perhaps this parameter should have a different value when gap extension is enabled?

Feel free to ask if something in the description is not clear.