swarris / Pacasus

Correction of palindromes in long reads from PacBio and Nanopore
MIT License

Kernel <calculateScore> was not vectorized #4

Closed: BioJNO closed this issue 6 years ago

BioJNO commented 6 years ago

Hello,

I'm trying to run Pacasus in two situations: on an HPC cluster as an unprivileged user (CentOS 7, Intel(R) Xeon(R) CPU) and on my own laptop within a VM (Ubuntu 17.04 guest, Intel Core i5). In both cases I have Intel processors and no access to a GPU, so I'm trying to use the CPU only.

Using the Intel OpenCL ICD, in both cases I get numerous errors telling me that the calculateScore and traceback kernels did not vectorize. If I run with the AMD ICD I don't get these errors.

With either ICD, the number of compute units passed to Pacasus seems to make no difference at all in speed (typically ~7 sequences per minute). If I look at top I can see that multiple processes start, but they don't seem to use many resources. In pyopencl I can see that the Intel device has a preferred work group size of 128, while the AMD preferred work group size is 1, although I'm not certain what that means, if anything, in practice.

Is this expected behavior when using the CPU, or is there any way I can encourage Pacasus to use more of the resources available to it with either driver?

Intel test case command line: python pacasus.py --device_type=CPU --platform_name=Intel --framework=OpenCL --loglevel=DEBUG -L debuglog.txt -1 fastq subset.fastq

debuglog.txt

AMD test case command line: python /bin/Pacasus/pacasus.py --device_type=CPU --platform_name=AMD --framework=OpenCL --loglevel=DEBUG -L debuglogAMD.txt -o amd_out.fasta -1 fastq subset.fastq

debuglogAMD.txt

swarris commented 6 years ago

Hi,

Thank you for the questions. The log files are also very helpful, thanks!

First, the messages about vectorized kernels are not errors, but information from the underlying compiler. It tries to optimize the OpenCL code, and in cases where this is not possible, these messages appear. You can safely ignore them. Also, some time ago pyopencl became much more verbose, filling up the log files. If you'd like to see less once you have everything set up, raise the log level: --loglevel=INFO or, even better, --loglevel=ERROR

As for performance: the current version of Pacasus processes each sequence separately, which means quite a lot of overhead per sequence. In the near future I'll change this by processing more sequences at once, but this requires changes to the underlying pyPaSWAS code, which is geared towards all-vs-all comparisons. I see in the log files that your sequences are relatively short. Processing these short sequences requires only a short burst of CPU power, hence the low load on your CPU. One way of using more resources is to split the fastq file into several chunks and process these in parallel through the command line. If you have, for example, 24 cores, you should be able to run 6 Pacasus instances in parallel. If CPU load is still low because of the short sequences, you can increase this number. Memory usage for long sequences (>25k) can become an issue, restricting the number of processes you can run in parallel.
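A minimal sketch of the chunk-and-parallelize approach described above. The file names, chunk size, and the echo placeholder (standing in for the real pacasus.py command line) are illustrative assumptions, not part of Pacasus itself:

```shell
# Create a toy FASTQ file to demonstrate chunking (8 records, 4 lines each).
printf '@read%d\nACGT\n+\nIIII\n' 1 2 3 4 5 6 7 8 > reads.fastq

# Split into chunks of 2 reads each. A FASTQ record is 4 lines,
# so -l must be a multiple of 4 (here 2 reads * 4 lines = 8).
split -l 8 -d reads.fastq chunk_

# Launch one process per chunk in the background, then wait for all of them.
# The echo is a stand-in for the real invocation, e.g.:
#   python pacasus.py --device_type=CPU --platform_name=AMD --framework=OpenCL \
#       -o "${f}_out.fasta" -1 fastq "$f"
for f in chunk_*; do
  echo "would process $f" &
done
wait
```

With 24 cores you would typically cap the number of simultaneous background jobs at 6, as suggested above, rather than launching one per chunk unconditionally.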

Let me know if you run into more issues or have additional questions.

BioJNO commented 6 years ago

Ah, that makes sense, thank you very much!

I'll try chunking my data set.