swarris / Pacasus

Correction of palindromes in long reads from PacBio and Nanopore
MIT License

Not running with reads > 25kb? #6

Closed MargauxAlison closed 5 years ago

MargauxAlison commented 5 years ago

Hello,

First, thank you for developing Pacasus; I was very happy to discover such a tool. However, I encounter some problems when running Pacasus on my data set. It failed after a few seconds, and at the end of the output I receive:

```
DEBUG - pyopencl-invoker-cache-v1: in mem cache hit [key=d33c929dce56ea9970bc4c7caa8922ef69ff38ce589bb361354c4e44d4d5a39f]
INFO - build program: kernel 'calculateScore' was part of a lengthy source build resulting from a binary cache miss (0.26 s)
DEBUG - pyopencl-invoker-cache-v1: in mem cache hit [key=d33c929dce56ea9970bc4c7caa8922ef69ff38ce589bb361354c4e44d4d5a39f]
INFO - build program: kernel 'calculateScore' was part of a lengthy source build resulting from a binary cache miss (0.26 s)
ERROR - clEnqueueNDRangeKernel failed: <unknown error -9999>
Traceback (most recent call last):
  File "/usr/local/bioinfo/src/Pacasus/Pacasus-v1.1/pacasus.py", line 13, in <module>
    ppw.run()
  File "/usr/local/bioinfo/src/Pacasus/Pacasus-v1.1/pacasus/pacasusall.py", line 155, in run
    results.extend(self.program.process(query_sequences, target_sequences, self))
  File "/usr/local/bioinfo/src/Pacasus/Pacasus-v1.1/pypaswas/pyPaSWAS/Core/Programs.py", line 372, in process
    results = self.smith_waterman.align_sequences(cur_records_seq[:1], cur_targets, target_index)
  File "/usr/local/bioinfo/src/Pacasus/Pacasus-v1.1/pypaswas/pyPaSWAS/Core/SmithWaterman.py", line 507, in align_sequences
    self._calculate_score()
  File "/usr/local/bioinfo/src/Pacasus/Pacasus-v1.1/pypaswas/pyPaSWAS/Core/SmithWaterman.py", line 567, in _calculate_score
    self._execute_calculate_score_kernel(number_of_blocks, idx, idy)
  File "/usr/local/bioinfo/src/Pacasus/Pacasus-v1.1/pypaswas/pyPaSWAS/Core/SmithWatermanOcl.py", line 490, in _execute_calculate_score_kernel
    self.d_global_direction_zero_copy)
  File "/usr/local/bioinfo/src/Pacasus/Pacasus-v1.1_venv/lib/python2.7/site-packages/pyopencl-2017.2.2-py2.7-linux-x86_64.egg/pyopencl/cffi_cl.py", line 1765, in __call__
    return self._enqueue(self, queue, global_size, local_size, *args, **kwargs)
  File "<generated code>", line 175, in enqueue_knl_calculateScore
  File "/usr/local/bioinfo/src/Pacasus/Pacasus-v1.1_venv/lib/python2.7/site-packages/pyopencl-2017.2.2-py2.7-linux-x86_64.egg/pyopencl/cffi_cl.py", line 1951, in enqueue_nd_range_kernel
    global_work_size, local_work_size, c_wait_for, num_wait_for))
  File "/usr/local/bioinfo/src/Pacasus/Pacasus-v1.1_venv/lib/python2.7/site-packages/pyopencl-2017.2.2-py2.7-linux-x86_64.egg/pyopencl/cffi_cl.py", line 663, in _handle_error
    raise e
LogicError: clEnqueueNDRangeKernel failed: <unknown error -9999>
```

When I run it again after removing reads longer than 23 kb, it seems to work (it has been running for 17 hours, so I assume it is functioning). Is there a limit on read length in Pacasus? If not, do you have an idea of what the problem could be?

Thanks,

Best,

Margaux-Alison

swarris commented 5 years ago

Dear Margaux-Alison,

Thank you for your interest in Pacasus!

The maximum read length is limited by the amount of memory available on the computer (or GPU). Reads of up to 20kb usually fit in 4GB of RAM; longer reads need more memory. Currently, OpenCL on a CPU allows up to 64GB to be allocated, which should be enough for most PacBio datasets. ONT sets will contain reads that will not fit in memory.

This is all due to the quadratic size of the scoring matrix in the Smith-Waterman algorithm: to determine the start of the palindrome, we need to do a full traceback, and for that the entire matrix has to be in memory.
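As a rough back-of-the-envelope sketch (the exact bytes per cell depend on the pyPaSWAS data types, so `bytes_per_cell` below is an assumption, not the actual implementation):

```python
def sw_matrix_bytes(read_len, bytes_per_cell=4):
    """Rough size of a full Smith-Waterman scoring matrix for a read
    aligned against itself: read_len x read_len cells."""
    return read_len * read_len * bytes_per_cell

# Doubling the read length quadruples the matrix size.
for length in (20_000, 25_000, 40_000):
    print(f"{length:>6} bp -> ~{sw_matrix_bytes(length) / 2**30:.1f} GiB")
```

This is why a small increase in read length can push a run over the memory limit even though most reads process fine.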

What I also usually do is split up the read set into bins of different lengths. Short reads (up to 5kb) can then also be processed in parallel, speeding up the process.

Let me know if I can be of assistance in helping you process your reads. You can also send me an email (it's in the pre-print: https://www.biorxiv.org/content/early/2017/08/09/173872). The paper has just been accepted in BMC Genomics :-)

MargauxAlison commented 5 years ago

Dear Swarris,

Congrats on the paper!

And thank you for your answer. I was a bit confused by the error message because nothing indicated a memory problem, but now I understand. I will increase the memory and do as you suggested to speed up the analyses.

Thanks again

ericsong commented 4 years ago

Hey @swarris, do you have any tips for setting up OpenCL to run with 64GB of allocatable memory?

I installed these drivers (https://software.intel.com/en-us/articles/opencl-drivers#cpu-section) on an Ubuntu VM that should have sufficient memory, but clinfo returns:

```
  Global memory size                              63322882048 (58.97GiB)
  Error Correction support                        No
  Max memory allocation                           15830720512 (14.74GiB)
```

I've been unable to find a way to increase the max memory allocation.

swarris commented 4 years ago

Good question! I've managed to use more than 14GB on a particular system (running CentOS), but with OpenCL you are very much dependent on the OS, driver (version) and device. I'm very surprised that an 'open standard' works so differently across devices and drivers.... I'm still looking for help to lower the memory requirements of Smith-Waterman so that all read lengths can be handled on GPUs or basic desktop PCs. With ONT reads becoming longer and longer, this is becoming a real issue. PacBio HiFi reads are <25kb, so for now that should work for all reads.
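One observation on the clinfo numbers above: the reported cap is exactly one quarter of global memory, which, if I read the spec correctly, is the minimum value OpenCL requires for `CL_DEVICE_MAX_MEM_ALLOC_SIZE`; runtimes are free to report just that minimum, so the limit likely comes from the driver rather than the VM:

```python
# Values reported by clinfo on the Ubuntu VM above.
global_mem_size = 63_322_882_048  # "Global memory size"   (58.97 GiB)
max_mem_alloc = 15_830_720_512    # "Max memory allocation" (14.74 GiB)

# The OpenCL spec only guarantees CL_DEVICE_MAX_MEM_ALLOC_SIZE to be at
# least max(global_mem_size / 4, 128 MiB); this runtime reports exactly
# that quarter.
assert max_mem_alloc == global_mem_size // 4
print(max_mem_alloc / 2**30)  # ~14.74 GiB
```

So even with plenty of RAM, a single buffer larger than that quarter may be refused unless the driver happens to allow more.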