swarris / Pacasus

Correction of palindromes in long reads from PacBio and Nanopore
MIT License

Is there a small sample dataset? #17

Closed rsharris closed 4 years ago

rsharris commented 4 years ago

I've installed pacasus on a desktop mac, and am trying a small dataset with about 100 reads. It is failing, and it is difficult to troubleshoot since I have no idea what the expected output would be.

Specifically, I am using python 2 and have installed pyOpenCL version 2019.1.2 (via pip). My processor is an intel core i5. I have no GPU that is useful for this program (as best as I can tell). My pacasus command is

    ${pacasus} reads.fa -o reads.cleaned.fa \
      --device_type=CPU --platform_name=Intel --framework=opencl --loglevel=DEBUG

And in the early part of the output I see

    DEBUG - Found platform <pyopencl.Platform 'Apple' at 0x7fff0000>, however this is not the platform indicated by the user

and then later it fails with

    ERROR - clEnqueueNDRangeKernel failed: INVALID_WORK_GROUP_SIZE

The traceback shows an error in

    /Users/someuser/Library/Python/2.7/lib/python/site-packages/pyopencl/__init__.py", line 840, in kernel_call

and the next thing up the traceback was

    /Users/someuser/development/pacasus/Pacasus.1.2/pypaswas/pyPaSWAS/Core/SmithWatermanOcl.py", line 487, in _execute_calculate_score_kernel
        self.d_global_direction_zero_copy)
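For context on INVALID_WORK_GROUP_SIZE: in OpenCL 1.x, clEnqueueNDRangeKernel returns this error when, among other causes, a global size is not a multiple of the matching local size, or when the work-group holds more work-items than the device or kernel allows. The sketch below checks only those two conditions in plain Python. The function name and the example sizes are hypothetical and are not taken from pyPaSWAS, and nothing here establishes which condition caused the failure above.

```python
def work_group_problems(global_size, local_size, max_work_group_size):
    """Return a list of reasons why an ND-range launch would be rejected.

    Only two common OpenCL 1.x causes of INVALID_WORK_GROUP_SIZE are checked.
    This is an illustrative helper, not part of pyPaSWAS or pyopencl.
    """
    problems = []
    if len(global_size) != len(local_size):
        problems.append("global and local sizes have different dimensions")
        return problems
    # Each global dimension must be an exact multiple of the local dimension.
    for dim, (g, l) in enumerate(zip(global_size, local_size)):
        if l <= 0 or g % l != 0:
            problems.append(
                "dimension {}: {} is not a multiple of {}".format(dim, g, l))
    # The work-group as a whole must not exceed the device limit.
    group_items = 1
    for l in local_size:
        group_items *= l
    if group_items > max_work_group_size:
        problems.append(
            "work-group has {} items, limit is {}".format(
                group_items, max_work_group_size))
    return problems


# Hypothetical example: a 16x16 work-group on a device limited to 128 items.
print(work_group_problems((256, 256), (16, 16), 128))
```

With pyopencl, the real limits can be read from `device.max_work_group_size` and from `kernel.get_work_group_info(pyopencl.kernel_work_group_info.WORK_GROUP_SIZE, device)`.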

swarris commented 4 years ago

Pacasus has not been tested on a Mac. Although OpenCL should be platform independent, in practice it is not. Your Mac tells pacasus its platform is Apple, but the available options are Intel (as you indicated on the command line) or NVIDIA. I'm not aware of any other Mac users with OpenCL, so I cannot debug or test this myself. What you could do is change the device type to accelerator i.s.o. CPU. This will use the basic OpenCL implementation, which is not optimized for Intel CPUs.
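To see which platform names a machine actually reports, pyopencl can enumerate them directly. The following is a minimal sketch that assumes pyopencl is importable and returns an empty list otherwise. The `pick_platform` helper is hypothetical: it shows a case-insensitive substring match and is not necessarily how pyPaSWAS compares the name against `--platform_name`.

```python
def list_opencl_platforms():
    """Return [(platform_name, [(device_name, device_type), ...]), ...].

    Returns an empty list if pyopencl is missing or no platform is found.
    """
    try:
        import pyopencl as cl
    except ImportError:
        return []
    result = []
    try:
        for platform in cl.get_platforms():
            devices = [(dev.name, cl.device_type.to_string(dev.type))
                       for dev in platform.get_devices()]
            result.append((platform.name, devices))
    except cl.Error:
        # Raised, for example, as PLATFORM_NOT_FOUND_KHR when no driver is registered.
        return []
    return result


def pick_platform(platform_names, wanted):
    """Hypothetical matcher: first name containing `wanted`, ignoring case."""
    for name in platform_names:
        if wanted.lower() in name.lower():
            return name
    return None


for name, devices in list_opencl_platforms():
    print(name)
    for dev_name, dev_type in devices:
        print("  {} ({})".format(dev_name, dev_type))
```

On the Mac described above, the DEBUG line suggests the only reported platform name is 'Apple', so a request for "Intel" would match nothing under this kind of comparison.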

rsharris commented 4 years ago

Thanks. I'm not clear what you mean by "change the device type to accelerator i.s.o. CPU". What does "i.s.o." mean in this context? Does it mean some standard version of OpenCL, for example one from ISO (the standards organization)?

rsharris commented 4 years ago

Ah ... you must mean I should use --device_type=ACCELARATOR. I didn't realize there were choices other than the --device_type=[CPU|GPU] listed in the readme. But now I see in the -h listing there is a third option, --device_type=ACCELARATOR. I will try that.

Sorry I didn't recognize "i.s.o." as an acronym for "instead of". I've never seen that, and my attempts at getting google to explain it were not initially fruitful.

swarris commented 4 years ago

The device type you selected is a CPU (--device_type=CPU). This selects a highly optimized OpenCL code base for Intel CPUs, but it could well be that it relies on implementation details that are not available for Intel devices in a Mac. The 'accelerator' option (--device_type=accelerator) uses an OpenCL code base that relies on as few device-dependent implementation options as possible. It has been tested on Intel CPUs (Linux) and Intel Xeon Phi accelerator cards (Linux main OS). See the pyPaSWAS paper for more details.

As to your original question, the data for the Pacasus paper are available through the EBI.

swarris commented 4 years ago

Sorry, yes the Pacasus readme does not show this option, but it is available through pyPaSWAS (the alignment module used).

rsharris commented 4 years ago

Thanks,

Just now tried that, and I now get an INVALID_BUFFER_SIZE failure at SmithWatermanOcl.py", line 547.

A couple questions ... (1) Is there any difference between "--device_type=accelerator" and "--device_type=ACCELARATOR" (uppercase and different spelling)? (2) If I use this device_type should I still use "--platform_name=Intel --framework=opencl" Or should I get rid of those?

The output log before the traceback:

INFO - Target sequences OK.
INFO - Processing 1- vs 1-sequences
DEBUG - Fixing palindrome sequences...
DEBUG - Initializing hitlist...
DEBUG - Initializing hitlist OK.
DEBUG - Total memory on Device: 512.0
DEBUG - Initializing normal device memory.
ERROR - create_buffer failed: INVALID_BUFFER_SIZE

What units are "Total memory on Device" shown in? The machine has 16G of physical memory.

rsharris commented 4 years ago

Decided to try this on a different machine, one running linux.

My command:

    python some_path_to/pacasus.py \
      reads.fastq --filetype1 fastq \
      -o reads.cleaned.fa \
      --device_type=ACCELARATOR --platform_name=Intel --framework=opencl \
      --loglevel=DEBUG

and pyPaSWAS/Core/SmithWatermanOcl.py throws LogicError with this text: ERROR - clGetPlatformIDs failed: PLATFORM_NOT_FOUND_KHR

The processors on this machine are all Intel Xeons. (as reported by /proc/cpuinfo). So I'm at a loss as to what I should specify for --platform_name if "Intel" is not the right choice.

swarris commented 4 years ago

> Just now tried that, and I now get an INVALID_BUFFER_SIZE failure at SmithWatermanOcl.py", line 547.
>
> A couple questions ... (1) Is there any difference between "--device_type=accelerator" and "--device_type=ACCELARATOR" (uppercase and different spelling)? (2) If I use this device_type should I still use "--platform_name=Intel --framework=opencl" Or should I get rid of those?

First, the spelling mistake is indeed a mistake and will not work. For the accelerator type you can leave the other options to default.

> The output log before the traceback:
>
> INFO - Target sequences OK.
> INFO - Processing 1- vs 1-sequences
> DEBUG - Fixing palindrome sequences...
> DEBUG - Initializing hitlist...
> DEBUG - Initializing hitlist OK.
> DEBUG - Total memory on Device: 512.0
> DEBUG - Initializing normal device memory.
> ERROR - create_buffer failed: INVALID_BUFFER_SIZE
>
> What units are "Total memory on Device" shown in? The machine has 16G of physical memory.

It is in MB. Your device driver does not allow you to allocate more. The memory limits depend on the driver (and its version) and on the OpenCL implementation.
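The limits a driver reports can be queried independently of Pacasus. The sketch below assumes pyopencl is importable and converts two standard device properties, `global_mem_size` and `max_mem_alloc_size`, from bytes to MB (taken here as 1024 * 1024 bytes). Which of these, if either, pyPaSWAS prints as "Total memory on Device" is not confirmed in this thread.

```python
MB = 1024 * 1024


def bytes_to_mb(n_bytes):
    """Convert a byte count to megabytes (1 MB taken as 1024 * 1024 bytes)."""
    return n_bytes / float(MB)


def device_memory_report():
    """Return [(device_name, global_mem_mb, max_single_alloc_mb), ...].

    Returns an empty list if pyopencl is missing or no platform is found.
    """
    try:
        import pyopencl as cl
    except ImportError:
        return []
    report = []
    try:
        for platform in cl.get_platforms():
            for dev in platform.get_devices():
                report.append((dev.name,
                               bytes_to_mb(dev.global_mem_size),
                               bytes_to_mb(dev.max_mem_alloc_size)))
    except cl.Error:
        return []
    return report


for name, total_mb, max_alloc_mb in device_memory_report():
    print("{}: total {:.1f} MB, largest single buffer {:.1f} MB".format(
        name, total_mb, max_alloc_mb))
```

A single buffer request larger than the reported `max_mem_alloc_size` is one documented cause of INVALID_BUFFER_SIZE, regardless of how much physical memory the host has.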

swarris commented 4 years ago

> python some_path_to/pacasus.py \
>   reads.fastq --filetype1 fastq \
>   -o reads.cleaned.fa \
>   --device_type=ACCELARATOR --platform_name=Intel --framework=opencl \
>   --loglevel=DEBUG
>
> and pyPaSWAS/Core/SmithWatermanOcl.py throws LogicError with this text: ERROR - clGetPlatformIDs failed: PLATFORM_NOT_FOUND_KHR
>
> The processors on this machine are all Intel Xeons. (as reported by /proc/cpuinfo). So I'm at a loss as to what I should specify for --platform_name if "Intel" is not the right choice.

An Intel Xeon is a CPU and an Intel Xeon Phi is (was) an accelerator card. So on this system your original command line should work:

    ${pacasus} reads.fa -o reads.cleaned.fa \
      --device_type=CPU --platform_name=Intel --framework=opencl --loglevel=DEBUG

Sorry about all the mess about all the different devices!

rsharris commented 4 years ago

> First, the spelling mistake is indeed a mistake and will not work.

To be clear, my experimentation shows that one has to use the misspelt option, i.e. device_type=ACCELARATOR. Other spellings aren't recognized (because the submodule only recognizes the wrong spelling).

> For the accelerator type you can leave the other options to default.

I tried this on the mac:

    python some_path_to/pacasus.py \
      reads.fastq --filetype1 fastq \
      -o reads.cleaned.fa \
      --device_type=ACCELARATOR \
      --loglevel=DEBUG

It quickly fails since it cannot import pycuda.driver.

I'm giving up on getting this running on the mac. I had originally thought that since my problems will be small that I'd be able to run this on any old machine. (In the pipeline I'm trying to put together, I expect to typically have fewer than 200 reads.) It's clear my assumption was faulty.

rsharris commented 4 years ago

> So on this system your original command line should work:
>
> python pacasus reads.fa -o reads.cleaned.fa \
>      --device_type=CPU --platform_name=Intel --framework=opencl --loglevel=DEBUG

I tried that now, on the Intel Xeon machine. I still get PLATFORM_NOT_FOUND_KHR.

I think I need to step back and test my install of both opencl and pyPaSWAS on this machine, irrespective of pacasus. Something apparently isn't working in those underpinnings.
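One way to test the OpenCL install on its own: PLATFORM_NOT_FOUND_KHR comes from the OpenCL ICD loader and means no vendor driver is registered with it. On Linux the loader usually discovers drivers through `.icd` files, and `/etc/OpenCL/vendors` is the common default directory. That path is an assumption for this particular machine, since some installs configure a different location. A minimal sketch that lists what is registered:

```python
import glob
import os


def find_icd_files(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd_file_name: library path or name written inside the file}.

    The default directory is the usual Linux location, assumed here.
    """
    found = {}
    for path in sorted(glob.glob(os.path.join(vendors_dir, "*.icd"))):
        with open(path) as handle:
            found[os.path.basename(path)] = handle.read().strip()
    return found


icds = find_icd_files()
if not icds:
    print("No .icd files found; the ICD loader would have no platforms to report.")
for name, library in icds.items():
    print("{} -> {}".format(name, library))
```

If the directory is empty, an Intel CPU alone is not enough: an OpenCL runtime for that CPU has to be installed before any platform name, "Intel" or otherwise, can be matched.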

Having said that, I'm wondering if it is even worth it for me to try to resolve all this rather than just develop something that does what pacasus does using another aligner (e.g. minimap2 or lastz). It looks like the implementation of pacasus is heavily intertwined with pyPaSWAS and openCL. While that makes sense for large-scale data sets, my data sets will be very small, and the overhead of trying to get these packages to work together looks like it would be an ongoing headache if I want to be able to deliver a pipeline that people can run on diverse architectures.

swarris commented 4 years ago

Sorry to hear you ran into so much trouble. From my own experience, if an opencl/cuda installation works immediately, all is fine. But when something goes wrong, it can be very difficult to fix the problems.

A different mapper should also work, but please be aware that most mappers cannot handle a 20-40% error rate with ONT/PacBio reads. The rate is double the per-read error rate, because the read is aligned to itself and both aligned parts carry errors. Let me know if you need any further assistance. I can also do the analyses if needed, no problem (s.warris@gmail.com).

rsharris commented 4 years ago

(reopened only to make this final reply)

> be aware that most mappers cannot handle a 20-40% error rate with ONT/PacBio reads

Good advice, and I'm definitely aware of that. In a previous project, I observed 15% error for PacBio reads from genome-in-a-bottle, read vs reference, which would put the read-to-read error rate at 30%. I'm in the process of estimating this for more recent PacBio sequencing, and my preliminary results for one dataset (at least 1 year old at this point) suggest it was still in the 13-14% range.
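The doubling can be written out as a small calculation. The sketch below compares the simple additive estimate (2e) with a slightly tighter one that assumes errors in the two reads are independent, so a position matches only when both copies are correct. The independence model and the function name are illustrative assumptions, and the model ignores the rare case where both reads carry the same error.

```python
def read_vs_read_error(per_read_error):
    """Return (additive, independent) estimates of the read-vs-read error rate.

    additive:    2 * e, the rule of thumb used in the discussion above
    independent: 1 - (1 - e)**2, assuming errors in the two reads are independent
    """
    additive = 2.0 * per_read_error
    independent = 1.0 - (1.0 - per_read_error) ** 2
    return additive, independent


for e in (0.13, 0.14, 0.15):
    additive, independent = read_vs_read_error(e)
    print("per-read {:.0%}: additive {:.1%}, independent model {:.1%}".format(
        e, additive, independent))
```

For a 15% per-read rate the two estimates are 30% and about 27.8%, so the rule of thumb slightly overstates the rate but stays in the same range.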

Minimap2 has a preset that is allegedly tuned for PacBio read-vs-read. But having used it as part of a miniasm pipeline I'm having doubts about how good it is. Lastz (which I wrote) can definitely deal with these error rates, but would have some shortcomings doing all-vs-all on large datasets.