RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

caldodge commented 11 months ago

This is on a Red Hat 8.6 system with an Nvidia A30. The Python version is 3.9. The Torch version is 2.0.1 (installed with pip). The ribodetector version is 0.2.7 (installed with pip).

We run the following command on some sample data:

ribodetector -d 0 -l 100 -i SRR14098566.fastq.gz -o test.fastq.gz

It fails. Here's the complete program output:

2023-08-16 17:19:43 : INFO Using high MCC model file: /apps/ribodetector/0.2.7/ribodetector/data/ribodetector_600k_variable_len70_101_epoch47.pth
2023-08-16 17:19:44 : INFO Model using cuda for read length 100 loaded
2023-08-16 17:19:45 : INFO Choose batch size: 32768 based on the given GPU RAM size 32GB and max read length 100
2023-08-16 17:20:00 : INFO 5933995 sequences loaded!
2023-08-16 17:20:00 : INFO Writing output non-rRNA sequences into file: test.fastq.gz
0%| | 0/182 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/apps/ribodetector/0.2.7/bin/ribodetector", line 8, in <module>
    sys.exit(main())
  File "/apps/ribodetector/0.2.7/ribodetector/detect.py", line 726, in main
    seq_pred.detect()
  File "/apps/ribodetector/0.2.7/ribodetector/detect.py", line 501, in detect
    self.run()
  File "/apps/ribodetector/0.2.7/ribodetector/detect.py", line 260, in run
    output = self.model(
  File "/apps/torch/2.0.1/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/apps/ribodetector/0.2.7/ribodetector/model/model.py", line 34, in forward1
    last_out = last_items(pack=r_out, unsort=True)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/apps/ribodetector/0.2.7/ribodetector/model/model.py", line 118, in last_items
    indices = sorted_last_indices(pack=pack)
    if unsort and pack.unsorted_indices is not None:
        indices = indices[pack.unsorted_indices]
    return pack.data[indices]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Does this indicate a problem with the data, or a bug in ribodetector? We get the same result if we omit "-d 0" from the command line.

dawnmy commented 11 months ago

This seems to be the same issue as https://github.com/hzi-bifo/RiboDetector/issues/34.

dawnmy commented 11 months ago

Could you check if there is any CUDA device available with echo $CUDA_VISIBLE_DEVICES, then check which version of CUDA you are using with nvcc --version?
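For anyone who prefers to collect both values in one go, here is a minimal sketch using only the Python standard library. The function name `cuda_environment_report` is made up for illustration and is not part of ribodetector or PyTorch. It reads the same environment variable and runs the same `nvcc --version` command mentioned above.

```python
import os
import shutil
import subprocess


def cuda_environment_report() -> dict:
    """Collect CUDA_VISIBLE_DEVICES and the nvcc version without importing torch."""
    report = {
        # None means the variable is unset, so CUDA exposes all GPUs by default.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # None means nvcc is not on PATH.
        "nvcc_path": shutil.which("nvcc"),
        "nvcc_version": None,
    }
    if report["nvcc_path"] is not None:
        result = subprocess.run(
            [report["nvcc_path"], "--version"],
            capture_output=True,
            text=True,
            check=False,
        )
        report["nvcc_version"] = result.stdout.strip()
    return report


if __name__ == "__main__":
    for key, value in cuda_environment_report().items():
        print(f"{key}: {value}")
```

Note that a missing nvcc does not by itself mean PyTorch lacks CUDA support, since pip and conda builds of PyTorch ship their own CUDA runtime.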

You can also set the --chunk_size 256 and -m 8 parameters to avoid out-of-memory issues. The values can be adjusted according to your system memory and GPU memory.

gohweixun commented 10 months ago

Hi @dawnmy, I seem to be encountering the same issue.

I'm running python=3.9, pytorch=2.0.1 installed through conda. When I run python and check CUDA status through torch.cuda, it shows the current device and that it is available.

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/user/miniconda3/envs/epicall-nextflow/lib/python3.9/site-packages/ribodetector/model/model.py", line 118, in last_items
    indices = sorted_last_indices(pack=pack)
    if unsort and pack.unsorted_indices is not None:
        indices = indices[pack.unsorted_indices]
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return pack.data[indices]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

❯ python
Python 3.9.17 | packaged by conda-forge | (main, Aug 10 2023, 07:02:31)
[GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
0
>>>

dawnmy commented 10 months ago

After some Googling, I realized this might be a compatibility issue between the installed PyTorch version and the CUDA version. CUDA 11.7 in particular might cause the problem.

gohweixun commented 10 months ago

I'm not sure where the incompatibility might lie, but I tried out the previous pytorch installations and was able to get it working by installing the following dependencies prior to installing ribodetector:

mamba install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge

cudatoolkit-dev (which installs nvcc) is not required.

The issue showed up when installing the latest version of Pytorch as per the Pytorch website, both with CUDA 11.7 and CUDA 11.8.

I do not think it is a direct result of incompatibility between the Pytorch version and CUDA version, since the issues that I had showed up when installing via the official Pytorch installation instructions. Perhaps either the newer versions of Pytorch or CUDA are causing these issues.

Edit: On further testing, it looks like it may be an issue caused by the newer versions of Pytorch. I've found that Pytorch 1.12.1 works, while Pytorch 1.13.1 does not. Both are running CUDA 11.6. Here are some findings from my tests:

WORKS:

mamba install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge

DOES NOT WORK:

# Latest Pytorch/CUDA installation as per standard Pytorch installation instructions (Pytorch 2.0.1, CUDA 11.8/11.7)
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# CUDA 11.6
mamba install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

# CUDA 11.7
mamba install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
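The results above can be turned into a quick lookup for checking an existing environment. This is only a sketch: the names `REPORTED` and `reported_status` are hypothetical, and the table contains nothing beyond the three PyTorch versions reported in this thread (2.0.1 was also reported failing in the original post). Any other version is returned as "unknown" rather than guessed.

```python
import re

# Outcomes reported in this thread only; other versions were not tested here.
REPORTED = {
    (1, 12, 1): "works",
    (1, 13, 1): "fails",
    (2, 0, 1): "fails",
}


def reported_status(version: str) -> str:
    """Map a version string such as '2.0.1+cu118' to the outcome reported above."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return "unknown"
    key = tuple(int(part) for part in match.groups())
    return REPORTED.get(key, "unknown")


if __name__ == "__main__":
    for candidate in ("1.12.1", "1.13.1", "2.0.1+cu118", "1.10.0"):
        print(candidate, "->", reported_status(candidate))
```

In a real environment the argument would be `torch.__version__`.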

dawnmy commented 10 months ago

@gohweixun Thank you for taking the time to conduct such a thorough test. Based on your observations, it appears there might be an issue with RiboDetector when using a newer version of PyTorch. I'll be looking into this on my end and will subsequently update the installation guide as well as the conda requirements to solve this issue.

Thanks again for bringing this to our attention!

jarffery commented 10 months ago

Hi @dawnmy,

For this issue, I added a new line in model.py that moves indices to the same device as pack.data, which might fix the problem:

@jit.script
def last_items(pack: PackedSequence, unsort: bool) -> Tensor:
    indices = sorted_last_indices(pack=pack)
    if unsort and pack.unsorted_indices is not None:
        # Move indices to the same device as pack.data
        indices = indices.to(pack.data.device)
        indices = indices[pack.unsorted_indices]
    return pack.data[indices]
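To try the idea outside of ribodetector, here is a small self-contained sketch. It rests on several assumptions: `sorted_last_indices` below is a hypothetical stand-in written for this example (the project's own helper is not shown in this thread), the `@jit.script` decorator is left out for simplicity, and the toy input is invented. One plausible reading of the traceback is that the indices are derived from `pack.batch_sizes`, which PyTorch keeps on the CPU, while `pack.unsorted_indices` follows the data onto the GPU.

```python
import torch
from torch import Tensor
from torch.nn.utils.rnn import PackedSequence, pack_padded_sequence


def sorted_last_indices(pack: PackedSequence) -> Tensor:
    # Hypothetical stand-in for the project's helper, not the original code.
    batch_sizes = pack.batch_sizes  # PyTorch keeps this tensor on the CPU
    seq_ids = torch.arange(int(batch_sizes[0]))
    # Position in pack.data where each time step starts.
    offsets = torch.cumsum(batch_sizes, dim=0) - batch_sizes
    # Length of each sorted sequence: number of time steps that still include it.
    lengths = (batch_sizes.unsqueeze(1) > seq_ids.unsqueeze(0)).sum(dim=0)
    return offsets[lengths - 1] + seq_ids


def last_items(pack: PackedSequence, unsort: bool) -> Tensor:
    indices = sorted_last_indices(pack)
    if unsort and pack.unsorted_indices is not None:
        # The proposed fix: align devices before indexing one tensor with another.
        indices = indices.to(pack.data.device)
        indices = indices[pack.unsorted_indices]
    return pack.data[indices]


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Three sequences of lengths 3, 1 and 2, zero-padded, with one feature each.
padded = torch.tensor(
    [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0], [5.0, 6.0, 0.0]]
).unsqueeze(-1)
pack = pack_padded_sequence(
    padded.to(device), lengths=[3, 1, 2], batch_first=True, enforce_sorted=False
)
last = last_items(pack, unsort=True).squeeze(-1).cpu()
print(last.tolist())
```

With this toy input the script should print the last real element of each sequence in the original order, [3.0, 4.0, 6.0]. On a CPU-only machine every tensor is already on the same device, so the original error can only be reproduced on a GPU with the `.to(...)` line removed.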

dawnmy commented 9 months ago

Thank you @jarffery for the suggestion. Has this fix been tested with both older (1.6-1.9) and newer PyTorch versions? I will test it and update later.

dawnmy commented 8 months ago

The fix still needs to be tested.

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu) #40