gsneha26 / SegAlign

A Scalable GPU-Based Whole Genome Aligner, published in SC20: https://doi.ieeecomputersociety.org/10.1109/SC41405.2020.00043
MIT License

thrust::system::system_error | CUDA free failed: cudaErrorCudartUnloading #59

Open chenhijy opened 1 year ago

chenhijy commented 1 year ago
[2023-09-27T10:02:11-0700] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'LastzRepeatMaskJob' kind-LastzRepeatMaskJob/instance-r63tene1 v11 with ID kind-LastzRepeatMaskJob/instance-r63tene1 to 0
...
Log from job "'LastzRepeatMaskJob' kind-LastzRepeatMaskJob/instance-r63tene1 v12" follows:
=========>
...
      File "/home/cactus/cactus_env/lib/python3.8/site-packages/cactus/preprocessor/lastzRepeatMasking/cactus_lastzRepeatMask.py", line 130, in gpuRepeatMask
        segalign_messages = cactus_call(parameters=cmd, work_dir=self.work_dir, returnStdErr=True, gpus=self.repeatMaskOptions.gpu,
      File "/home/cactus/cactus_env/lib/python3.8/site-packages/cactus/shared/common.py", line 889, in cactus_call
        raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
    RuntimeError: Command /usr/bin/time -f "CACTUS-LOGGED-MEMORY-IN-KB: %M" segalign_repeat_masker /tmp/58f5d3ffa02e55c3b06625f0f8626408/0d5a/937a/tmpfg2qo5qy/gSojMU042_0_0.tgt --lastz_interval=10000000 --markend --neighbor_proportion 0.2 --M 10 --step=3 --ambiguous=iupac,100,100 --num_gpu 1 exited 134: stderr=Using 64 threads
...
    Error: cudaMemcpy of 4 bytes for num_anchors failed with error " invalid argument " 
    terminate called after throwing an instance of 'thrust::system::system_error'
      what():  CUDA free failed: cudaErrorCudartUnloading: driver shutting down
    Command terminated by signal 6
    CACTUS-LOGGED-MEMORY-IN-KB: 69902308
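
A note on reading the exit status above: the reported `exited 134` and the `Command terminated by signal 6` line appear to describe the same event. Assuming the usual shell convention (also followed by GNU `time`) of reporting a signal death as 128 plus the signal number, 134 corresponds to signal 6, which is SIGABRT on Linux and matches the `terminate called after throwing an instance of ...` message. The sketch below only illustrates that arithmetic; the helper names `signal_from_exit_status` and `describe` are hypothetical and are not part of cactus or SegAlign.

```python
import signal
from typing import Optional


def signal_from_exit_status(status: int) -> Optional[int]:
    """Return the signal number encoded in a shell-style exit status, or None.

    Assumption: the status follows the "128 + signal number" convention.
    """
    if status > 128:
        return status - 128
    return None


def describe(status: int) -> str:
    """Build a short human-readable description of an exit status."""
    signum = signal_from_exit_status(status)
    if signum is None:
        return f"exit status {status}: not a signal-style status"
    try:
        # Look up the platform's name for this signal number.
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown"
    return f"exit status {status}: terminated by signal {signum} ({name})"


print(describe(134))
```

This only decodes how the process died. It says nothing about why the earlier `cudaMemcpy` call returned "invalid argument", which is the first error in the log.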

My OS is Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-153-generic x86_64).

Some specs for the GPU I'm using:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   36C    P0    48W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+