Running a job for a long time without output (kunpeng920 CPU)

zhoujingyu13687306871 commented 1 year ago

Dear author: I submitted the job to run on a single node of the cluster, but after a long time there is still no output. The node's CPU is aarch64 (model Kunpeng 920) and the GPU is an A100-40 PCIe. The CPU information and the job script are as follows:

[scx6299@paraai-n32-h-01-agent-1 dorado-test]$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  64-bit
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    4
Vendor ID:                       HiSilicon
Model:                           0
Model name:                      Kunpeng-920
Stepping:                        0x1
BogoMIPS:                        200.00
L1d cache:                       8 MiB
L1i cache:                       8 MiB
L2 cache:                        64 MiB
L3 cache:                        128 MiB
NUMA node0 CPU(s):               0-31
NUMA node1 CPU(s):               32-63
NUMA node2 CPU(s):               64-95
NUMA node3 CPU(s):               96-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs

#!/bin/bash
#SBATCH -J dorado-test
#SBATCH -N 1
#SBATCH --gpus=2
#SBATCH -n 64
module purge
module load compilers/cuda/11.7 compilers/gcc/11.3.0 anaconda/2021.11 cudnn/8.4.0.27_cuda11.x
source activate pytorch-2.0
export LD_LIBRARY_PATH=/home/bingxing2/home/scx6299/software/nccl-2.17.1-1/build/lib:$LD_LIBRARY_PATH
export CPATH=/home/bingxing2/home/scx6299/software/nccl-2.17.1-1/build/include:$CPATH
export LD_LIBRARY_PATH=/home/bingxing2/home/scx6299/software/hdf5-serial/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/home/bingxing2/home/scx6299/software/hdf5-serial/lib:$LIBRARY_PATH
export PATH=/home/bingxing2/home/scx6299/software/hdf5-serial/bin:$PATH
export CPATH=/home/bingxing2/home/scx6299/software/hdf5-serial/include:$CPATH
export PATH=/home/bingxing2/home/scx6299/software/dorado-install/bin:$PATH
export LD_LIBRARY_PATH=/home/bingxing2/home/scx6299/software/dorado-install/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/home/bingxing2/home/scx6299/software/dorado-install/lib:$LIBRARY_PATH

cp -r pod5_pass/PAQ21605_pass__ce971a82_ad9362d2_559.pod5 /dev/shm
ls /dev/shm

dorado basecaller --device cuda:0,1 /home/bingxing2/home/scx6299/dorado-test/model/dna_r10.4.1_e8.2_400bps_sup@v4.1.0 /dev/shm/ --modified-bases 5mCG_5hmCG --verbose > 20230706/pass.bam

After running for 1 hour, there is only debug output and no real results, as shown in the figure below: the debug output is on the left and the GPU utilization is on the right. The figure below that shows the CPU utilization; the process stays in the S state for a long time. I don't know whether this is caused by the CPU instruction set or by the system page size (<jemalloc>: Unsupported system page size). I hope to get your reply, thank you!
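
For reference, the page size the kernel is using can be read directly on the compute node. This is a minimal sketch, assuming a Linux node where getconf is available; the helper names are illustrative and not part of dorado:

```shell
#!/bin/bash
# Print the kernel page size in bytes.
page_size_bytes() {
    getconf PAGESIZE
}

# Convert a byte count to KiB (integer division).
bytes_to_kib() {
    echo $(( $1 / 1024 ))
}

ps_bytes=$(page_size_bytes)
echo "page size: ${ps_bytes} bytes ($(bytes_to_kib "$ps_bytes") KiB)"
```

A 64K-page system reports 65536 bytes here; many other Linux systems report 4096.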

tijyojwad commented 1 year ago

Which version of dorado are you using? How large is your input?

It's possible dorado is collecting some metadata from the pod5s first and that's taking a while. Is your data on an external disk? Can you try running with a smaller dataset for debugging?

zhoujingyu13687306871 commented 1 year ago

The dorado version is 0.3.1. /dev/shm contains only one pod5 file, about 1.2 GB in size. Is the data on an external disk? No, I copied the pod5 file to /dev/shm on the local host.

tijyojwad commented 1 year ago

Setup looks good to me.

I did some digging online about the jemalloc: Unsupported page size issue, and there are some reports of incompatibility with aarch64 processors. I'm not sure yet whether that's the same problem you're seeing.

Can you also try to run with -x cpu? This will force basecalling on CPU (would be very slow) but we can check if it's making any progress. If it doesn't, then at least it's not a CUDA issue.

zhoujingyu13687306871 commented 1 year ago

Yes, I found the jemalloc: Unsupported page size issue online, so I added export MALLOC_CONF=lg_dirty_mult:-1 to my script, but it doesn't work.

I will try running with -x cpu, but the node's resources are exhausted at the moment, so please wait a while.

zhoujingyu13687306871 commented 1 year ago

Dear author, I would like to ask: is it possible to add a build for aarch64 systems with a 64K page size to the dorado binary and source code distributions? That might completely solve the "jemalloc: Unsupported page size" issue.

zhoujingyu13687306871 commented 1 year ago

I added '-x cpu' to the script. After the script had run for 1 hour, there was still no useful output:

cat slurm-33744.out
cuda-11.7 loaded successful
gcc-11.3.0 loaded successful
<jemalloc>: Unsupported system page size
[2023-07-08 22:09:09.148] [debug] - matching modification model found: dna_r10.4.1_e8.2_400bps_sup@v4.1.0_5mCG_5hmCG@v2
[2023-07-08 22:09:09.149] [info] > Creating basecall pipeline
[2023-07-08 22:09:09.164] [debug] - CPU calling: set batch size to 128, num_runners to 128

and there is no CPU utilization:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2520386 scx6299   20   0  231104  13824   5376 R   0.7   0.0   0:03.00 top
2490194 scx6299   20   0  215040   4416   3072 S   0.0   0.0   0:00.00 slurm_script
2503034 scx6299   20   0  290240  61952   3648 S   0.0   0.0   0:00.01 sshd
2503035 scx6299   20   0  228416  14272   5696 S   0.0   0.0   0:00.02 bash
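
To look at the state of a specific process instead of scanning top, the kernel's per-process status files can be read directly. This is a minimal sketch, assuming a Linux node with /proc mounted; the function names are illustrative, and the process name dorado is taken from the command used in this thread:

```shell
#!/bin/bash
# Print the one-letter scheduler state of a PID, read from /proc.
# R = running, S = interruptible sleep, D = uninterruptible sleep (often I/O).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Print "PID STATE" for every process whose command name matches exactly.
report_by_name() {
    local name=$1 comm_file pid
    for comm_file in /proc/[0-9]*/comm; do
        if [ "$(cat "$comm_file" 2>/dev/null)" = "$name" ]; then
            pid=${comm_file#/proc/}
            pid=${pid%/comm}
            echo "$pid $(proc_state "$pid")"
        fi
    done
}

report_by_name dorado   # prints nothing if no such process exists
```

A process that never leaves S is waiting on something rather than computing, while D usually points at blocked I/O.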

tijyojwad commented 1 year ago

Dear author, I would like to ask: is it possible to add a build for aarch64 systems with a 64K page size to the dorado binary and source code distributions? That might completely solve the "jemalloc: Unsupported page size" issue.

Hmm I'm not sure what this would entail tbh. It feels more like something jemalloc would have to support rather than something we can add in dorado.

The fact that it's not making any progress with CPU either makes me think of I/O issues. Have you tried to run dorado (same binary) in any other environment? I can suggest the following -

1. Run dorado on a local machine instead of the cluster, with the data local as well
2. Copy the data to /tmp in your HPC job first and then run dorado on the copied data
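
The second suggestion could look roughly like the following inside a job script. This is only a sketch: stage_input, the scratch directory, and the example file names are hypothetical, and the dorado invocation is left as a comment because the model directory depends on the installation:

```shell
#!/bin/bash
# Copy one input file into a job-local scratch directory and print the
# path of the copy. Fails if the copy's size differs from the source's.
stage_input() {
    local src=$1 scratch=$2 dest
    mkdir -p "$scratch" || return 1
    dest="$scratch/$(basename "$src")"
    cp "$src" "$dest" || return 1
    [ "$(wc -c < "$src")" -eq "$(wc -c < "$dest")" ] || return 1
    echo "$dest"
}

# Example usage (hypothetical paths):
# staged=$(stage_input pod5_pass/example.pod5 /tmp/dorado-input)
# dorado basecaller -x cpu <model_dir> "$(dirname "$staged")" > calls.bam
```

Checking the size after the copy is a cheap way to rule out a truncated input before blaming the basecaller.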

zhoujingyu13687306871 commented 1 year ago

1. I don't have a local aarch64 machine.
2. I copied the data to the in-memory file system /dev/shm, so I don't think I/O is the problem.

vellamike commented 1 year ago

Hi @zhoujingyu13687306871 - are you able to compile Dorado yourself on the kunpeng920 machine by any chance? This is not a problem we've encountered before. I suspect that during compilation the page size of your host would be detected and Dorado would be compiled to work with the appropriate (64KB?) page size. (Side note: this may have performance implications, though I think it will be fine.)
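
As background for the page-size guess above: jemalloc fixes the page size it supports when it is built, and its configure script accepts --with-lg-page=<lg-page>, the base-2 logarithm of the page size. Whether Dorado's build passes or exposes that option is not confirmed in this thread, so the following is only a sketch for working out the value a given host would need (lg2 is a hypothetical helper):

```shell
#!/bin/bash
# Integer base-2 logarithm of a positive power of two.
lg2() {
    local n=$1 lg=0
    while [ "$n" -gt 1 ]; do
        n=$(( n / 2 ))
        lg=$(( lg + 1 ))
    done
    echo "$lg"
}

page_bytes=$(getconf PAGESIZE)
echo "page size ${page_bytes} bytes -> lg-page $(lg2 "$page_bytes")"
```

For a 64K page (65536 bytes) this gives 16, and for a 4K page it gives 12.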

zhoujingyu13687306871 commented 1 year ago

Yes, I compiled dorado on the kunpeng920 machine, whose system page size is 64K.

vellamike commented 1 year ago

OK - this is probably because the POD5 dependency is not compiled to use 64KB page size. We are investigating a solution

zhoujingyu13687306871 commented 1 year ago

thank you very much

zhoujingyu13687306871 commented 8 months ago

@vellamike Hi, I would like to ask: over the past half year, has any newly released version of Dorado fixed this bug?

tijyojwad commented 8 months ago

Hi @zhoujingyu13687306871 - we haven't looked at fixing this yet

nanoporetech / dorado

Running a job for a long time without output (kunpeng920 CPU) #286