Open zhoujingyu13687306871 opened 1 year ago
Which version of dorado are you using? How large is your input?
It's possible dorado is collecting some metadata from the pod5s first and that's taking a while. Is your data on an external disk? Can you try running with a smaller dataset for debugging?
The dorado version is 0.3.1. The input is a single pod5 file of 1.2 GB. Is the data on an external disk? No, I copied the pod5 file to /dev/shm on the local host.
---- Replied Message ---- | From | Joyjit @.> | | Date | 07/07/2023 09:56 | | To | @.> | | Cc | @.>@.> | | Subject | Re: [nanoporetech/dorado] Running a job for a long time without output (Issue #286) |
Setup looks good to me.
I did some digging online about the "jemalloc: Unsupported page size" issue, and there are some reports of incompatibility with aarch64 processors. I'm not sure yet whether that's the same problem you're seeing.
Can you also try to run with -x cpu? This will force basecalling on the CPU (which would be very slow), but we can check whether it's making any progress. If it doesn't, then at least it's not a CUDA issue.
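One quick way to check the page-size hypothesis on the affected node is to read the kernel page size directly. The following is a minimal Python sketch, not part of dorado; the 4 KiB comparison value is an assumption about what prebuilt binaries commonly expect, and the helper names are hypothetical.

```python
import os

# Assumption: prebuilt binaries often bundle an allocator configured for 4 KiB pages.
COMMON_PAGE_SIZE = 4096


def system_page_size() -> int:
    """Return the kernel page size in bytes (POSIX systems)."""
    return os.sysconf("SC_PAGE_SIZE")


def describe_page_size(size: int) -> str:
    """Describe a page size relative to the assumed 4 KiB default."""
    kib = size // 1024
    if size == COMMON_PAGE_SIZE:
        return f"{kib} KiB pages (the common default)"
    return f"{kib} KiB pages (differs from the common 4 KiB default)"


if __name__ == "__main__":
    print(describe_page_size(system_page_size()))
```

Running this inside the same job as dorado shows which page size the process actually sees on that node.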
Yes, I found the "jemalloc: Unsupported page size" issue online, so I added export MALLOC_CONF=lg_dirty_mult:-1 to my script, but it doesn't work.
I will try to run with -x cpu, but the node's resources are currently exhausted, so please wait a moment.
Dear author, I would like to ask: is it possible to add a version for aarch64 systems with a 64K system page size to the dorado binary and source code distributions? That might completely solve the "jemalloc: Unsupported page size" issue.
I added '-x cpu' to the script. After the script had run for 1 hour, there was still no useful output, as follows:
cat slurm-33744.out
cuda-11.7 loaded successful
gcc-11.3.0 loaded successful
<jemalloc>: Unsupported system page size
[2023-07-08 22:09:09.148] [debug] - matching modification model found: dna_r10.4.1_e8.2_400bps_sup@v4.1.0_5mCG_5hmCG@v2
[2023-07-08 22:09:09.149] [info] > Creating basecall pipeline
[2023-07-08 22:09:09.164] [debug] - CPU calling: set batch size to 128, num_runners to 128
and there is no CPU utilization:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2520386 scx6299 20 0 231104 13824 5376 R 0.7 0.0 0:03.00 top
2490194 scx6299 20 0 215040 4416 3072 S 0.0 0.0 0:00.00 slurm_script
2503034 scx6299 20 0 290240 61952 3648 S 0.0 0.0 0:00.01 sshd
2503035 scx6299 20 0 228416 14272 5696 S 0.0 0.0 0:00.02 bash
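To check whether a particular process is sleeping (S) or running (R) without watching top, the state can be read from /proc on Linux. This is a minimal sketch under that assumption; the function names are hypothetical and the PID has to be supplied by the caller.

```python
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Extract the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid: int) -> str:
    """Read the current state of a process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/stat").read_text())
```

Polling process_state for the basecaller's PID every few seconds would show whether it ever leaves the S state.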
Hmm, I'm not sure what adding a 64K page size version would entail, tbh. It feels more like something jemalloc would have to support rather than something we can add in dorado.
The fact that it's not making any progress with CPU either makes me think of I/O issues. Have you tried to run dorado (same binary) in any other environment? I can suggest the following:
1. Run dorado on a local machine instead of the cluster, with the data local as well.
2. Copy the data to /tmp in your HPC job first, and then run dorado on the copied data.
1. I have no local aarch64 machine. 2. I copied the data to the in-memory file system /dev/shm, so I don't think I/O is the problem.
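The suggestion to copy the data to node-local storage first and then run on the copy can be sketched as follows. This is an illustrative Python helper, not part of dorado: run_on_local_copy and its parameters are hypothetical, the command is passed in by the caller rather than hard-coded, and the timeout is an assumption added so that a stalled run fails loudly instead of hanging.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_on_local_copy(src, command, scratch_dir=None, timeout_s=3600.0):
    """Copy src into a fresh scratch directory, then run command + [copied path].

    scratch_dir=None uses the system default temp location (often /tmp).
    Raises subprocess.TimeoutExpired if the command exceeds timeout_s.
    """
    with tempfile.TemporaryDirectory(dir=scratch_dir) as workdir:
        local = Path(workdir) / Path(src).name
        shutil.copy2(src, local)  # copy the input before starting the run
        return subprocess.run(
            [*command, str(local)],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )


if __name__ == "__main__":
    # Stand-in command: print the size of the copied file instead of basecalling.
    demo = [sys.executable, "-c",
            "import os, sys; print(os.path.getsize(sys.argv[1]))"]
    print(run_on_local_copy(__file__, demo).stdout.strip())
```

In a real job the stand-in command would be replaced by the actual basecalling invocation, with the copied path as its final argument.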
Hi @zhoujingyu13687306871 - are you able to compile Dorado yourself on the kunpeng920 machine by any chance? This is not a problem we've encountered before. I suspect that during compilation the page size of your host would be detected and Dorado would be compiled to work with the appropriate (64KB?) page size. (Side note: this may have performance implications, though I think it will be fine.)
Yes, I compiled dorado on the kunpeng920 machine, whose system page size is 64K.
OK - this is probably because the POD5 dependency is not compiled to use 64KB page size. We are investigating a solution
thank you very much
@vellamike Hi, I would like to ask: has any Dorado version released in the past half year fixed this bug?
Hi @zhoujingyu13687306871 - we haven't looked at fixing this yet
Dear author: I submitted a job to run on a single node of the cluster, but after a long time there is no output. The node's CPU is aarch64 architecture (model kunpeng920) and the GPU is an A100-40 PCIe. The CPU information and the script content are as follows:
After running for 1 hour, there is only debug output and no real results, as shown in the figure below: the debug output is on the left and the GPU utilization is on the right. The figure below that shows the CPU utilization; the process stays in the S state for a long time. I don't know whether this is caused by the CPU instruction set or by the system page size (<jemalloc>: Unsupported system page size). I hope to get your reply, thank you!