Closed: JTaozhang closed this issue 10 months ago.
Thank you for your interest in DeepH. Currently, support for parallel computing during both training and inference is still under development. We'll take it into account as we improve DeepH.
Dear Prof. Li,
I have tried your software on a GPU node (with 8 NVIDIA graphics cards) during the training step, and I find that I cannot get it to run on two (or more) cards. The job executes on only one card while the other stays idle (Fig. 1). I have checked all of the submission parameters, which seem fine; maybe you can take a glance at them (Fig. 2). The second problem is that when I use a single card, every epoch takes about 200 s and GPU utilization is only around 40%, which is slower than another test I ran on a 64-core CPU. It seems that running DeepH on a GPU is not faster than on CPUs.
Therefore, have you tried running DeepH on two graphics cards? If so, could you show me your submission parameters, or give some suggestions on my configuration file for running the training step on two or more cards?
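A minimal diagnostic sketch (not part of DeepH; the PyTorch calls are standard, but the script itself is only an illustration) can be run inside the same Slurm allocation to confirm which GPUs the Python process actually sees:

```python
# Diagnostic sketch (not part of DeepH): run inside the Slurm job to see
# which GPUs the Python process has been given.
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```

If `CUDA_VISIBLE_DEVICES` lists only one device, the limitation is in the Slurm request; if it lists two but only one is busy, the single-device behavior comes from the training code itself.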
One more thing: when I run DeepH on the GPU I get a warning, "/public/home/zhangtao/anaconda3/envs/ZT-py39/lib/python3.9/site-packages/deeph/kernel.py:53: UserWarning: Unable to copy scripts warnings.warn("Unable to copy scripts")". It seems the path can be replaced. I am not sure whether this has any influence on the speed of the training step.
Many thanks to you for your kind guidance.
Best regards, Tao
Fig. 1: GPU usage when two NVIDIA cards are specified for DeepH. Only GPU 0 is working; GPU 1 holds only a very small amount of memory.
Fig. 2: The Slurm submission parameters for the training step using two NVIDIA cards.
Fig. 3: The epoch time of the training step on CPUs (64 cores) and my settings.
Fig. 4: The epoch time of the training step on one GPU and my settings.
Best regards, Tao
@JTaozhang Hi, I'm a Ph.D. student, not a professor. As I mentioned above, we are currently developing support for parallel training and inference, which is not yet available in the current version of DeepH. As a result, it is not possible to train a DeepH model on two GPUs for now. In my experience, training DeepH models on an RTX 3090 GPU is faster than on a CPU (64 cores).
Unfortunately, I cannot see Figs. 1-4 that you attached. The typical training process of DeepH takes a few days. May I know how long it takes for you to train on a single GPU?
Hi Li,
I am a PhD student as well, from SSLAB. The epoch time on the CPU (64 cores) is around 70 s (I assume the unit is seconds). The epoch time on one GPU (NVIDIA Tesla V100) is around 200 s, and the system allocates 4 CPU cores to that GPU to assist the training process.
Best regards, Tao
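One possible explanation, offered here only as an assumption rather than anything confirmed in this thread, is that with just 4 CPU cores feeding the V100, data preparation and host-to-device copies could dominate the epoch time. A rough, self-contained timing sketch with a stand-in model (not DeepH's) separates copy time from GPU compute time:

```python
# Rough timing sketch (stand-in model, not DeepH): split host-to-device copy
# time from GPU forward/backward time to see where an "epoch" is spent.
import time
import torch

assert torch.cuda.is_available(), "run this on the GPU node"
device = torch.device("cuda")
model = torch.nn.Linear(128, 128).to(device)            # stand-in for the real model
batches = [torch.randn(4096, 128) for _ in range(50)]   # stand-in data held on the CPU

t_copy = t_compute = 0.0
for x_cpu in batches:
    t0 = time.perf_counter()
    x = x_cpu.to(device)
    torch.cuda.synchronize()
    t_copy += time.perf_counter() - t0

    t0 = time.perf_counter()
    loss = model(x).sum()
    loss.backward()
    torch.cuda.synchronize()
    t_compute += time.perf_counter() - t0

print(f"copy time: {t_copy:.3f} s, compute time: {t_compute:.3f} s")
```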
To the developers,
I am a beginner with DeepH, and I am wondering whether it supports parallel computing, for example using mpirun. Most users run it on multi-core systems where jobs are executed in parallel. When I run DeepH, especially in the training step, I would like the work to run on multiple cores (CPUs or GPUs). I have allocated multiple cores to the job through the Slurm job submission system, but the job still seems to run on only one CPU or GPU. So I hope, and would suggest, that you write more details on how to execute the program on multiple cores. That would make it more helpful and faster for people to use. Many thanks for your help.
best regards, Tao
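For context only: multi-GPU data parallelism in plain PyTorch is usually expressed with `torch.nn.DataParallel` or `DistributedDataParallel`. The sketch below uses a stand-in model, not DeepH's API, and, per the maintainer's replies above, DeepH's training loop does not yet support this:

```python
# Illustration only: generic PyTorch data parallelism with a stand-in model.
# DeepH does not expose this yet (see the maintainer's replies above).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # splits each input batch across visible GPUs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

x = torch.randn(256, 64, device=device)
y = model(x)                         # forward pass is replicated across the GPUs
print(y.shape)                       # torch.Size([256, 1])
```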