marcodelapierre / md-dockerfiles

My Dockerfiles
GNU General Public License v3.0

dp_train with gpu #1

Open fkxie opened 5 years ago

fkxie commented 5 years ago

Hi, I want to use dp_train with gpu acceleration. My running command is:

docker run --runtime=nvidia -v /home/software/deepmd-kit:/home/deepmd --device /dev/nvidia1:/dev/nvidia1 \
deepmd/deepmd-kit_gpu  \
dp_train /home/deepmd/examples/train/water.json 

But the output contains the line

# DEEPMD: gpu per node: None

So maybe I am not using GPU acceleration correctly. Is there something wrong? Please correct me.

Thanks.

F.K.xie

marcodelapierre commented 5 years ago

Hi, have you got the NVIDIA drivers installed on your host machine, with a version high enough to be compatible with CUDA 9.0? Do you see the GPU if you run this test command? docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
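For reference, the check spelled out as commands; the 10.0-base variant is an assumption, for hosts whose driver targets CUDA 10.0:

# Check that the NVIDIA runtime exposes the GPUs to a plain CUDA container
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

# Same check with a CUDA 10.0 base image (assuming that tag is available)
docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi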

fkxie commented 5 years ago

Hi, thanks for your reply. I have installed the NVIDIA driver and CUDA 10.0. ‘docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi’ does not work,

but when I try ‘docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi’, it shows:

NVIDIA-SMI 410.48       Driver Version: 410.48
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   48C    P0    44W / 250W |  11174MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   43C    P0    78W / 250W |  11650MiB / 32480MiB |     42%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   45C    P0    42W / 250W |  10802MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I also set ‘with_distrib’ to true, but it made no difference.

BTW, it takes about 2.5 seconds per 100 steps when testing the example.

So, my real question is: how can I tell whether dp_train is being accelerated by the GPU? No information about the GPU is printed while training, but when I run LAMMPS in this Docker container, I can see output about GPU acceleration.
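(One way to answer this, as a sketch assuming python and TensorFlow are on the container's PATH, is to ask TensorFlow directly which devices it can see; with GPU support working, the list should include a /device:GPU:0 entry.)

# Sketch: list the devices TensorFlow sees from inside the container
docker run --runtime=nvidia --rm deepmd/deepmd-kit_gpu \
    python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"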

F.K.xie

marcodelapierre commented 5 years ago

When I run dp_train with the deepmd GPU container on a Pascal server, I see either

# DEEPMD: gpu per node: [0]

or

# DEEPMD: gpu per node: [0, 1, 2, 3]

depending on whether I am using 1 or 4 of the server's GPUs, so your dp_train output seems to suggest you're not seeing the GPUs.
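One thing worth double-checking (a sketch, assuming nvidia-docker2; the mount path and input file are carried over from your first command) is whether the output changes when the GPU is exposed through the NVIDIA_VISIBLE_DEVICES environment variable instead of --device:

# Sketch: expose GPU 1 to the container via the environment variable
docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=1 \
    -v /home/software/deepmd-kit:/home/deepmd \
    deepmd/deepmd-kit_gpu \
    dp_train /home/deepmd/examples/train/water.json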

Can you try using the container that I built from the Dockerfile in this repo, marcodelapierre/deepmd-gpu:0.12.4_tf1.8_lmp_yz, and let me know how it goes?
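Put together with your original command line (the mount path and input file are assumptions carried over from your first post), the test could look like:

docker run --runtime=nvidia --rm \
    -v /home/software/deepmd-kit:/home/deepmd \
    marcodelapierre/deepmd-gpu:0.12.4_tf1.8_lmp_yz \
    dp_train /home/deepmd/examples/train/water.json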

fkxie commented 5 years ago

OK, I'll try it.

fkxie commented 5 years ago

Hi, marcodelapierre/deepmd-gpu:0.12.4_tf1.8_lmp_yz is exactly the image I am using now.

The problem I mentioned above is still there: gpu per node: None.

But when I run dp_train and dp_frz, I can see some information about the GPU dumped.