marcodelapierre / md-dockerfiles

My Dockerfiles
GNU General Public License v3.0

dp_train with gpu #1

Open fkxie opened 5 years ago

fkxie commented 5 years ago

Hi, I want to use dp_train with gpu acceleration. My running command is:

docker run --runtime=nvidia -v /home/software/deepmd-kit:/home/deepmd --device /dev/nvidia1:/dev/nvidia1 \
deepmd/deepmd-kit_gpu  \
dp_train /home/deepmd/examples/train/water.json 

But the output contains the line

# DEEPMD: gpu per node: None

So maybe I am not using GPU acceleration correctly. Is there something wrong? Please correct me.

Thanks.

F.K.xie

marcodelapierre commented 5 years ago

Hi, have you got the NVIDIA drivers installed on your host machine, with a version high enough to be compatible with CUDA 9.0? Do you see the GPU if you run this test command? docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
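For reference, the check spelled out as commands; the 10.0-base variant is an assumption, for hosts whose driver targets CUDA 10.0:

# Check that the NVIDIA runtime exposes the GPUs to a plain CUDA container
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

# Same check with a CUDA 10.0 base image (assuming that tag is available)
docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi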

fkxie commented 5 years ago

Hi, thanks for your reply. I have installed the NVIDIA driver and CUDA 10.0. ‘docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi’ does not work,

but when I try ‘docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi’, it shows:

NVIDIA-SMI 410.48       Driver Version: 410.48
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   48C    P0    44W / 250W |  11174MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:1C:00.0 Off |                    0 |
| N/A   43C    P0    78W / 250W |  11650MiB / 32480MiB |     42%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:1D:00.0 Off |                    0 |
| N/A   45C    P0    42W / 250W |  10802MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I also set ‘with_distrib’ to true, but it made no difference.

BTW, it takes about 2.5 seconds per 100 steps when testing the example.

So, my real question is: how can I tell whether dp_train is being accelerated by the GPU? No information about the GPU is printed while training, but when I run LAMMPS in this Docker container, I can see output about GPU acceleration.
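(One way to answer this, as a sketch assuming python and TensorFlow are on the container's PATH, is to ask TensorFlow directly which devices it can see; with GPU support working, the list should include a /device:GPU:0 entry.)

# Sketch: list the devices TensorFlow sees from inside the container
docker run --runtime=nvidia --rm deepmd/deepmd-kit_gpu \
    python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"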

F.K.xie

marcodelapierre commented 5 years ago

When I run dp_train with the deepmd GPU container on a Pascal server, I see either

# DEEPMD: gpu per node: [0]

or

# DEEPMD: gpu per node: [0, 1, 2, 3]

depending on whether I am using 1 or 4 of the server's GPUs, so your dp_train output seems to suggest you're not seeing the GPUs.
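One thing worth double-checking (a sketch, assuming nvidia-docker2; the mount path and input file are carried over from your first command) is whether the output changes when the GPU is exposed through the NVIDIA_VISIBLE_DEVICES environment variable instead of --device:

# Sketch: expose GPU 1 to the container via the environment variable
docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=1 \
    -v /home/software/deepmd-kit:/home/deepmd \
    deepmd/deepmd-kit_gpu \
    dp_train /home/deepmd/examples/train/water.json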

Can you try using the container that I built from the Dockerfile in this repo, marcodelapierre/deepmd-gpu:0.12.4_tf1.8_lmp_yz, and let me know how it goes?
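Put together with your original command line (the mount path and input file are assumptions carried over from your first post), the test could look like:

docker run --runtime=nvidia --rm \
    -v /home/software/deepmd-kit:/home/deepmd \
    marcodelapierre/deepmd-gpu:0.12.4_tf1.8_lmp_yz \
    dp_train /home/deepmd/examples/train/water.json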

fkxie commented 5 years ago

OK, I'll try it.

fkxie commented 5 years ago

Hi, marcodelapierre/deepmd-gpu:0.12.4_tf1.8_lmp_yz is exactly the image I am using now.

The problem I mentioned above is still there: gpu per node: None.

But when I run dp_train and dp_frz, I can see some information about the GPU dumped.