ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

gpu utilization rate 0 or near 0 #4658

Closed: yutao-li closed this issue 5 years ago

yutao-li commented 5 years ago

System information

Describe the problem

I use the DDPG algorithm in RLlib to train on the Pendulum env (see the attached screenshots: Screenshot from 2019-04-18 12-49-21 and Screenshot from 2019-04-18 12-45-38),

and I find that the GPU utilization rate is near 0 while CPU usage is 100%. Why is that? I am just testing GPU usage with this toy example; what I am actually trying to do is train DDPG on an offline dataset, and in that case GPU utilization is 0 even though the GPU is allocated to the process.

Source code / logs

The code is as follows:

import os
import ray.rllib.agents.ddpg as ddpg
import ray
from ray.tune.logger import pretty_print
ray.init(num_gpus=1, temp_dir='/tmp/yutao')
config = ddpg.DEFAULT_CONFIG.copy()
config.update({
    'num_workers': 1,
    "input_evaluation": [],
    'num_cpus_per_worker': 6,
    'num_gpus_per_worker': 1,
    'num_gpus': 1,
    'exploration_final_eps': 0,
    'exploration_fraction': 0
})
agent = ddpg.DDPGAgent(config=config, env="Pendulum-v0")
for i in range(10000):
    result = agent.train()
    print(pretty_print(result))

    if i % 200 == 0:
        checkpoint = agent.save(os.getcwd() + '/checkpoint/')
        print("checkpoint saved at", checkpoint)

Below is the output during training; I interrupted it early given the training time. log.txt

Please let me know if any other information is needed. Since I haven't posted an issue before, I'm not sure what's required. Thanks!

ericl commented 5 years ago

The default batch size and model are pretty small, so I wouldn't be surprised if this is the cause. You can adjust this with 'train_batch_size'.
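For example, something like the following (just a sketch built on the config from the original post; the batch size value here is an arbitrary illustration):

import ray
import ray.rllib.agents.ddpg as ddpg

ray.init(num_gpus=1)

config = ddpg.DEFAULT_CONFIG.copy()
config.update({
    'num_gpus': 1,
    # larger batches give the GPU more work per optimization step
    'train_batch_size': 4096,
})
agent = ddpg.DDPGAgent(config=config, env="Pendulum-v0")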

yutao-li commented 5 years ago

Hi ericl, thank you for your reply. I have set "train_batch_size": 100000, but the utilization is still the same (see the attached screenshots: Screenshot from 2019-04-18 17-52-29 and Screenshot from 2019-04-18 17-53-59).

There seems to be no change.

yutao-li commented 5 years ago

By the way, does it make a difference that I train only with an offline dataset? At least in my case, neither setup is utilizing the GPU.

ericl commented 5 years ago

I tried this out myself, and it looks like it is because most CPU time is spent in the replay buffer. This is expected when you have a small network for the policy so the GPU isn't very loaded.

To get higher throughput, you can also try using APEX_DDPG which moves the replay buffer off the critical path of the learner.
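For example, a quick way to try it from the command line (just a sketch; the resource values are placeholders to adjust for your machine):

rllib train --run APEX_DDPG --env Pendulum-v0 --config '{"num_gpus": 1, "num_workers": 8}'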

yutao-li commented 5 years ago

Thanks. Meanwhile, will APEX_DDPG accelerate training when I only use an offline dataset, theoretically? I am not familiar with APEX_DDPG and am not sure whether it only accelerates the sampling stage.

ericl commented 5 years ago

Hm, it depends on how much data you have. If there is a lot, accelerating sampling will also increase GPU utilization.
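For reference, pointing RLlib at an offline dataset looks roughly like this (a sketch; the path is a placeholder for a directory of saved JSON batch files):

import ray.rllib.agents.ddpg as ddpg

config = ddpg.DEFAULT_CONFIG.copy()
config.update({
    # read experiences from saved JSON batch files instead of sampling the env
    "input": "/path/to/offline/batches",
    # disable off-policy estimation of the offline data, as in your original config
    "input_evaluation": [],
})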


yutao-li commented 5 years ago

Thank you. I tried this code (in another Kubernetes container, which has more memory), but the GPU utilization is still near 0. By the way, I updated the version to 0.6.6 (see the attached screenshot: Screenshot from 2019-04-19 23-47-52).

import ray
import ray.rllib.agents.ddpg.apex as ddpg
from ray.tune.logger import pretty_print

ray.init(num_gpus=2)

config = ddpg.APEX_DDPG_DEFAULT_CONFIG.copy()
config.update({
    'num_workers': 48,
    "input_evaluation": [],
    'num_cpus_per_worker': 1,
    # 'num_gpus_per_worker': 1,
    'num_gpus': 2,
    'exploration_final_eps': 0,
    'exploration_fraction': 0,
    "train_batch_size": 10000
})
agent = ddpg.ApexDDPGTrainer(config=config, env="Pendulum-v0")
for i in range(10000):
    result = agent.train()
    print(pretty_print(result))

My CPU info is:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
Stepping:              4
CPU MHz:               3599.998
CPU max MHz:           3700.0000
CPU min MHz:           1200.0000
BogoMIPS:              6006.95
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke

Is it normal to have such low utilization? I see that the rate mostly stays at 0% and sometimes increases up to 3%. But if I run rllib train --run DQN --env CartPole-v0 --resources-per-trial '{"cpu": 48, "gpu": 2}' --config '{"num_gpus": 2}' on the command line, it has higher utilization (see the attached screenshot: Screenshot from 2019-04-19 23-58-26).

I don't know why APEX_DDPG seems to utilize the GPU poorly, but if that is normal, that's fine.

ericl commented 5 years ago

I tried DQN vs. DDPG myself, and I get similar utilization when I set the batching hyperparameters for DDPG to be the same as for DQN:

rllib train --config='{"learning_starts": 0, "num_gpus": 1, "sample_batch_size": 4, "train_batch_size": 32}' --run=DDPG --env=Pendulum-v0

This gives me 15% GPU utilization, the same as DQN. So I think this is just a matter of hyperparameter settings.
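For completeness, roughly the same run expressed through the Python API used earlier in this thread (a sketch mirroring the CLI flags above):

import ray
import ray.rllib.agents.ddpg as ddpg
from ray.tune.logger import pretty_print

ray.init(num_gpus=1)

config = ddpg.DEFAULT_CONFIG.copy()
config.update({
    "learning_starts": 0,     # start optimizing immediately
    "num_gpus": 1,
    "sample_batch_size": 4,   # values taken from the CLI command above
    "train_batch_size": 32,
})
agent = ddpg.DDPGAgent(config=config, env="Pendulum-v0")
for i in range(10):
    print(pretty_print(agent.train()))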