hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

ppo2 performance and gpu utilization #308

Closed hn2 closed 5 years ago

hn2 commented 5 years ago

I am running a PPO2 model. I see high CPU utilization and low GPU utilization.

When running:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

I get:

Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
2019-05-06 11:06:02.117760: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-05-06 11:06:02.341488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1660 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.92GiB
2019-05-06 11:06:02.348112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-06 11:06:02.838521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-06 11:06:02.842724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-05-06 11:06:02.845154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-05-06 11:06:02.848092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 4641 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8905916217148098349
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 4866611609
locality {
  bus_id: 1
  links {
  }
}
incarnation: 7192145949653879362
physical_device_desc: "device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5"
]

I understand that TensorFlow is "seeing" my GPU. Why is utilization so low when training a stable-baselines model?

# multiprocess environment
n_cpu = 4
env = PortfolioEnv(total_steps=settings['total_steps'], window_length=settings['window_length'], allow_short=settings['allow_short'] )
env = SubprocVecEnv([lambda: env for i in range(n_cpu)])

if settings['policy'] == 'MlpPolicy':
    model = PPO2(MlpPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
elif settings['policy'] == 'MlpLstmPolicy': 
    model = PPO2(MlpLstmPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
elif settings['policy'] == 'MlpLnLstmPolicy': 
    model = PPO2(MlpLnLstmPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])

model.learn(total_timesteps=settings['total_timesteps'])

model_name = str(settings['model_name']) + '_' + str(settings['policy']) + '_' + str(settings['total_timesteps']) + '_' + str(settings['total_steps']) + '_' + str(settings['window_length']) + '_' + str(settings['allow_short'])  
model.save(model_name)
hn2 commented 5 years ago

I see about 50% CPU utilization with a Core i7 CPU and <= 10% for the GPU.

hill-a commented 5 years ago

My guess is that your environment is too simple. This can cause the GPU and CPU to wait for each other: the CPU tries to run the environment with high multiprocessing overhead (compared to the actual load), and then has to wait out the GPU latency for the given batch size.

You are also using a very powerful GPU for a very simple task, hence the 10% load on the GPU.

Just as a side note, what CPU are you using exactly? I'm surprised to see a high-power GPU combined with a 4-threaded i7; are you sure it's not 8 threads?

EDIT: checking Intel ARK for 4-threaded desktop CPUs, none of them are i7s, and when switching to laptop parts they are all low-power CPUs for ultrabooks. n_cpu is the number of CPU threads, not the number of CPU cores.
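As a quick way to see where the time goes, you could time raw environment stepping on its own (a minimal sketch, assuming a single `PortfolioEnv` instance named `env` built as in your script, not the `SubprocVecEnv` wrapper); if this number is low, the CPU-side simulation is the bottleneck rather than the GPU:

import time

# `env` is assumed to be one PortfolioEnv instance (not the SubprocVecEnv wrapper)
env.reset()
n_steps = 10000
start = time.time()
for _ in range(n_steps):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        env.reset()
print("raw env steps per second:", n_steps / (time.time() - start))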

hn2 commented 5 years ago

This is my pc configuration: https://www.userbenchmark.com/UserRun/16739440

Also, I tested my portfolio env several times with different instruments and parameters, and the reward never exceeded -0.5, which is weird.

hill-a commented 5 years ago

i7-8700, 6 cores / 12 threads. Try again with n_cpu = 12.

As for the reward, it's possible the methods do not work with your problem. This is still machine learning, and there are no magic bullets, unfortunately.

hn2 commented 5 years ago

Yes, I tried that with n_cpu = 12. I still see 12 processes spawned in the Task Manager, with only one using the GPU at very low utilization (~2%). All other processes don't use the GPU at all (0%). As for the reward, the original implementation on GitHub works and is profitable. It doesn't make sense that out of millions of simulation runs not even one is profitable.

hill-a commented 5 years ago

I still see 12 processes spawned in the Task Manager, with only one using the GPU at very low utilization (~2%). All other processes don't use the GPU at all.

That's normal: after the environments finish their steps, the worker processes send the data to the master process, which then runs it through the neural network. So only one process is using the GPU, and the rest are simulating your environment on the CPUs. The goal of multi-CPU environments is to reduce the time spent simulating the environment and run more steps per second to feed the GPU. If the CPUs cannot simulate any faster (whether from a lack of computing power or Amdahl's law), then the GPU will inevitably be slowed down.
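As a side note, if the environment is cheap to simulate, `DummyVecEnv` (which runs all environments sequentially in the main process) can even beat `SubprocVecEnv`, because it avoids the inter-process overhead entirely. A minimal sketch, reusing the `PortfolioEnv` construction and `settings` dict from your script:

from stable_baselines.common.vec_env import DummyVecEnv, SubprocVecEnv

n_envs = 12
make_env = lambda: PortfolioEnv(total_steps=settings['total_steps'],
                                window_length=settings['window_length'],
                                allow_short=settings['allow_short'])

# Single process, no inter-process communication -- often faster for lightweight envs:
env = DummyVecEnv([make_env for _ in range(n_envs)])

# Multi-process -- pays off when each env step is expensive:
# env = SubprocVecEnv([make_env for _ in range(n_envs)])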

As for the reward, the original implementation on GitHub works and is profitable. It doesn't make sense that out of millions of simulation runs not even one is profitable.

Can you show a benchmark with a specific method I could compare to, so I can check whether this is an implementation issue for a given method?

hn2 commented 5 years ago

Does this mean that I wasted money on the GPU? I cannot use it to accelerate training?

hill-a commented 5 years ago

Does this mean that I wasted money on the GPU? I cannot use it to accelerate training?

You can throw a bigger network at your problem (by default it is 2 layers of 64); that will use more GPU power and might help your convergence.

from the documentation:

from stable_baselines.common.policies import FeedForwardPolicy

# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[128, 128, 128],
                                                          vf=[128, 128, 128])],
                                           feature_extraction="mlp")

model = PPO2(CustomPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
hn2 commented 5 years ago

What is pi and what is vf? What if I want a custom MlpLnLstmPolicy?

hill-a commented 5 years ago

On the documentation page, it says this:

The LstmPolicy can be used to construct recurrent policies in a similar way:

class CustomLSTMPolicy(LstmPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=64, reuse=False, **_kwargs):
        super().__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm, reuse,
                         net_arch=[8, 'lstm', dict(vf=[5, 10], pi=[10])],
                         layer_norm=True, feature_extraction="mlp", **_kwargs)

so:

from stable_baselines.common.policies import LstmPolicy

# Custom LSTM policy: one shared layer of size 8, an LSTM, then three layers of size 128 each for pi and vf
class CustomPolicy(LstmPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[8, 
                                                     'lstm', 
                                                     dict(pi=[128, 128, 128],
                                                          vf=[128, 128, 128])],
                                           layer_norm=True, feature_extraction="mlp")

model = PPO2(CustomPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])

What is pi and what is vf?

pi is the policy network, vf is the value function (there is a really good write-up out there if you want to know more about actor-critic models).
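To make the split concrete, `net_arch` first lists layers shared by both, then a dict with the separate `pi` and `vf` heads (a minimal sketch, not verbatim from the docs):

# Two shared layers of 64 units, then separate heads:
# the policy (pi) and the value function (vf) each get one layer of 32.
net_arch = [64, 64, dict(pi=[32], vf=[32])]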

hn2 commented 5 years ago

I am now using a custom policy but GPU utilization is still very low (< 5%):

# Custom LSTM policy with five dense layers for both pi and vf
class CustomPolicy(LstmPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                        net_arch=[8, 
                                                    'lstm', 
                                                    dict(pi=[2048, 1024, 512, 256, 128],
                                                         vf=[2048, 1024, 512, 256, 128])],
                                        layer_norm=True, feature_extraction="mlp")

model = PPO2(CustomPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
hn2 commented 5 years ago

Another question: once I have the model trained, how do I use it? Do I create an observation and call predict? Do I have to step the env?

n_cpu = 12
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'])
env = SubprocVecEnv([lambda: env for i in range(n_cpu)])

model_name = str(settings['model_name']) + '_' + str(settings['policy']) + '_' + str(settings['total_timesteps']) + '_' + str(settings['total_steps']) + '_' + str(settings['window_length']) + '_' + str(settings['allow_short'])  
model = PPO2.load(model_name)

obs = env.reset()
action, _states = model.predict(obs)
obs, rewards, dones, info = env.step(action)
hill-a commented 5 years ago

I am now using a custom policy but GPU utilization is still very low (< 5%)

How is your CPU doing? At least one of the two will be a bottleneck, and it's not surprising that it would be the CPU. Just for reference, OpenAI used massive networks on 128,000 CPUs feeding 256 GPUs for OpenAI Five. An MLP on 16 threads will have trouble saturating a GPU. You can benchmark with timing code; most likely you still get a non-negligible speed-up from the GPU.
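For a rough number, one way is to time a short learn() call once with the GPU hidden and once with it visible (a minimal sketch, assuming the same vectorized `env` and MlpPolicy as in your script; CUDA_VISIBLE_DEVICES must be set before TensorFlow is imported):

import os
import time

# Uncomment to force a CPU-only run for comparison
# (this must happen before TensorFlow / stable-baselines are imported):
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

model = PPO2(MlpPolicy, env, verbose=0)
start = time.time()
model.learn(total_timesteps=50000)
print("seconds for 50k timesteps:", time.time() - start)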

Another question: once I have the model trained, how do I use it? Do I create an observation and call predict? Do I have to step the env?

When the model is trained you can simply give it the observations you wish to use. However, if you are using recurrent networks, you need to pass the state to the predict function:

states = model.initial_state  # get the initial state vector for the recurrent network
dones = np.zeros(states.shape[0])  # set all environments to not done

...

# in your loop
action, _values, states, _neglog = model.predict(obs, states, dones) 
# where obs is the observation you want to use the model on in production

...
hn2 commented 5 years ago

I am not sure that I understand what state is. This is my code, how do I construct the obs and state?

### Quantiacs RL
# import necessary Packages below:
import numpy as np
from quantiacsToolbox.quantiacsToolbox import runts
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import PPO2
from portfolio import PortfolioEnv

def myTradingSystem(DATE, OPEN, HIGH, LOW, CLOSE, VOL, exposure, equity, settings):
    ''' This system uses trend following techniques to allocate capital into the desired equities'''

    nMarkets = CLOSE.shape[1]
    pos = np.zeros(nMarkets)

    instruments = []
    history = np.empty(shape=(len(settings["markets"]), len(OPEN), 5), dtype=np.float)

    instruments = settings["markets"]
    for m in range(len(instruments)):     
        for d in range(len(OPEN)):
            history[m][d] = np.array([OPEN[d,m], HIGH[d,m], LOW[d,m], CLOSE[d,m], VOL[d,m]])

    # write_to_h5py(history, instruments, 'datasets/' + settings['model_name'] + '.h5')

    # multiprocess environment
    n_cpu = 12
    env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'])
    env = SubprocVecEnv([lambda: env for i in range(n_cpu)])

    print(settings['model_filename'])
    model = PPO2.load(settings['model_filename'])

    obs = env.reset()
    action, _states = model.predict(obs)
    '''
    while True:
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
    '''
    #   weights = pos/np.nansum(abs(pos))

    weights = action
    return weights, settings

def mySettings():
    ''' Define your trading system settings here '''

    settings = {}

    settings['markets'] = ['CASH', 'F_AD', 'F_BP', 'F_CD', 'F_EC', 'F_JY','F_SF', 'F_ND'] 

    settings['lookback'] = 2300 
    settings['budget'] = 10**6
    settings['slippage'] = 0.05
    settings['endInSample'] = '20150101'
    settings['beginInSample'] = '20050101'

    model = 'currencies'

    settings['steps'] = 2000
    settings['window_length'] = 3
    settings['allow_short'] = False 
    settings['total_timesteps'] = 10000000     #   100000000
    settings['model_name'] = model  + '_' + settings['beginInSample'] + '_' + settings['endInSample']
    settings['model_filename'] = model  + '_' + settings['beginInSample'] + '_' + settings['endInSample'] + '_' + str(settings['total_timesteps']) + '_' + str(settings['steps']) + '_' + str(settings['window_length'])   
    #   tensorboard --logdir=tensorboard   tensorboard --logdir=src

    return settings

# Evaluate trading system defined in current file.
if __name__ == '__main__':
    results = runts(__file__)
    #optimize(__file__)
hill-a commented 5 years ago

I am not sure that I understand what state is.

In your case, the state is the LSTM internal state (denoted h_t and c_t)

(LSTM cell diagram image omitted; it illustrates the h_t and c_t states of an LSTM cell)

This is my code, how do I construct the obs and state?

I already showed you how to construct the initial state:

# initialized here
states = model.initial_state  # get the initial state vector for the recurrent network
dones = np.zeros(states.shape[0])  # set all environments to not done

# updated here
action, _values, states, _neglog = model.predict(obs, states, dones) 

As for the observation, I don't know; this is not my code and I don't understand its usage or purpose. It should be a numpy array with the same shape as the environment's observation space.
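For intuition, stable-baselines keeps the recurrent state as a single array of shape (n_envs, 2 * n_lstm), holding the cell state and hidden state for every parallel environment (a minimal illustration, assuming the default n_lstm=256):

import numpy as np

n_envs, n_lstm = 12, 256   # 12 parallel envs, default LSTM size in stable-baselines
states = np.zeros((n_envs, 2 * n_lstm), dtype=np.float32)  # [c_t | h_t] per env -> shape (12, 512)
dones = np.zeros((n_envs,))                                # one "episode done" flag per env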

hn2 commented 5 years ago

OK, got it. Hopefully almost there. One more problem with model = PPO2.load(settings['model_filename']). The file is there. I also tried with admin privileges but it doesn't work.

<class 'PermissionError'>
Traceback (most recent call last):
  File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
    position, settings = TSobject.myTradingSystem(*argList)
  File "ppo2_quantiacs_test2.py", line 33, in myTradingSystem
    model = PPO2.load(settings['model_filename'])
  File "c:\users\hanna\stable-baselines\stable_baselines\common\base_class.py", line 550, in load
    data, params = cls._load_from_file(load_path)
  File "c:\users\hanna\stable-baselines\stable_baselines\common\base_class.py", line 361, in _load_from_file
    with open(load_path, "rb") as file:
PermissionError: [Errno 13] Permission denied: 'currencies_20050101_20150101_10000000_2000_3'
hill-a commented 5 years ago

PermissionError: [Errno 13] Permission denied: 'currencies_20050101_20150101_10000000_2000_3'

That's a directory, no?

hn2 commented 5 years ago

Hmm... I have both a tensorboard log directory with that name and a .pkl file with the same name.

hn2 commented 5 years ago

OK, directory renamed. Now I get:

Traceback (most recent call last):
  File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
    position, settings = TSobject.myTradingSystem(*argList)
  File "ppo2_quantiacs_test2.py", line 41, in myTradingSystem
    action, _values, states, _neglog = model.predict(obs, states, dones)
ValueError: not enough values to unpack (expected 4, got 2)
hn2 commented 5 years ago

Also, why is it refusing to accept feature_extraction='cnn'?

hn2 commented 5 years ago
print(np.shape(obs))
print(np.shape(states))
print(np.shape(dones))

print(obs)
print(states)
print(dones)
(12, 120)
(12, 512)
(12,)

[[1.000000e+00 1.000000e+00 1.000000e+00 ... 1.057500e+05 1.061250e+05
  1.555900e+04]
 [1.000000e+00 1.000000e+00 1.000000e+00 ... 9.853750e+04 9.980000e+04
  4.457200e+04]
 [1.000000e+00 1.000000e+00 1.000000e+00 ... 1.042000e+05 1.044875e+05
  1.994300e+04]
 ...
 [1.000000e+00 1.000000e+00 1.000000e+00 ... 9.571250e+04 9.615000e+04
  2.808500e+04]
 [1.000000e+00 1.000000e+00 1.000000e+00 ... 9.853750e+04 9.980000e+04
  4.457200e+04]
 [1.000000e+00 1.000000e+00 1.000000e+00 ... 1.054250e+05 1.057750e+05
  1.149000e+04]]

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
obs = env.reset()
states = model.initial_state  # get the initial state vector for the recurrent network
dones = np.zeros(states.shape[0])  # set all environments to not done

print(np.shape(obs))
print(np.shape(states))
print(np.shape(dones))

print(obs)
print(states)
print(dones)

# updated here
# action, _values, states, _neglog = model.predict(obs, states, dones)
action, _states = model.predict(obs, states, dones)

print(action)
[[nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]]
hn2 commented 5 years ago
ValueError: could not broadcast input array from shape (12,8) into shape (8)

Why is it all nan, and where do the 12 rows come from?

hn2 commented 5 years ago

Can anyone help why predict doesn't work?

hn2 commented 5 years ago

Anyone?

hn2 commented 5 years ago

As for the GPU utilization problem, I think the Windows performance monitor doesn't show the correct utilization. I tried GPU-Z and it shows 30-60% GPU load.

hn2 commented 5 years ago

I figured out that the 12 rows in the action come from the number of CPUs. When I change to n_cpu = 1 I get: ValueError: Cannot feed value of shape (1, 120) for Tensor 'input/Ob:0', which has shape '(12, 120)'. How do I predict then? How do I combine the results from the multiprocess env into one action?

op1490 commented 5 years ago

I am also struggling with this - anyone have any ideas?

troychen728 commented 5 years ago

I figured out that the 12 rows in the action come from the number of CPUs. When I change to n_cpu = 1 I get: ValueError: Cannot feed value of shape (1, 120) for Tensor 'input/Ob:0', which has shape '(12, 120)'. How do I predict then? How do I combine the results from the multiprocess env into one action?

I am struggling with this too. In my case I just created 12 parallel test environments, and the result I get has dimension 12. I just flattened them and treated them as 12 individual test points? I am not sure. I would appreciate it a lot if someone could shed light on this one.

araffin commented 5 years ago

@op1490 @troychen728 For predicting with only one env, you can find a solution here: https://github.com/hill-a/stable-baselines/issues/166#issuecomment-502350843
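The gist of that workaround, as a minimal sketch (assumptions: the recurrent model was trained with 12 parallel environments, and `single_obs` is a hypothetical observation from your one real environment):

import numpy as np

n_training_envs = 12                   # must match the number of envs used during training
states = model.initial_state           # initial recurrent state, shape (n_training_envs, 2 * n_lstm)
dones = np.zeros((n_training_envs,))   # no environment is done yet

# Recurrent policies expect a full batch of observations, so tile the single
# observation across the training batch size and keep only the first action.
batch_obs = np.tile(single_obs, (n_training_envs, 1))
actions, states = model.predict(batch_obs, state=states, mask=dones)
action = actions[0]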

dbsxdbsx commented 5 years ago

Hi, I am new here, and I still don't understand how to use the GPU with TensorFlow within stable-baselines. Is the GPU used automatically when tensorflow-gpu is installed correctly?

Miffyli commented 5 years ago

@dbsxdbsx Yes, if you have tensorflow-gpu installed, then most of the stable-baselines algorithms will use the GPU.
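A quick way to verify that TensorFlow actually picked the GPU up (a small sketch for the TF 1.x versions stable-baselines supports):

import tensorflow as tf

print(tf.test.is_gpu_available())   # True when a usable CUDA device is visible
print(tf.test.gpu_device_name())    # e.g. "/device:GPU:0", or "" if none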

dbsxdbsx commented 5 years ago

Thanks.