Closed aGiant closed 5 years ago
Hey, this is definitely not expected. Can you reproduce this with https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/cartpole_lstm.py (this is the LSTM integration test and toy example).
Edit: 0.6.0 has an unfortunate bug in action clipping that clips action values sent to the learner as well and not just the environment (see the release notes). It looks like fixing this is the cause of the change. In 0.6.1 we won't auto clip actions returned by compute_action(). However you can clip manually with np.clip() which should give the same result as 0.6.0.
Does this diagnosis seem right?
Update: actually not sure if clipping is the cause here, would be good to see if the cartpole lstm example is affected.
And is this the error you mean? InvalidArgumentError: You must feed a value for placeholder tensor 'default/prev_reward' with dtype float and shape [?] [[node default/prev_reward (defined at /home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo_policy_graph.py:144) = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
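Errors for cartpole_lstm.py:
Traceback (most recent call last):
File "cartpole_lstm.py", line 191, in <module>
"lstm_use_prev_action_reward": args.
File "/home/llu/.local/lib/python3.6/site-packages/ray/tune/tune.py", line 170, in run_experiments
runner.step()
File "/home/llu/.local/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 218, in step
trial.config)))
ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 3 CPUs, 0 GPUs but the cluster has only 2 CPUs, 0 GPUs. Pass queue_trials=True in ray.tune.run_experiments() or on the command line to queue trials until the cluster scales up.
You can adjust the resource requests of RLlib agents by setting num_workers and other configs. See the DEFAULT_CONFIG defined by each agent for more info.
The config of this agent is: {'num_sgd_iter': 5, 'model': {'use_lstm': True, 'lstm_use_prev_action_reward': False}, 'env': 'cartpole_stateless'}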
Try num_workers: 1, that's just your machine not having enough CPUs.
Settings:
configs = {
    "PPO": {
        "num_sgd_iter": 5,
    },
    "IMPALA": {
        "num_workers": 2,
        "num_gpus": 0,
        "vf_loss_coeff": 0.01,
    },
}
I only have 2 CPUs locally. I get the same error if I set "num_workers": 1.
One CPU is reserved for the learner, so each worker adds 1.
I think you forgot to set workers for PPO...?
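The accounting above can be sketched as a small calculation. This is only an illustration: `cpus_requested` is a hypothetical helper, not an RLlib function, and it assumes one CPU for the driver/learner and one CPU per worker (the `num_cpus_for_driver` and `num_cpus_per_worker` values that appear in the agent config reported in this issue).

```python
def cpus_requested(num_workers, num_cpus_for_driver=1, num_cpus_per_worker=1):
    """Hypothetical helper: CPUs a trial asks for (driver/learner plus workers)."""
    return num_cpus_for_driver + num_workers * num_cpus_per_worker


# Two workers -> 3 CPUs, which does not fit on a 2-CPU machine.
print(cpus_requested(2))

# One worker -> 2 CPUs, which fits on a 2-CPU machine.
print(cpus_requested(1))
```

Under these assumptions, a trial only fits when the result is no larger than the number of CPUs Ray reports for the cluster.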
Ah, you're right and it works.
configs = {
    "PPO": {
        "num_sgd_iter": 5,
        "num_workers": 1,
    },
    "IMPALA": {
        "num_workers": 2,
        "num_gpus": 0,
        "vf_loss_coeff": 0.01,
    },
}
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 2/2 CPUs, 0/0 GPUs
Memory usage on this node: 8.6/11.4 GB
Result logdir: /home/llu/ray_results/test
RUNNING trials:
Tested, and there is no clipping error. All data points were clean and normalized to within (-1.0, 1.0) before calling agent.train(). My question is about restoring and compute_action(). At that stage, the original data points were changed by the agent and scaled up by a factor of almost 10^8. By the way, the cartpole LSTM example contains only the training process.
I see, is it possible to attach a script I can run to reproduce this issue?
Shared files are under: https://www.dropbox.com/sh/p79bjj2ysbmpukp/AAAnn1FLJ6go85Elgw-qvOBXa?dl=0 Locally, I have one 8G redis database. In the link, I changed and tested with random numbers. Errors remain the same as fetching data from database.
Main issue is about restoring the saved LSTM model and executing out of sample test.
Many thanks!
Thanks! I should be able to look into this more Friday.
@aGiant, I couldn't get your script to run due to the redis data dependencies, but maybe try this example out (building on cartpole_lstm). Add
"checkpoint_freq": 1
to the cartpole_lstm training config to save checkpoints.
import ray
from ray import tune
import numpy as np

from ray.rllib.agents.ppo import PPOAgent
from ray.rllib.examples.cartpole_lstm import CartPoleStatelessEnv

if __name__ == "__main__":
    ray.init()
    tune.register_env("cartpole_stateless", lambda _: CartPoleStatelessEnv())
    agent = PPOAgent(env="cartpole_stateless", config={
        "num_workers": 0,
        "model": {
            "use_lstm": True,
            "lstm_use_prev_action_reward": True,
        },
    })

    # Or wherever your checkpoint is
    agent.restore("/home/eric/ray_results/test/PPO_cartpole_stateless_0_2019-01-11_22-00-56i68jgju6/checkpoint_6/checkpoint-6")
    env = CartPoleStatelessEnv()
    acc = []
    while True:
        obs = env.reset()
        done = False
        # Start each episode with a zero previous action/reward and a fresh LSTM state
        prev_action = np.zeros_like(env.action_space.sample())
        prev_reward = 0
        info = {}
        state = agent.get_policy().get_initial_state()
        total_reward = 0
        while not done:
            action, state, fetch = agent.compute_action(
                obs, state=state, prev_action=prev_action,
                prev_reward=prev_reward, info=info)
            obs, reward, done, info = env.step(action)
            total_reward += reward
            prev_reward = reward
            prev_action = action
        acc.append(total_reward)
        print("Rollout complete, current mean reward", np.mean(acc))
I was able to do this and reproduce the mean training reward of the original agent. My best guess is there is some subtle bug in your rollouts code, so hopefully this helps.
That example worked, many thanks!
@ericl The errors remain the same. The values are much bigger than the original values, and the predicted actions were not normal; almost all were out of range. In Dropbox, the "myEnv_Copy.py" file was corrected, and the errors should be reproducible.
Many thanks!
@aGiant not sure what you mean, when I ran the train.py script with some fixes
if True:
    import numpy as np
    obs = env.reset()
    state = agent.get_policy().get_initial_state()
    prev_rew = 0
    prev_act = np.zeros_like(env.action_space.sample())
    rewards = []
    done = False
    while not done:
        action, state, _ = agent.compute_action(
            obs, state=state, prev_reward=prev_rew, prev_action=prev_act)
        print("action", action)
        obs, rew, done, info = env.step(action)
        rewards.append(rew)
        prev_rew = rew
        prev_act = action  # feed the last action back in on the next step
    print("Episode reward", np.mean(rewards))
I got normal looking actions:
action [ 1.1260735 -1.3211138 0.6528911 -0.22034293 -0.01135487 -0.9091983
1.4385319 -0.63426673 -0.5652876 -0.5376567 1.9951357 -0.31390783
-0.70996547 1.7107437 0.96269166 0.73231137 -1.2653104 1.1182361
-0.05449466 0.59806865 -0.14532986]
action [-0.5608993 -1.0731459 0.44879803 0.44107428 -0.19863613 -0.4424859
-1.2708163 0.5617583 -0.6878831 1.2864232 0.21852352 1.6283256
-0.8759136 -0.6002346 -1.4081013 -0.43308827 -0.7667263 -0.28446013
0.62706804 -0.39129162 1.2432218 ]
As defined in action_space, all actions should be within (-1, 1). With the uncorrected train.py, the output showed that the values in the placeholders were around 1e7 times bigger. Those outputs were the errors from calling compute_action(), and the input observations were correct.
The weights of trained LSTM were also normal.
Ok, the issue was that the filters weren't synced after exactly 1 iteration, so the divisor was 0. This patch fixes it https://github.com/ray-project/ray/pull/3769 (or, you can train() twice and then the filters will be ok).
For the clipping, you'll have to handle that yourself if using agent.compute_action().
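A minimal sketch of manual clipping with np.clip(), assuming a Box-style action space whose bounds are available as `low` and `high` arrays. The bounds and the sample action below are illustrative values, not taken from the environment in this issue.

```python
import numpy as np

# Illustrative bounds for a 3-dimensional action space in [-1, 1].
low = np.full(3, -1.0)
high = np.full(3, 1.0)

# Stand-in for the unclipped value returned by agent.compute_action().
raw_action = np.array([1.1260735, -1.3211138, 0.6528911])

# Clip element-wise before passing the action to env.step().
clipped_action = np.clip(raw_action, low, high)
print(clipped_action)
```

With a real gym Box space, `env.action_space.low` and `env.action_space.high` would take the place of the hard-coded bounds.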
We have trained 2000 times and the final saved models are almost the same as those trained once. Actions were out of range.
I did not understand the clipping part. All of our original data were within (-1, 1), but after being copied to the TensorFlow placeholders in train() or compute_actions() (not sure which), the data stored in the placeholders had become the original data times 1e7.
So far, we have not found where this happens.
The 1e7 is due to the filter standard deviation value being zero, so you end up dividing the observations by epsilon => huge values.
I've checked and that pr does fix the issue. Did you resave a new more? The existing checkpoints have corrupt filter values.
Maybe this is the reason for the problem (in ray/python/ray/rllib/utils/filter.py):
if self.destd: x = x / (self.rs.std + 1e-8)
All the original data gets multiplied by 1e8 when self.destd is True (or non-zero) and self.rs.std is zero.
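To see why a zero standard deviation produces values around 1e7, here is a small sketch of just the division in the filter line quoted above. It is a standalone illustration, not the RLlib filter itself: `destd` is a hypothetical function name, and the filter's mean-subtraction step is left out.

```python
EPS = 1e-8  # epsilon from the quoted line in filter.py


def destd(x, std):
    """Divide an observation by (std + EPS), as in the quoted filter line."""
    return x / (std + EPS)


obs = 0.191  # one of the original observation values from the report

# Healthy filter: std is well above zero, so the result stays small.
print(destd(obs, std=0.5))

# Corrupt filter: std == 0, so the value is divided by EPS alone,
# which scales it up by a factor of about 1e8.
print(destd(obs, std=0.0))
```

The second call returns roughly 1.91e7, which is the same order of magnitude as the values reported in the placeholder.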
*new model
@ericl New models are uploaded to the link: https://www.dropbox.com/sh/p79bjj2ysbmpukp/AAAnn1FLJ6go85Elgw-qvOBXa?dl=0 MyEnv_Copy.py and train.py remain the same as before.
We trained the PPO agent 2000 times, saved under agents_sc. Almost all of the actions were outside our pre-defined ranges in myEnv_Copy.py.
Yes, exactly.
The bad filter saving in that corner case is fixed in that PR. Or, you can disable the filter during training with "observation_filter": "NoFilter"
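A sketch of what that setting could look like in an agent config. Only the `observation_filter` key and its `"NoFilter"` value come from this thread; the other keys mirror configs used elsewhere in the issue and are shown purely for context.

```python
# Config fragment: with "NoFilter", observations are passed through unchanged
# instead of being normalized by running mean/std statistics.
config = {
    "num_workers": 1,
    "observation_filter": "NoFilter",  # the reported config used "MeanStdFilter"
    "model": {
        "use_lstm": True,
        "lstm_use_prev_action_reward": True,
    },
}

print(config["observation_filter"])
```

Since the inputs in this issue were already normalized to (-1, 1) before training, dropping the filter may be a reasonable choice here, though that is a judgment call for the environment at hand.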
As I said before, the actions will not be clipped. You can simply np.clip() them afterwards.
They are now in a reasonable range and not 1e7 right?
Ah, tested, you're right and now all values are normal. Many thanks!
Is there any other configuration that we should pay special attention to?
Great! Not sure about other configs besides the observation filter setting.
hi, im running ray on kubernetes with 4 machines with 4 cores each. im following the guide on 'Deploying on Kubernetes', and ive modified the worker and head yaml so that each requests 1 cpu. with this configuration, i want to run the following example with 15 workers:
rllib train --env=CartPole-v1 --run=PPO --config '{"num_workers": 15}' --queue_trials=True --stop '{"episode_reward_mean": 500, "timesteps_total": 200000}'
when i use 'queue_trials=True' i get:
rllib: error: unrecognized arguments: --queue_trials=True
and when i dont use it, i get:
ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 16 CPUs, 0 GPUs but the cluster has only 4 CPUs, 0 GPUs. Pass queue_trials=True in ray.tune.run() or on the command line to queue trials until the cluster scales up.
ive built ray from source.
kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
ray-head-f9dfbf6c8-l7fcj      1/1     Running   0          132m
ray-worker-785b9c8576-8zmhq   1/1     Running   0          132m
ray-worker-785b9c8576-lzsgq   1/1     Running   0          132m
ray-worker-785b9c8576-pm2m8   1/1     Running   0          132m
how can i accomplish the following command:
rllib train --env=CartPole-v1 --run=PPO --config '{"num_workers": 15}' --queue_trials=True --stop '{"episode_reward_mean": 500, "timesteps_total": 200000}'
thank you for all your work!
I think it's now "--queue-trials".
Yes, it is. Without the '=True'. Thank you
System information
Describe the problem
After update from 0.6.0 to 0.6.1, some errors are really unclear. The original data looks like this: 5.23047522e-02 2.17806064e-02 1.10403430e-02 5.32237291e-01 4.24376049e-04 2.36710180e-02 -1.31596169e-02 4.35462594e-01 -2.03338396e-02 1.44634377e-02 2.79079884e-01 -2.52940543e-02 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01
But the data in the placeholder looks like this:
5.23047522e+06, 2.17806064e+06, 1.10403430e+06, 5.32237291e+07, 4.24376049e+04, 2.36710180e+06, -1.31596169e+06, 4.35462594e+07, -2.03338396e+06, 1.44634377e+06, 2.79079884e+07, -2.52940543e+06, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07,
Is there any documentation that explains why the numbers in the placeholder were rescaled? Furthermore, could you give us a complete usage example of training and restoring a PPO LSTM agent with the previous action/reward inputs enabled (lstm_use_prev_action_reward=True)? Many thanks!
Source code / logs
Setting of agent: {'monitor': False, 'log_level': 'INFO', 'callbacks': {'on_episode_start': None, 'on_episode_step': None, 'on_episode_end': None, 'on_sample_end': None, 'on_train_result': None}, 'model': {'conv_filters': None, 'conv_activation': 'relu', 'fcnet_activation': 'tanh', 'fcnet_hiddens': [256, 256], 'free_log_std': False, 'squash_to_range': False, 'use_lstm': True, 'max_seq_len': 864, 'lstm_cell_size': 256, 'lstm_use_prev_action_reward': True, 'framestack': True, 'dim': 84, 'channel_major': False, 'grayscale': False, 'zero_mean': True, 'custom_preprocessor': None, 'custom_model': None, 'custom_options': {}}, 'optimizer': {}, 'gamma': 0.98, 'horizon': None, 'env_config': {}, 'env': None, 'clip_rewards': None, 'clip_actions': True, 'preprocessor_pref': 'deepmind', 'num_workers': 1, 'num_gpus': 0, 'num_cpus_per_worker': 1, 'num_gpus_per_worker': 0, 'custom_resources_per_worker': {}, 'num_cpus_for_driver': 1, 'num_envs_per_worker': 1, 'sample_batch_size': 2592, 'train_batch_size': 25920, 'batch_mode': 'truncate_episodes', 'sample_async': False, 'observation_filter': 'MeanStdFilter', 'synchronize_filters': True, 'tf_session_args': {'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2, 'gpu_options': {'allow_growth': True}, 'log_device_placement': False, 'device_count': {'CPU': 1}, 'allow_soft_placement': True}, 'local_evaluator_tf_session_args': {'intra_op_parallelism_threads': 8, 'inter_op_parallelism_threads': 8}, 'compress_observations': False, 'collect_metrics_timeout': 180, 'input': 'sampler', 'input_evaluation': None, 'output': None, 'output_compress_columns': ['obs', 'new_obs'], 'output_max_file_size': 67108864, 'multiagent': {'policy_graphs': {}, 'policy_mapping_fn': None, 'policies_to_train': None}, 'use_gae': True, 'lambda': 1.0, 'kl_coeff': 0.2, 'sgd_minibatch_size': 864, 'num_sgd_iter': 30, 'lr': 5e-05, 'lr_schedule': None, 'vf_share_layers': False, 'vf_loss_coeff': 1.0, 'entropy_coeff': 0.0, 'clip_param': 0.3, 'vf_clip_param': 10.0, 'kl_target': 0.01, 'simple_optimizer': False, 'straggler_mitigation': False}
Errors:
Traceback (most recent call last):
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/utils/tf_run_builder.py", line 47, in get
self.feed_dict, os.environ.get("TF_TIMELINE_DIR"))
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/utils/tf_run_builder.py", line 85, in run_timeline
fetches = sess.run(ops, feed_dict=feed_dict)
File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'default/prev_reward' with dtype float and shape [?] [[node default/prev_reward (defined at /home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo_policy_graph.py:144) = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'default/prev_reward', defined at:
File "/home/llu/c7_triangle/train.py", line 56, in <module>
agent = ppo.PPOAgent(config=config, env="my_env")
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 244, in __init__
Trainable.__init__(self, config, logger_creator)
File "/home/llu/.local/lib/python3.6/site-packages/ray/tune/trainable.py", line 87, in __init__
self._setup(copy.deepcopy(self.config))
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 312, in _setup
self._init()
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo.py", line 75, in _init
self.env_creator, self._policy_graph)
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 413, in make_local_evaluator
config["local_evaluator_tf_session_args"]
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 533, in _make_evaluator
output_creator=output_creator)
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 272, in __init__
self._build_policy_map(policy_dict, policy_config)
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 579, in _build_policy_map
policy_map[name] = cls(obs_space, act_space, merged_conf)
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo_policy_graph.py", line 144, in __init__
tf.float32, [None], name="prev_reward")
File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1747, in placeholder
return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5206, in placeholder
"Placeholder", dtype=dtype, shape=shape, name=name)
File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'default/prev_reward' with dtype float and shape [?] [[node default/prev_reward (defined at /home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo_policy_graph.py:144) = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/llu/c7_triangle/train.py", line 79, in <module>
act = agent.compute_action(env.observation, state= agent.local_evaluator.policy_map["default"].get_initial_state())
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 362, in compute_action
policy_id=policy_id)
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 496, in for_policy
return func(self.policy_map[policy_id])
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 361, in <lambda>
lambda p: p.compute_single_action(filtered_obs, state),
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/evaluation/policy_graph.py", line 99, in compute_single_action
[obs], [[s] for s in state], episodes=[episode])
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/evaluation/tf_policy_graph.py", line 163, in compute_actions
return builder.get(fetches)
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/utils/tf_run_builder.py", line 50, in get
self.fetches, self.feed_dict))