[rllib] change in output from agent.compute_actions() post 0.6.0

aGiant commented 5 years ago

System information

Ubuntu 18.04:
Ray installed from "pip3 install -U ray":
Ray version 0.6.1:
Python 3.6:
Exact command to reproduce: act = agent.compute_action(env.observation, state= agent.local_evaluator.policy_map["default"].get_initial_state()):

Describe the problem

After update from 0.6.0 to 0.6.1, some errors are really unclear. The original data looks like this: 5.23047522e-02 2.17806064e-02 1.10403430e-02 5.32237291e-01 4.24376049e-04 2.36710180e-02 -1.31596169e-02 4.35462594e-01 -2.03338396e-02 1.44634377e-02 2.79079884e-01 -2.52940543e-02 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01 1.91000000e-01

But the data in the placeholder looks like this:

5.23047522e+06, 2.17806064e+06, 1.10403430e+06, 5.32237291e+07, 4.24376049e+04, 2.36710180e+06, -1.31596169e+06, 4.35462594e+07, -2.03338396e+06, 1.44634377e+06, 2.79079884e+07, -2.52940543e+06, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07, 1.91000000e+07,

Is there any documentation that explains why the numbers in the placeholder were changed? Furthermore, could you give us one complete usage example of training and restoring a PPO LSTM model with the previous action enabled (lstm_use_prev_action_reward=True)? Many thanks!

Source code / logs

Setting of agent: {'monitor': False, 'log_level': 'INFO', 'callbacks': {'on_episode_start': None, 'on_episode_step': None, 'on_episode_end': None, 'on_sample_end': None, 'on_train_result': None}, 'model': {'conv_filters': None, 'conv_activation': 'relu', 'fcnet_activation': 'tanh', 'fcnet_hiddens': [256, 256], 'free_log_std': False, 'squash_to_range': False, 'use_lstm': True, 'max_seq_len': 864, 'lstm_cell_size': 256, 'lstm_use_prev_action_reward': True, 'framestack': True, 'dim': 84, 'channel_major': False, 'grayscale': False, 'zero_mean': True, 'custom_preprocessor': None, 'custom_model': None, 'custom_options': {}}, 'optimizer': {}, 'gamma': 0.98, 'horizon': None, 'env_config': {}, 'env': None, 'clip_rewards': None, 'clip_actions': True, 'preprocessor_pref': 'deepmind', 'num_workers': 1, 'num_gpus': 0, 'num_cpus_per_worker': 1, 'num_gpus_per_worker': 0, 'custom_resources_per_worker': {}, 'num_cpus_for_driver': 1, 'num_envs_per_worker': 1, 'sample_batch_size': 2592, 'train_batch_size': 25920, 'batch_mode': 'truncate_episodes', 'sample_async': False, 'observation_filter': 'MeanStdFilter', 'synchronize_filters': True, 'tf_session_args': {'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2, 'gpu_options': {'allow_growth': True}, 'log_device_placement': False, 'device_count': {'CPU': 1}, 'allow_soft_placement': True}, 'local_evaluator_tf_session_args': {'intra_op_parallelism_threads': 8, 'inter_op_parallelism_threads': 8}, 'compress_observations': False, 'collect_metrics_timeout': 180, 'input': 'sampler', 'input_evaluation': None, 'output': None, 'output_compress_columns': ['obs', 'new_obs'], 'output_max_file_size': 67108864, 'multiagent': {'policy_graphs': {}, 'policy_mapping_fn': None, 'policies_to_train': None}, 'use_gae': True, 'lambda': 1.0, 'kl_coeff': 0.2, 'sgd_minibatch_size': 864, 'num_sgd_iter': 30, 'lr': 5e-05, 'lr_schedule': None, 'vf_share_layers': False, 'vf_loss_coeff': 1.0, 'entropy_coeff': 0.0, 'clip_param': 0.3, 'vf_clip_param': 
10.0, 'kl_target': 0.01, 'simple_optimizer': False, 'straggler_mitigation': False}

Errors: Traceback (most recent call last): File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/utils/tf_run_builder.py", line 47, in get self.feed_dict, os.environ.get("TF_TIMELINE_DIR")) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/utils/tf_run_builder.py", line 85, in run_timeline fetches = sess.run(ops, feed_dict=feed_dict) File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run run_metadata_ptr) File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run feed_dict_tensor, options, run_metadata) File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run run_metadata) File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'default/prev_reward' with dtype float and shape [?] [[node default/prev_reward (defined at /home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo_policy_graph.py:144) = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'default/prev_reward', defined at: File "/home/llu/c7_triangle/train.py", line 56, in agent = ppo.PPOAgent(config=config, env="my_env") File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 244, in init Trainable.init(self, config, logger_creator) File "/home/llu/.local/lib/python3.6/site-packages/ray/tune/trainable.py", line 87, in init self._setup(copy.deepcopy(self.config)) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 312, in _setup self._init() File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo.py", line 75, in _init self.env_creator, self._policy_graph) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 413, in make_local_evaluator config["local_evaluator_tf_session_args"] File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 533, in _make_evaluator output_creator=output_creator) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 272, in init self._build_policy_map(policy_dict, policy_config) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 579, in _build_policy_map policy_map[name] = cls(obs_space, act_space, merged_conf) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo_policy_graph.py", line 144, in init tf.float32, [None], name="prev_reward") File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1747, in placeholder return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name) File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5206, in placeholder "Placeholder", dtype=dtype, shape=shape, name=name) File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File 
"/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func return func(*args, **kwargs) File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op op_def=op_def) File "/home/llu/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in init self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'default/prev_reward' with dtype float and shape [?] [[node default/prev_reward (defined at /home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo_policy_graph.py:144) = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/llu/c7_triangle/train.py", line 79, in act = agent.compute_action(env.observation, state= agent.local_evaluator.policy_map["default"].get_initial_state()) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 362, in compute_action policy_id=policy_id) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 496, in for_policy return func(self.policy_map[policy_id]) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 361, in lambda p: p.compute_single_action(filtered_obs, state), File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/evaluation/policy_graph.py", line 99, in compute_single_action [obs], [[s] for s in state], episodes=[episode]) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/evaluation/tf_policy_graph.py", line 163, in compute_actions return builder.get(fetches) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/utils/tf_run_builder.py", line 50, in get self.fetches, self.feed_dict))

ericl commented 5 years ago

Hey, this is definitely not expected. Can you reproduce this with https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/cartpole_lstm.py (this is the LSTM integration test and toy example).

Edit: 0.6.0 has an unfortunate bug in action clipping that clips action values sent to the learner as well and not just the environment (see the release notes). It looks like fixing this is the cause of the change. In 0.6.1 we won't auto clip actions returned by compute_action(). However you can clip manually with np.clip() which should give the same result as 0.6.0.

Does this diagnosis seem right?

ericl commented 5 years ago

Update: actually not sure if clipping is the cause here, would be good to see if the cartpole lstm example is affected.

And is this the error you mean? InvalidArgumentError: You must feed a value for placeholder tensor 'default/prev_reward' with dtype float and shape [?] [[node default/prev_reward (defined at /home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/ppo/ppo_policy_graph.py:144) = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

aGiant commented 5 years ago

Errors for cartpole_lstm.py :

Traceback (most recent call last): File "cartpole_lstm.py", line 191, in "lstm_use_prev_action_reward": args. File "/home/llu/.local/lib/python3.6/site-packages/ray/tune/tune.py", line 170, in run_experiments runner.step() File "/home/llu/.local/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 218, in step trial.config))) ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 3 CPUs, 0 GPUs but the cluster has only 2 CPUs, 0 GPUs. Pass queue_trials=True in ray.tune.run_experiments() or on the command line to queue trials until the cluster scales up.

You can adjust the resource requests of RLlib agents by setting num_workers and other configs. See the DEFAULT_CONFIG defined by each agent for more info.

The config of this agent is: {'num_sgd_iter': 5, 'model': {'use_lstm': True, 'lstm_use_prev_action_reward': False}, 'env': 'cartpole_stateless'}

ericl commented 5 years ago

Try num_workers: 1, that's just your machine not having enough CPUs.

aGiant commented 5 years ago

Settings: configs = { "PPO": { "num_sgd_iter": 5, }, "IMPALA": { "num_workers": 2, "num_gpus": 0, "vf_loss_coeff": 0.01, }, }

I only have 2 CPUs locally. I get the same error with "num_workers": 1.

ericl commented 5 years ago

One CPU is reserved for the learner, so each worker adds 1.

I think you forgot to set workers for PPO...?
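The CPU accounting described here can be sketched as simple arithmetic. This is an illustrative helper, not an RLlib function: `required_cpus` is a hypothetical name, and its defaults mirror the `num_cpus_for_driver` and `num_cpus_per_worker` values shown in the agent config earlier in the thread.

```python
def required_cpus(num_workers, num_cpus_for_driver=1, num_cpus_per_worker=1):
    """Estimate the CPUs one trial requests: the driver (learner) plus all workers."""
    return num_cpus_for_driver + num_workers * num_cpus_per_worker


# Assumption: PPO runs with 2 workers when num_workers is not set, which is
# consistent with the "trial requested 3 CPUs" message above.
print(required_cpus(2))  # 3 CPUs, too many for a 2-CPU machine
print(required_cpus(1))  # 2 CPUs, fits on a 2-CPU machine
```

Under these assumptions, a 2-CPU machine can host the learner and exactly one worker.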

aGiant commented 5 years ago

Ah, you're right and it works.

configs = { "PPO": { "num_sgd_iter": 5, "num_workers": 1, }, "IMPALA": { "num_workers": 2, "num_gpus": 0, "vf_loss_coeff": 0.01, }, }

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 2/2 CPUs, 0/0 GPUs
Memory usage on this node: 8.6/11.4 GB
Result logdir: /home/llu/ray_results/test
RUNNING trials:

PPO_cartpole_stateless_0: RUNNING

aGiant commented 5 years ago

Tested, and there is no clipping error. All data points were clean and normalized to within (-1.0, 1.0) before calling agent.train(). My question is about restoring and compute_action(). At that stage, the original data points were changed by the agent, scaled by almost 10^8. By the way, the cartpole LSTM example contains only the training process.

ericl commented 5 years ago

I see, is it possible to attach a script I can run to reproduce this issue?

aGiant commented 5 years ago

The shared files are at: https://www.dropbox.com/sh/p79bjj2ysbmpukp/AAAnn1FLJ6go85Elgw-qvOBXa?dl=0 Locally, I have an 8 GB Redis database. For the shared version, I replaced the data with random numbers and tested again. The errors are the same as when fetching data from the database.

The main issue is restoring the saved LSTM model and running an out-of-sample test.

Many thanks!

ericl commented 5 years ago

Thanks! I should be able to look into this more Friday.

ericl commented 5 years ago

@aGiant , I couldn't get your script to run due to the redis data dependencies, but maybe try this example out (building on cartpole_lstm):

Add "checkpoint_freq": 1 to the cartpole_lstm training config to save checkpoints.

Run this:


import ray
from ray import tune
import numpy as np

from ray.rllib.agents.ppo import PPOAgent
from ray.rllib.examples.cartpole_lstm import CartPoleStatelessEnv

if __name__ == "__main__":
    ray.init()
    # The registered name must match the env name passed to PPOAgent below.
    tune.register_env("cartpole_stateless", lambda _: CartPoleStatelessEnv())
    agent = PPOAgent(
        env="cartpole_stateless",
        config={
            "num_workers": 0,
            "model": {
                "use_lstm": True,
                "lstm_use_prev_action_reward": True,
            },
        })

    # Or wherever your checkpoint is
    agent.restore("/home/eric/ray_results/test/PPO_cartpole_stateless_0_2019-01-11_22-00-56i68jgju6/checkpoint_6/checkpoint-6")

    env = CartPoleStatelessEnv()
    acc = []
    while True:
        obs = env.reset()
        done = False
        # Start each episode with a zero previous action/reward and a fresh LSTM state.
        prev_action = np.zeros_like(env.action_space.sample())
        prev_reward = 0
        info = {}
        state = agent.get_policy().get_initial_state()
        total_reward = 0
        while not done:
            action, state, fetch = agent.compute_action(
                obs, state=state, prev_action=prev_action,
                prev_reward=prev_reward, info=info)
            obs, reward, done, info = env.step(action)
            total_reward += reward
            prev_reward = reward
            prev_action = action
        acc.append(total_reward)
        print("Rollout complete, current mean reward", np.mean(acc))



I was able to do this and reproduce the mean training reward of the original agent. My best guess is there is some subtle bug in your rollouts code, so hopefully this helps.

aGiant commented 5 years ago

That example worked, many thanks!

aGiant commented 5 years ago

@ericl The errors remain the same. The values are much bigger than the original values, and the predicted actions were not normal; almost all of them were out of range. In Dropbox, the "myEnv_Copy.py" file has been corrected, so the errors should now be reproducible.

Many thanks!

ericl commented 5 years ago

@aGiant not sure what you mean, when I ran the train.py script with some fixes

if True:
    import numpy as np
    obs = env.reset()
    state = agent.get_policy().get_initial_state()
    prev_rew = 0
    prev_act = np.zeros_like(env.action_space.sample())
    rewards = []
    done = False
    while not done:
        action, state, _ = agent.compute_action(
            obs, state=state, prev_reward=prev_rew, prev_action=prev_act)
        print("action", action)
        obs, rew, done, info = env.step(action)
        rewards.append(rew)
        prev_rew = rew
        prev_act = action  # keep the previous action in sync for the next step
    print("Episode reward", np.mean(rewards))

I got normal looking actions:

action [ 1.1260735  -1.3211138   0.6528911  -0.22034293 -0.01135487 -0.9091983
  1.4385319  -0.63426673 -0.5652876  -0.5376567   1.9951357  -0.31390783
 -0.70996547  1.7107437   0.96269166  0.73231137 -1.2653104   1.1182361
 -0.05449466  0.59806865 -0.14532986]
action [-0.5608993  -1.0731459   0.44879803  0.44107428 -0.19863613 -0.4424859
 -1.2708163   0.5617583  -0.6878831   1.2864232   0.21852352  1.6283256
 -0.8759136  -0.6002346  -1.4081013  -0.43308827 -0.7667263  -0.28446013
  0.62706804 -0.39129162  1.2432218 ]

aGiant commented 5 years ago

As defined in action_space, all actions should be within (-1, 1). With the uncorrected train.py, the output showed that the values in the placeholders were around 1e7 times bigger. Those outputs came from the error messages when calling compute_actions(), and the input observations were correct.

The weights of trained LSTM were also normal.

ericl commented 5 years ago

Ok, the issue was that the filters weren't synced after exactly 1 iteration, so the divisor was 0. This patch fixes it https://github.com/ray-project/ray/pull/3769 (or, you can train() twice and then the filters will be ok).

For the clipping, you'll have to handle that yourself if using agent.compute_action().
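A minimal sketch of clipping manually with np.clip(), assuming the action space is a box with bounds (-1, 1) as described in this thread. The `low` and `high` arrays are hypothetical stand-ins for `env.action_space.low` and `env.action_space.high`, and `raw_action` reuses the first three values of the action output shown above.

```python
import numpy as np

# Hypothetical bounds standing in for env.action_space.low / env.action_space.high.
low = np.full(3, -1.0)
high = np.full(3, 1.0)

# Example of an unclipped action, as compute_action() might return it.
raw_action = np.array([1.1260735, -1.3211138, 0.6528911])

# Clip element-wise into [low, high] before passing the action to env.step().
clipped_action = np.clip(raw_action, low, high)
print(clipped_action)
```

Values already inside the bounds pass through unchanged; only the out-of-range components are moved to the nearest bound.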

aGiant commented 5 years ago

We have trained 2000 times, and the final saved models behave almost the same as after training once. The actions were out of range.

I did not understand the clipping part. All of our original data were within (-1, 1), but after being copied to the TensorFlow placeholder in train() or compute_actions() (I am not sure which), the data in the placeholders became the original data times 1e7.

So far, we have not found the cause.

ericl commented 5 years ago

The 1e7 is due to the filter standard deviation value being zero, so you end up dividing the observations by epsilon => huge values.

I've checked and that pr does fix the issue. Did you resave a new more? The existing checkpoints have corrupt filter values.

aGiant commented 5 years ago

Maybe this is the cause of the problem (in ray/python/ray/rllib/utils/filter.py):

if self.destd:
    x = x / (self.rs.std + 1e-8)

All the original data was multiplied by 1e8 whenever self.destd was True (or nonzero).
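To make the arithmetic concrete, here is a minimal NumPy sketch of that division. It is not RLlib's filter class: `destd` is a hypothetical helper that reproduces only the quoted line, and the 0.5 standard deviation in the healthy case is an invented value for illustration.

```python
import numpy as np

EPS = 1e-8  # same epsilon as in the filter line quoted above


def destd(x, running_std):
    """Divide observations by the running standard deviation plus epsilon."""
    return x / (running_std + EPS)


# Two of the observation values reported at the top of this issue.
obs = np.array([5.23047522e-02, 1.91e-01])

# Corrupt filter state: a standard deviation of zero leaves only epsilon
# in the denominator, so every value is scaled by 1e8.
blown_up = destd(obs, np.zeros_like(obs))

# Healthy filter state (hypothetical std of 0.5): values stay in a sane range.
normal = destd(obs, np.full_like(obs, 0.5))

print(blown_up)
print(normal)
```

With a zero standard deviation the first observation becomes roughly 5.23e+06, which matches the values reported in the placeholder.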

ericl commented 5 years ago

*new model

aGiant commented 5 years ago

@ericl New models have been uploaded to the link: https://www.dropbox.com/sh/p79bjj2ysbmpukp/AAAnn1FLJ6go85Elgw-qvOBXa?dl=0 MyEnv_Copy.py and train.py remain the same as before.

We trained the PPO agent 2000 times and saved it under agents_sc. Almost all of the actions were outside the ranges we pre-defined in myEnv_Copy.py.

ericl commented 5 years ago

Yes, exactly.

The bad filter saving in that corner case is fixed in that PR. Or, you can disable the filter during training with "observation_filter": "NoFilter"
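A sketch of what such a config might look like, using only keys that appear elsewhere in this thread. The `num_workers` value is just the one used above for a 2-CPU machine, so adjust it to your setup.

```python
# PPO config with the observation filter turned off.
config = {
    "num_workers": 1,
    # Replaces the "MeanStdFilter" setting from the original agent config.
    "observation_filter": "NoFilter",
    "model": {
        "use_lstm": True,
        "lstm_use_prev_action_reward": True,
    },
}
```

With the filter disabled, observations reach the model unscaled, so they should already be normalized by the environment.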

ericl commented 5 years ago

As I said before, the actions will not be clipped. You can simply np.clip() them afterwards.

They are now in a reasonable range and not 1e7 right?

aGiant commented 5 years ago

Ah, tested, you're right and now all values are normal. Many thanks!

Is there any other configuration that we should pay special attention to?

ericl commented 5 years ago

Great! Not sure about other configs besides the observation filter setting.

buedaswag commented 5 years ago

hi, im running ray on kubernetes with 4 machines with 4 cores each. im following the guide on 'Deploying on Kubernetes', and ive modified the worker and head yaml so that each requests 1 cpu. with this configuration, i want to run the following example with 15 workers:

rllib train --env=CartPole-v1 --run=PPO --config '{"num_workers": 15}' --queue_trials=True --stop '{"episode_reward_mean": 500, "timesteps_total": 200000}'

when i use 'queue_trials=True' i get:

rllib: error: unrecognized arguments: --queue_trials=True

and when i dont use it, i get:

ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 16 CPUs, 0 GPUs but the cluster has only 4 CPUs, 0 GPUs. Pass queue_trials=True in ray.tune.run() or on the command line to queue trials until the cluster scales up.

ive built ray from source.

kubectl get pods

NAME READY STATUS RESTARTS AGE ray-head-f9dfbf6c8-l7fcj 1/1 Running 0 132m ray-worker-785b9c8576-8zmhq 1/1 Running 0 132m ray-worker-785b9c8576-lzsgq 1/1 Running 0 132m ray-worker-785b9c8576-pm2m8 1/1 Running 0 132m

how can i accomplish the following command:

rllib train --env=CartPole-v1 --run=PPO --config '{"num_workers": 15}' --queue_trials=True --stop '{"episode_reward_mean": 500, "timesteps_total": 200000}'

thank you for all your work!

ericl commented 5 years ago

I think it's now "--queue-trials".

buedaswag commented 5 years ago

Yes, it is. Without the '=True'. Thank you
