ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] MeanStdFilter value problem during compute action #2707

Closed whikwon closed 6 years ago

whikwon commented 6 years ago

### System information

### Describe the problem

When I run a trained agent for evaluation, I found that it doesn't perform anywhere near as well as the rewards I monitored during training would suggest.

I've also found that when I restore the agent, the filter (MeanStdFilter) has somewhat strange values. Have you ever heard of such a problem?

### Source code / logs

```
agent = cls(env="prosthetics", config=config)
agent.restore(checkpoint_path)
print(agent.local_evaluator.filters)
>>> {'default': MeanStdFilter((158,), True, True, None, (n=128033777, mean_mean=-1.0975956375310988e+181, mean_std=inf), (n=0, mean_mean=0.0, mean_std=0.0))}
```
whikwon commented 6 years ago

Is there any way to manage rolling means and stds for each coord of observation?

whikwon commented 6 years ago

Sorry, I've found it. MeanStdFilter handles rolling means and stds for each coord of observation.

How can I prevent inf values in MeanStdFilter?

ericl commented 6 years ago

@richardliaw is it possible to get an inf during filter merges without some observation having an inf?

whikwon commented 6 years ago

@ericl I think I've found the problem. MeanStdFilter is based on the blog post https://www.johndcook.com/blog/standard_deviation/, and I think the logic that calculates self._S doesn't match the formula there.

ericl commented 6 years ago

What is the problem?

whikwon commented 6 years ago

I think the code calculating self._S should look like this. Anyway, that might not be what causes the inf problem... hmm

https://github.com/whikwon/ray/blob/a58347bce12cb0fffcd8056b087f9090d8af3154/python/ray/rllib/utils/filter.py#L84-L87
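
For reference, here is a minimal per-coordinate sketch of the recurrence from that blog post (a sketch only, not the actual rllib filter code):

```python
import numpy as np


class RunningStat:
    """Per-coordinate running mean/std, following the recurrence from
    https://www.johndcook.com/blog/standard_deviation/:

        M_k = M_{k-1} + (x_k - M_{k-1}) / k
        S_k = S_{k-1} + (x_k - M_{k-1}) * (x_k - M_k)

    so that the variance after k samples is S_k / (k - 1).
    """

    def __init__(self, shape):
        self._n = 0
        self._M = np.zeros(shape)
        self._S = np.zeros(shape)

    def push(self, x):
        x = np.asarray(x)
        self._n += 1
        if self._n == 1:
            self._M[...] = x
        else:
            old_M = self._M.copy()
            self._M[...] = old_M + (x - old_M) / self._n
            self._S[...] = self._S + (x - old_M) * (x - self._M)

    @property
    def mean(self):
        return self._M

    @property
    def std(self):
        if self._n <= 1:
            return np.ones_like(self._M)  # avoid divide-by-zero before 2 samples
        return np.sqrt(self._S / (self._n - 1))
```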

ericl commented 6 years ago

You can probably add a print() to determine what update causes it to reach an inf value.
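
For example, a throwaway wrapper around a push-style update (the helper below is hypothetical and assumes a RunningStat-like object as in the sketch above):

```python
import numpy as np


def push_with_check(running_stat, x):
    # Hypothetical debugging helper: flag the exact update that makes the
    # incoming observation or the running statistics go non-finite.
    x = np.asarray(x)
    if not np.all(np.isfinite(x)):
        print("non-finite observation:", x)
    running_stat.push(x)
    if not np.all(np.isfinite(running_stat.std)):
        print("running std became non-finite after pushing:", x)
```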

richardliaw commented 6 years ago

cc @eugenevinitsky

eugenevinitsky commented 6 years ago

Are your actions and states also exploding in value?

whikwon commented 6 years ago

@eugenevinitsky I'm checking now. The variance of some values ​​appears to be very large and those might cause the problem.

rl-2 commented 6 years ago

I just wonder if this issue has been fixed or not? I've run into the same issue that the evaluation didn't perform as it supposed to be when the MeanStdFilter was used in the training process.

ericl commented 6 years ago

@RodgerLuo could you log the values of the MeanStdFilter and check if they seem to reasonably reflect the observation inputs? A good place to do this is in FilterManager.synchronize, or you can do it in the filter class itself.
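
For example, a quick check along these lines (using an agent restored as in the snippet at the top of this issue):

```python
# Rough check: print each policy's filter after restoring the checkpoint and
# compare its mean/std against the raw observations the env actually produces.
for policy_id, f in agent.local_evaluator.filters.items():
    print(policy_id, f)
```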

whikwon commented 6 years ago

@RodgerLuo Which environment have you used for training? Is there any abnormal feature in the env?

rl-2 commented 6 years ago

@whikwon @ericl Thanks for all the guidance. Here is what I've found: the environment is Pendulum-v0. In agent.compute_action, I logged the values of obs before and after the filter. As you can see below, the filtered values are blown up by several orders of magnitude.

```
[-0.87094525  0.49138007 -2.64594034]
[-8.70945248e+07  4.91380071e+07 -2.64594034e+08]

[-0.81813912  0.57502033 -1.97910756]
[-8.18139124e+07  5.75020325e+07 -1.97910756e+08]

[-0.78058041  0.62505538 -1.25146981]
[-7.80580405e+07  6.25055383e+07 -1.25146981e+08]

[-0.74561678  0.66637499 -1.08267827]
[-7.45616776e+07  6.66374987e+07 -1.08267827e+08]

[-0.71548291  0.69863024 -0.88289703]
[-71548290.6009737   69863023.92595541 -88289703.02494954]

[-0.69208157  0.7218193  -0.65892435]
[-69208156.94613262  72181929.95562999 -65892435.08048297]

[-0.67686169  0.73611021 -0.41755988]
[-67686169.49385323  73611021.31643993 -41755987.61376046]

[-0.67074812  0.74168521 -0.16547722]
[-67074812.32289593  74168521.30013305 -16547721.62643052]

[-0.67422883  0.73852251  0.09405967]
[-67422882.58069217  73852250.50403133   9405966.79778121]

[-0.68740092  0.72627817  0.35968684]
[-68740091.88581493  72627816.76141532  35968683.70469721]
```

So it seems like the filter is not applied correctly in the code below:

```python
filtered_obs = self.local_evaluator.filters[policy_id](
    observation, update=False)
```
eugenevinitsky commented 6 years ago

Ah, that's the issue. Until the filter has been updated with update=True at least once, the initial observations are going to blow up when passed through it.
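
For intuition: a mean/std filter normalizes roughly as (obs - mean) / (std + eps), so a never-updated filter (mean=0, std=0) effectively multiplies the observation by 1/eps. A tiny sketch of that arithmetic (the normalization form and eps value are assumptions, not the exact rllib code), which matches the ~1e8 scaling in the log above:

```python
import numpy as np

eps = 1e-8                   # illustrative epsilon
mean, std = 0.0, 0.0         # state of a filter that was never updated
obs = np.array([-0.87094525, 0.49138007, -2.64594034])

print((obs - mean) / (std + eps))
# -> roughly [-8.71e+07, 4.91e+07, -2.65e+08], i.e. the raw obs scaled by ~1e8
```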

ericl commented 6 years ago

This is a trained agent though, right? So presumably the filter should have valid values even with update=False, unless the state was not restored correctly. Maybe we should add a check so that we don't try to apply uninitialized filters.

Btw which algorithm is this?

rl-2 commented 6 years ago

Yes, it's a trained agent, and I used APEX DDPG to train.

ericl commented 6 years ago

Could you throw in a print(self.local_evaluator.filters[policy_id].rs)? In particular I'm wondering if you're seeing n=0 (num samples), since I'm having a hard time reproducing this (I always see n > 0, e.g., (n=1490, mean_mean=-0.5082556000843493, mean_std=1.887412412375233) after restoring with DDPG).

ericl commented 6 years ago

Ah, I see in @whikwon's initial post that n > 0, but mean_std is infinity. So the question is whether the inf value is already there or whether it's a bug in restoring the checkpoint.

To help confirm the issue, it would be great to get a way to reproduce this. If there is some script I can run (in a few minutes) to reproduce it, that would be ideal.

whikwon commented 6 years ago

OK, I'll make a log and share it with you. It might take a few days to reproduce the error.

rl-2 commented 6 years ago

@ericl From my end, after throwing in a print(self.local_evaluator.filters) right before action = agent.compute_action(state), I've got this:

```
filters: {'default': MeanStdFilter((3,), True, True, None, (n=0, mean_mean=0.0, mean_std=0.0), (n=0, mean_mean=0.0, mean_std=0.0))}
```

To reproduce the error, what I've done is train with the following hyper-parameters and then evaluate a checkpoint. Please let me know if you can see the same error.

```yaml
pendulum-apex-ddpg:
    env: Pendulum-v0
    run: APEX_DDPG
    checkpoint_freq: 1
    stop:
        training_iteration: 5
    config:
        use_huber: True
        clip_rewards: False
        num_workers: 3
        n_step: 1
        target_network_update_freq: 50000
        tau: 1.0
        observation_filter: "MeanStdFilter"
        optimizer:
            num_replay_buffer_shards: 3
```
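
A minimal evaluation loop for checking such a checkpoint could look like this (a sketch only; cls, config, and checkpoint_path are placeholders set up as in the first post, and the rollout uses the old gym 4-tuple step API):

```python
import gym

# Sketch: restore a trained agent and roll out one episode via compute_action.
agent = cls(env="Pendulum-v0", config=config)
agent.restore(checkpoint_path)

env = gym.make("Pendulum-v0")
state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = agent.compute_action(state)       # observation filter is applied here
    state, reward, done, _ = env.step(action)  # old gym API: (obs, reward, done, info)
    total_reward += reward
print("episode reward:", total_reward)
```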
ericl commented 6 years ago

Thanks @RodgerLuo, I was able to reproduce and fix the issue here: https://github.com/ray-project/ray/pull/2791

The problem was that in APEX the local filter was never updated, and we didn't do global filter synchronization.
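
Conceptually, the synchronization step folds each worker's running-stat deltas into the shared filter and then pushes the merged statistics back out, so every worker and the local evaluator normalize with the same mean/std. A rough sketch with assumed helper names (in the spirit of the filter API, not the actual PR code):

```python
def synchronize(local_filter, worker_filters):
    # Conceptual sketch of global filter synchronization.
    for remote in worker_filters:
        local_filter.apply_changes(remote)  # merge stats gathered since last sync
        remote.clear_buffer()               # avoid double-counting next time
    for remote in worker_filters:
        remote.sync(local_filter)           # push merged stats back to each worker
```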

This seems to be separate from the problem seen by @whikwon