I can't answer your question, but could you try plotting by number of timesteps on the x-axis? Maybe sampling in ray is slower than it is in other libraries.
Also, try reducing `num_sgd_iter` for PPO (the default is 30, which seems high), and try PPO with TensorFlow.
@matej-macak thanks for filing this! It seems to be simply related to the default hyperparameters.
1) Could you try setting `kl_coeff` in your config to 0.0?
2) Also, it's probably better to use no discounting at all, since it is a context-less env (set `gamma` to 0.0).
Please let us know whether this works.
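A minimal sketch of what those two overrides look like, assuming a plain `config` dict that gets merged into PPO's default config (the dict name is just a placeholder):

```python
# Suggested overrides on top of PPO's defaults (sketch only).
config = {
    "kl_coeff": 0.0,  # disable the KL penalty term
    "gamma": 0.0,     # no discounting, since the env is context-less
}
```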
Hi all,
Thank you very much - a really amazing community to give such a quick set of answers. Here are the tests that I have run:
In this test, I have set `kl_coeff` to 0.0 and `gamma` to 0.0. Agreed that this is a context-less environment, but I wanted to compare like for like (as in the stable-baselines case). This definitely helped with convergence, although it is still a bit slower. It got me going on a good trajectory though; I could probably use `tune` to find a set of coefficients that would increase this even further. Is there any reason why the
I set `num_sgd_iter` to 15 (not sure what the right value should be in this case).
@regproj - Yes, I have noticed that the timesteps are lower than in the stable-baselines case, but what I am interested in is the wall time, not the timestep efficiency (i.e. for the same level of resources, what speed do I get?).
Overall, I find ray amazing and it definitely outperforms stable-baselines on many atari benchmarks I have tested. As the problem I am trying to solve is more similar to the `TestEnv` class, I wanted to solve this toy example before deploying a cluster for it.
Hey @matej-macak, actually, yeah, I have also noticed very slow learning convergence having to do with the (continuous) action space being bounded for PPO. In the case of bounded continuous actions, we simply clip the output actions before sending them to the env. I'll look into this further.
Does the action distribution for PPO (or DDPG, SAC, etc.) start off as a unit Gaussian? If so, would it be better for convergence if we set the environment action bounds to something like -1, 1 and rescaled them inside the environment, as in the sketch below?
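For illustration, a rough sketch of the rescaling idea from the question, written as a plain gym `ActionWrapper` (this shows the idea only and is not RLlib's built-in `normalize_actions` implementation; the class name is made up here):

```python
import gym
import numpy as np


class RescaleToUnitBounds(gym.ActionWrapper):
    """Expose a [-1, 1] action space to the agent and map actions back to the
    wrapped env's original bounds before stepping."""

    def __init__(self, env):
        super().__init__(env)
        self.orig_low = env.action_space.low
        self.orig_high = env.action_space.high
        self.action_space = gym.spaces.Box(
            -1.0, 1.0, shape=env.action_space.shape, dtype=np.float32)

    def action(self, action):
        # Linearly map [-1, 1] -> [orig_low, orig_high].
        return self.orig_low + (action + 1.0) * 0.5 * (self.orig_high - self.orig_low)
```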
I have noticed that part of the slowness can be explained by parameter tuning (i.e. the system is very sensitive to `train_batch_size`, `num_sgd_iter`, `rollout_fragment_length` and `sgd_minibatch_size`). I am assuming that, given these are deterministic one-step environments, it is better not to have a final update batch size larger than the number of workers, as the batch probably contains a high number of repeated actions and steps, which leads to slower training.
Lowering these parameters, however, leads more quickly to a memory leak and crash, which I reported in #8473.
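For context, here is my rough reading of how the settings mentioned above interact in RLlib's PPO (a sketch with example values, not an authoritative description of the implementation):

```python
# Sketch of how the sampling/training settings relate (example values only).
config_sampling = {
    "num_workers": 2,
    # Each rollout worker collects this many env steps per sample call.
    "rollout_fragment_length": 16,
    # Fragments are concatenated until this many steps form one train batch.
    "train_batch_size": 64,
    # The train batch is then split into minibatches of this size ...
    "sgd_minibatch_size": 16,
    # ... and iterated over this many times per train batch.
    "num_sgd_iter": 4,
}
```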
I think I found a quite satisfying hyperparam solution, which converges quite fast now (within <1min). There will be a PR today that also gets rid of the Box-limit problem in our PPO (Box(0.0, 1.0) learns ok, but e.g. Box(1.0, 3.0) doesn't). ...
Here is the config that works quite well on my end now. It basically simplifies everything a lot - but then again, it's also a very simple env.
```python
config = {
    "num_workers": 0,
    "entropy_coeff": 0.00001,
    "num_sgd_iter": 4,
    "vf_loss_coeff": 0.0,
    # "vf_clip_param": 100.0,
    # "grad_clip": 1.0,
    "lr": 0.0005,
    # State doesn't matter -> Set both gamma and lambda to 0.0.
    "lambda": 0.0,
    "gamma": 0.0,
    "clip_param": 0.1,
    "kl_coeff": 0.0,
    "train_batch_size": 64,
    "sgd_minibatch_size": 16,
    "normalize_actions": True,
    "clip_actions": False,
    # Use a very simple model for faster convergence.
    "model": {
        "fcnet_hiddens": [8],
    },
    "use_pytorch": True,  # or False for the TF stack
}
```
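For completeness, a minimal sketch of plugging this config into a trainer, assuming the `PPOTrainer` API of this RLlib release line and a `TestEnv` like the one described in the issue (sketched further below):

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
trainer = PPOTrainer(env=TestEnv, config=config)  # TestEnv: the one-step env from the issue
for i in range(50):
    result = trainer.train()
    print(i, result["episode_reward_mean"])
```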
Closing this issue.
On another note: it could also be that baselines treats the action space as type=int ... in which case it would be easier to reach the 1000 reward, because there would only be two choices (0 and 1) for each action component.
Hi @sven1977, the `action_space` is of type `float`. I have checked the `action` vector in the step and it is not producing binary results. I think the hyperparameter search definitely helps. I have been working on a similar, more complex problem that inspired this question, and my finding is that `rllib` seems to have more stable but slower convergence than stable-baselines, which is faster but can get stuck in local minima a bit more easily.
I was reading this issue and wondered if there is information on the RLlib website about choosing the hyperparameters. I don't think I found anywhere in the documentation that, if the state doesn't matter, we can set both gamma and lambda to 0.0.
RLlib converges slowly on a simple environment compared to equivalent algorithms from other libraries under the same conditions (see the results below). Is this expected, or is there an approach that I overlooked when running the program?
The environment itself is trivial (the reward is simply the sum of all the continuous actions), yet the RLlib algorithm takes a fairly long time to converge. I have tried different things, like playing with the learning rate and other parameters, but none of them actually changed the very gradual and slow learning process.
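For reference, a sketch of what such an environment could look like; the action dimensionality and bounds here are assumptions based on the discussion above (a maximum reward of ~1000 and actions between 0 and 1), not the original script:

```python
import gym
import numpy as np
from gym.spaces import Box


class TestEnv(gym.Env):
    """One-step, context-free env: the reward is simply the sum of the continuous actions."""

    def __init__(self, config=None):
        # 1000 action components in [0, 1] is an assumption, chosen so that the
        # optimal return matches the ~1000 reward mentioned in the thread.
        self.action_space = Box(0.0, 1.0, shape=(1000,), dtype=np.float32)
        self.observation_space = Box(0.0, 1.0, shape=(1,), dtype=np.float32)

    def reset(self):
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        # The episode ends after a single step; the observation never matters.
        return np.zeros(1, dtype=np.float32), float(np.sum(action)), True, {}
```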
Steps to reproduce the issue:
RLlib code
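(The original RLlib reproduction script is not shown here; the following is only a hedged sketch of that kind of setup, assuming `tune.run` and the `TestEnv` sketched above.)

```python
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    config={
        "env": TestEnv,        # the one-step env sketched above
        "num_workers": 2,
        "use_pytorch": True,   # or False for the TF stack
    },
    stop={"timesteps_total": 100000},
)
```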
Stable baselines code
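(Likewise only a sketch, assuming stable-baselines 2's `PPO2`; the actual script from the issue may differ.)

```python
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

env = TestEnv()  # same one-step env as sketched above
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=100000)
```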