shamilmamedov / flexible_arm


Tuning of the RL and IRL algorithms #27

Open Erfi opened 1 year ago

Erfi commented 1 year ago

Description: The RL and IRL algorithms need tuning to perform well (especially the adversarial ones). We need to spend some time tuning them and see whether they can perform well enough to be used as baselines.

Acceptance Criteria: Tune the hyperparameters so that the algorithms at least behave reasonably and can serve as baselines. Re-run the trained models and check how they perform visually. Record their hyperparameters.

Erfi commented 1 year ago

rl_sac_ppo

Just an update on this front. Originally PPO was used as the main learning algorithm for GAIL, AIRL and Density. None of these algorithms ended up learning how to imitate the expert. As a sanity check I ran SAC vs. PPO on the FlexibleArmEnv as plain RL algorithms (not for imitation, but learning from the L2 distance to the goal that we are using as reward). The plot suggests that PPO is simply unable to learn anything useful; SAC, on the other hand, can learn a mid-level policy (in the reward plot we want to get as close to zero as possible). NOTE: to achieve this performance SAC had to be run for 2M steps (~30 hours).
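
For reference, a minimal sketch of how such an SAC baseline run could look, assuming the RL baselines are built on stable-baselines3; the `FlexibleArmEnv` import path, the log/model paths, and the hyperparameter values are placeholders rather than the repo's actual settings:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.monitor import Monitor

from envs.flexible_arm_env import FlexibleArmEnv  # hypothetical import path

env = Monitor(FlexibleArmEnv())  # Monitor records episode returns for logging

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,        # SB3 default; a starting point before tuning
    buffer_size=1_000_000,
    batch_size=256,
    verbose=1,
    tensorboard_log="logs/sac_flexible_arm",  # hypothetical log directory
)

# ~2M steps matched the run described above (~30 h wall clock there)
model.learn(total_timesteps=2_000_000, tb_log_name="sac_2M")
model.save("models/sac_flexible_arm_2M")
```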

Takeaways:

- Change the learning algorithm of the IRL algorithms from PPO to SAC (see the sketch below)
- Let them run for > 2M steps and visualize the performance in tensorboard
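
A minimal sketch of the first takeaway, assuming the adversarial baselines come from the `imitation` package on top of stable-baselines3 (GAIL shown; AIRL takes the same `gen_algo` argument). The env import path and the expert-trajectory loader are hypothetical placeholders, and the hyperparameters are untuned starting values:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.vec_env import DummyVecEnv
from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm

from envs.flexible_arm_env import FlexibleArmEnv       # hypothetical import path

venv = DummyVecEnv([lambda: FlexibleArmEnv()])          # single-env vec wrapper
expert_trajectories = load_expert_rollouts("data/mpc_expert")  # hypothetical loader for the MPC demos

# SAC replaces PPO as the generator / learner algorithm
gen_algo = SAC("MlpPolicy", venv, learning_rate=3e-4, buffer_size=1_000_000, verbose=1)

reward_net = BasicRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)

gail_trainer = GAIL(
    demonstrations=expert_trajectories,
    demo_batch_size=512,
    venv=venv,
    gen_algo=gen_algo,          # SAC instead of PPO
    reward_net=reward_net,
)

# Let it run for > 2M steps, as suggested above, and monitor in tensorboard
gail_trainer.train(total_timesteps=2_000_000)
```

Since `gen_algo` accepts any stable-baselines3 algorithm, swapping PPO for SAC is mostly a one-argument change in the trainer construction.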

Erfi commented 1 year ago

Just an update on the RL methods: this is SAC after 2M steps with the latest observation_space (current_state, current_pos, goal_pos, wall).

sac_2M.webm
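
For readers of the thread, an illustrative sketch of how an observation with that layout could be assembled as a flat Box (all dimensions below are hypothetical, not the repo's actual ones), using gymnasium spaces:

```python
import numpy as np
from gymnasium import spaces

N_STATE = 12   # hypothetical flexible-arm state dimension
N_POS = 3      # current end-effector position
N_GOAL = 3     # goal position
N_WALL = 4     # wall description (e.g. plane normal + offset)

observation_space = spaces.Box(
    low=-np.inf,
    high=np.inf,
    shape=(N_STATE + N_POS + N_GOAL + N_WALL,),
    dtype=np.float64,
)

def make_observation(state, ee_pos, goal_pos, wall):
    """Concatenate the components in the order listed above."""
    return np.concatenate([state, ee_pos, goal_pos, wall]).astype(np.float64)
```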

shamilmamedov commented 1 year ago

It looks quite OK now. I am a bit puzzled that the SAC moves the end-effector close enough to the goal but doesn't manage to reach it. Any ideas why?

Erfi commented 1 year ago

I wonder about that too. It might get better with more training, but I have limited it to 2M steps for now. I will run a couple more experiments with a slightly higher learning rate to see if I can squeeze out better performance without increasing the number of training steps. I am speculating here, but since we are using an L2 distance as the reward, the incentive to get closer to the goal is decreasing quadratically.
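
A small numeric check of that speculation, assuming the reward is the negative (possibly squared) L2 distance from the end-effector to the goal; the exact reward used in the repo is not shown in this thread:

```python
import numpy as np

distances = np.array([1.0, 0.5, 0.25, 0.1, 0.05])

r_l2 = -distances        # plain L2: marginal gain per unit of distance is constant
r_l2_sq = -distances**2  # squared L2: marginal gain shrinks near the goal

print("d      -d       -d^2")
for d, a, b in zip(distances, r_l2, r_l2_sq):
    print(f"{d:5.2f} {a:8.3f} {b:8.3f}")

# With -d^2 the reward difference between d=0.1 and d=0.05 is only 0.0075,
# so the last few centimetres contribute very little return, which could
# explain stopping short of the goal; with plain -d the incentive stays linear.
```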

shamilmamedov commented 1 year ago

But MPC uses the same reward and manages to get to the goal...

Can you also constrain the outputs of SAC using a scaled tanh? And perhaps add a reward for smoothness of the controls, or a penalty for non-smoothness?
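
A sketch of both suggestions as a single gymnasium-style wrapper; the class name, tanh gain, and penalty weight are hypothetical, not existing repo code:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ScaledTanhSmoothActionWrapper(gym.Wrapper):
    """Squash policy outputs through a scaled tanh and penalize control jumps."""

    def __init__(self, env, action_scale, tanh_gain: float = 2.0,
                 smoothness_weight: float = 0.1):
        super().__init__(env)
        self.action_scale = np.asarray(action_scale, dtype=np.float64)
        self.tanh_gain = tanh_gain
        self.smoothness_weight = smoothness_weight
        # the agent acts in [-1, 1]; the wrapper maps that into the actuator range
        self.action_space = spaces.Box(
            low=-1.0, high=1.0, shape=self.action_scale.shape, dtype=np.float32
        )
        self._prev_u = None

    def reset(self, **kwargs):
        self._prev_u = None
        return self.env.reset(**kwargs)

    def step(self, action):
        # scaled tanh: smooth saturation to (-action_scale, action_scale)
        u = self.action_scale * np.tanh(self.tanh_gain * np.asarray(action))

        obs, reward, terminated, truncated, info = self.env.step(u)

        # penalize non-smoothness: squared change between consecutive controls
        if self._prev_u is not None:
            reward -= self.smoothness_weight * float(np.sum((u - self._prev_u) ** 2))
        self._prev_u = u

        return obs, reward, terminated, truncated, info
```

One note: stable-baselines3's SAC already squashes its policy output with a tanh rescaled to the env's action-space bounds, so tightening those Box bounds alone may give a similar constraint; the explicit wrapper just keeps the scaling and the smoothness penalty tunable in one place.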