Hierarchical Actor Critic is ONLY for continuous tasks (Hier-Q, as described in the paper, is for discrete tasks, but is NOT implemented in this repo).
Coming back to your question about defining the goal:
The observation of the Pendulum task in the official gym implementation consists of the variables [cosine theta, sine theta, angular velocity]. This state is difficult for the higher-level policies to predict as a goal, since they would also have to learn the relation between sine and cosine to propose a valid goal state. So I have modified the state space to include only [normalized theta, angular velocity]. This gives reasonable performance, although it's not consistent. (The modified file is available in the gym folder.)
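To make the change concrete, it is roughly equivalent to wrapping the standard environment so that the [cos, sin] encoding is collapsed back into a single normalized angle. The wrapper below is only an illustrative sketch of that idea, not the repo's actual modified file (the class name and bounds layout are assumptions chosen to match the hyperparameters further down):

import gym
import numpy as np

class PendulumAngleObs(gym.ObservationWrapper):
    # Sketch only: converts Pendulum-v0 observations from
    # [cos(theta), sin(theta), angular velocity] to [normalized theta, angular velocity].
    def __init__(self, env):
        super().__init__(env)
        high = np.array([np.pi, 8.0], dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-high, high=high, dtype=np.float32)

    def observation(self, obs):
        cos_theta, sin_theta, theta_dot = obs
        theta = np.arctan2(sin_theta, cos_theta)  # angle normalized to [-pi, pi]
        return np.array([theta, theta_dot], dtype=np.float32)

With this two-dimensional view of the state, a goal is simply a target [theta, angular velocity] pair, e.g. [0, 0] for the upright, stationary pendulum.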
I tried with the following hyperparameters:
import gym
import numpy as np
import torch

# device for the tensors created below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#################### Hyperparameters ####################
env_name = "Pendulum-v0"
save_episode = 10 # keep saving every n episodes
max_episodes = 1000 # max num of training episodes
random_seed = 0
render = True
env = gym.make(env_name)
state_dim = 2
action_dim = env.action_space.shape[0]
# primitive action bounds and offset
action_bounds = env.action_space.high[0]
action_offset = np.array([0.0])
action_offset = torch.FloatTensor(action_offset.reshape(1, -1)).to(device)
action_clip_low = np.array([-1.0 * action_bounds])
action_clip_high = np.array([action_bounds])
# state bounds and offset
state_bounds_np = np.array([np.pi, 8.0])
state_bounds = torch.FloatTensor(state_bounds_np.reshape(1, -1)).to(device)
state_offset = np.array([0.0, 0.0])
state_offset = torch.FloatTensor(state_offset.reshape(1, -1)).to(device)
state_clip_low = np.array([-np.pi, -8.0])
state_clip_high = np.array([np.pi, 8.0])
# exploration noise std for primitive action and subgoals
exploration_action_noise = np.array([0.1])
exploration_state_noise = np.array([np.deg2rad(10), 0.4])
goal_state = np.array([0, 0]) # final goal state to be achieved (upright and stationary)
threshold = np.array([np.deg2rad(10), 0.05]) # threshold value to check if goal state is achieved
# HAC parameters:
k_level = 2 # num of levels in hierarchy
H = 20 # time horizon to achieve subgoal
lamda = 0.3 # subgoal testing parameter
# DDPG parameters:
gamma = 0.95 # discount factor for future rewards
n_iter = 100 # update policy n_iter times in one DDPG update
batch_size = 100 # num of transitions sampled from replay buffer
lr = 0.001
# save trained models
directory = "./"
filename = "HAC_{}".format(env_name)
#########################################################
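For reference, goal_state and threshold above are what turn the continuous pendulum problem into a goal-reaching one: a (sub)goal counts as achieved when every state dimension lies within its per-dimension threshold, and subgoals proposed by the higher level are scaled into the clipped state range via the bounds and offset. Below is a minimal sketch of both ideas, using the arrays defined above; the helper names are illustrative, not the repo's exact API:

def goal_achieved(state, goal, threshold):
    # achieved only when every state dimension is within its threshold of the goal
    return bool(np.all(np.abs(state - goal) <= threshold))

def scale_subgoal(raw_output, bounds, offset, clip_low, clip_high):
    # map a tanh-bounded [-1, 1] policy output into the environment's state range, then clip
    return np.clip(raw_output * bounds + offset, clip_low, clip_high)

# examples with the values above
print(goal_achieved(np.array([np.deg2rad(5), 0.02]), goal_state, threshold))   # True: ~5 deg from upright, nearly still
print(goal_achieved(np.array([np.deg2rad(45), 0.50]), goal_state, threshold))  # False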
Thanks, this is exactly what I was looking for.
I'm just curious: how did you define a goal for the pendulum task, given that it is a continuous task by nature?