watakandai / hiro_pytorch

Implementation of HIRO (Data-Efficient Hierarchical Reinforcement Learning)

Sampling of Subgoal #3

Markus28 opened this issue 3 years ago

Markus28 commented 3 years ago

While going through your code I came across a line that seemed odd to me: In hiro/hiro_utils.py:

def sample(self):
    return (self.high - self.low) * np.random.sample() + self.low

Although the point of the class isn't entirely clear to me, I would have expected this method to return a random vector with independent (but not identically distributed) uniform components.

However, np.random.sample() returns only a single float, so the components of the resulting vector are not independent.
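
For illustration, here's a minimal reproduction of the problem (the low/high values are just placeholders, not the repo's actual limits):

import numpy as np

low = np.array([-10.0, -10.0, -0.5])
high = np.array([10.0, 10.0, 0.5])

# np.random.sample() returns a single scalar in [0, 1), so every
# component of the result is driven by the same random number:
s = (high - low) * np.random.sample() + low

# Passing a shape draws one number per component instead, giving
# independent uniform samples in [low_i, high_i):
s_indep = (high - low) * np.random.sample(high.shape) + low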

I would have expected the method to be something like this:

def sample(self):
    return (self.high - self.low) * np.random.sample(self.high.shape) + self.low

Did I misunderstand the point of sample or is this a mistake?

peasant98 commented 3 years ago

Good catch @Markus28, I can raise a PR for it. Thanks!

watakandai commented 3 years ago

I think you are right. We should change it so that we sample independently.

Markus28 commented 3 years ago

Great :) I suppose that instead of using np.random.sample, the following would be more straightforward:

def sample(self):
    return np.random.uniform(low=self.low, high=self.high)
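
For reference, np.random.uniform broadcasts array-valued bounds, so each component is drawn independently from its own interval. A quick sanity check (the values are just examples):

import numpy as np

low = np.array([-10.0, -10.0, -0.5])
high = np.array([10.0, 10.0, 0.5])

# Broadcasting: one independent draw per dimension, repeated 1000 times.
samples = np.random.uniform(low=low, high=high, size=(1000, 3))
assert np.all(samples >= low) and np.all(samples < high)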

I have a second question which is tangentially related: the code in SubgoalActionSpace appears to pertain specifically to the Ant environments, is that correct? I am currently trying to adapt your code to work with an entirely different environment and would be grateful for some pointers on whether this is the only part of the code that requires changes. Thanks!

peasant98 commented 3 years ago

For SubgoalActionSpace, yeah, you'd have to change the dimension and limits. (Check out the line below.)

https://github.com/watakandai/hiro_pytorch/blob/master/hiro/hiro_utils.py#L109
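
As a rough sketch (not the repo's exact code), adapting it to a new environment mostly means swapping in your own dimension and per-dimension limits, e.g.:

import numpy as np

class MySubgoalActionSpace:
    """Hypothetical subgoal space for a custom environment (illustrative only)."""

    def __init__(self):
        # Per-dimension subgoal bounds; replace with ranges that cover the
        # part of the observation space the high-level policy should target.
        self.low = np.array([-5.0, -5.0, -1.0])
        self.high = np.array([5.0, 5.0, 1.0])
        self.shape = self.low.shape

    def sample(self):
        # One independent uniform draw per subgoal dimension.
        return np.random.uniform(low=self.low, high=self.high)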

peasant98 commented 3 years ago

For the environment, you'll need the .step() function to return a tuple of obs, r, done, and info, where obs is a dict with the keys 'observation', 'desired_goal', and 'achieved_goal'. This is a little different from some Gym environments.
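
If your environment doesn't already return that dict, a thin wrapper along these lines is usually enough (the names and the achieved_goal convention here are just assumptions for illustration):

import gym
import numpy as np

class GoalDictWrapper(gym.Wrapper):
    """Hypothetical wrapper exposing the goal-based observation dict."""

    def __init__(self, env, desired_goal):
        super().__init__(env)
        self.desired_goal = np.asarray(desired_goal)  # placeholder: fixed goal

    def _wrap(self, obs):
        return {
            'observation': obs,
            'desired_goal': self.desired_goal,
            # Assumption: the goal lives in the first coordinates of obs.
            'achieved_goal': obs[:self.desired_goal.shape[0]],
        }

    def reset(self, **kwargs):
        return self._wrap(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._wrap(obs), reward, done, info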

peasant98 commented 3 years ago

I've raised a small PR to allow a bit more modularity in subgoals: https://github.com/watakandai/hiro_pytorch/pull/5