
Efficient HRL: Feature / Documentation Request. Steps needed to adjust HIRO to a new environment #10430

Open sunsibar opened 2 years ago

sunsibar commented 2 years ago

1. The entire URL of the file you are using:

https://github.com/tensorflow/models/tree/master/research/efficient-hrl

2. Describe the feature you request

Similar to issue #10384, I would like to use efficient-hrl with a new environment. I am starting a list of the required steps here by describing what I have done so far and what still does not seem to work. I would be grateful for corrections and for help with the parts that are not working yet.

3. Additional context

The environment I want to add is very different from the ant environments; in particular, I need to use the environment's own reward instead of the ant-specific reward built into HIRO.

What helped me was the comment in issue #10384 that these parameters need to be adjusted in the .gin file:

This can be set to zero, since it should not be used when `plain_rewards` is used as the meta loss: `meta_context_range = ((0, 0), (0, 0))`

My changes to the .gin environment configuration:

Config rewards

[ preserved ]

Config samplers

[ preserved ]

Also preserved, but uncertain what these mean:

```
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
```


### Code changes:
- In `create_maze_env.py`, in the function `create_maze_env`, add an option that creates the new environment (a sketch of this follows below the list).
- In `agent.py`, in `cond_begin_episode_op`, ... set `meta_reward = rewards` (leave `context_reward` as is).
- In `train.py`, in `collect_experience`, change the lines after `next_reset_episode_cond = tf.logical_or(` so that they detect the proper termination criterion:
```
next_reset_episode_cond = tf.logical_or(
    tf.less(0.0, reward),  # or however early termination can be detected in your environment
    agent.reset_episode_cond_fn(
        state, action,
        transition_type, environment_steps, num_episodes))
```
  Maybe `transition_type` could be used here, too: it is usually 1, 0 at the start of an episode, and 2 at its end. But I was unsure whether `transition_type` reflects the environment's early-termination condition (a sketch of that variant also follows below the list).
- (I did not change `step_cond` - is that necessary? Why is it necessary to determine whether to increase the step count?)
- *In `samplers.py`, I commented out the following two assertions:
```
# assert spec.shape.as_list()[0] == len(context_range[0])
# assert spec.shape.as_list()[0] == len(context_range[1])
```
- *In `context.py`, I added an if/else to set `self.context_as_action_specs = tuple([ ...`:
```
if len(self._context_shapes) == 1:
  self.context_as_action_specs = tuple([
      specs.BoundedTensorSpec(
          shape=self._context_shapes[0],
          dtype=(tf.float32 if self._obs_spec.dtype in
                 [tf.float32, tf.float64] else self._obs_spec.dtype),
          minimum=self.context_ranges[0],
          maximum=self.context_ranges[-1])
  ])
else:
  # [the original statement]
```

Those marked with an asterisk are probably unnecessary; they represent my attempts to make the shapes compatible with the context ranges.
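For the first item above (the `create_maze_env.py` change), here is a minimal sketch of what the added branch could look like. The body of `create_maze_env` is paraphrased rather than copied from the repository, and `MyNewEnv` / `MyNewEnvClass` are hypothetical placeholder names:

```
# Sketch only: the surrounding code of create_maze_env() is paraphrased, and
# "MyNewEnv" / MyNewEnvClass are hypothetical placeholder names.
def create_maze_env(env_name=None, **kwargs):
  if env_name == 'MyNewEnv':
    # The new environment should expose the same gym-style
    # reset()/step()/observation_space interface that the ant environments
    # provide, since the rest of the training code interacts with the
    # environment only through that interface.
    return MyNewEnvClass()
  # ... the existing Ant* branches stay unchanged ...
```

The new name would then presumably be selected from the `.gin` configuration in the same way the ant environments are.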
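Regarding the `transition_type` question in the third item, this is a sketch of the alternative check. It assumes `transition_type` is an integer tensor that takes the value 2 on an episode's final step, as described above; whether early termination inside the environment actually shows up in it would still need to be verified:

```
# Variant of the change in collect_experience(): use the end-of-episode
# marker (value 2, per the description above) instead of thresholding the
# reward. Assumes transition_type is an integer tensor.
next_reset_episode_cond = tf.logical_or(
    tf.equal(transition_type, 2),
    agent.reset_episode_cond_fn(
        state, action,
        transition_type, environment_steps, num_episodes))
```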

### Remaining problems

4. Are you willing to contribute it? Yes

ofirnachum commented 2 years ago

It is hard to say whether and how well HIRO will work out-of-the-box on a new environment, but here are my immediate reactions to your post:

sunsibar commented 2 years ago

Thank you for your comments on what is likely to go wrong or would need to be adjusted. It seems we would especially need to address the second point if we want to make this work, and that a major modification would be necessary. That's all helpful to know.