
Efficient HRL: Feature / Documentation Request. Steps needed to adjust HIRO to a new environment #10430

Open sunsibar opened 2 years ago

sunsibar commented 2 years ago

1. The entire URL of the file you are using:

https://github.com/tensorflow/models/tree/master/research/efficient-hrl

2. Describe the feature you request

Similar to issue #10384, I would like to use efficient-hrl with a new environment. I am starting a list of the required steps here by describing what I have done so far and what still does not seem to work. I would be grateful for corrections and for help with the parts that are not working yet.

3. Additional context

The environment I want to add is very different from the ant environments; in particular, I need to use the environment's own reward instead of the ant-specific reward built into HIRO.

What helped me was the comment in issue #10384 that these parameters need to be adjusted in the .gin file:

This can be set to zero, since it should not be used when `plain_rewards` is used as the meta loss: `meta_context_range = ((0, 0), (0, 0))`

My changes to the .gin environment configuration:

Config rewards

[ preserved ]

Config samplers

[ preserved ]

Also preserved, but uncertain what these mean:

```
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
```


### Code changes:
- In `create_maze_env.py`, in the function `create_maze_env`, add an option that creates the new environment (a sketch of this follows below the list).
- In `agent.py`, in `cond_begin_episode_op`, ... set `meta_reward = rewards` (leave `context_reward` as is).
- In `train.py`, in `collect_experience`, change the lines after `next_reset_episode_cond = tf.logical_or(` so that they detect the proper termination criterion:
```
next_reset_episode_cond = tf.logical_or(
    tf.less(0.0, reward),  # or however early termination can be detected in your environment
    agent.reset_episode_cond_fn(
        state, action,
        transition_type, environment_steps, num_episodes))
```
  Maybe `transition_type` could be used here, too: it is usually 1, 0 at the start of an episode, and 2 at its end. But I was unsure whether `transition_type` reflects the environment's early-termination condition (a sketch of that variant also follows below the list).
- (I did not change `step_cond` - is that necessary? Why is it necessary to determine whether to increase the step count?)
- *In `samplers.py`, I commented out the following two assertions:
```
# assert spec.shape.as_list()[0] == len(context_range[0])
# assert spec.shape.as_list()[0] == len(context_range[1])
```
- *In `context.py`, I added an if/else to set `self.context_as_action_specs = tuple([ ...`:
```
if len(self._context_shapes) == 1:
  self.context_as_action_specs = tuple([
      specs.BoundedTensorSpec(
          shape=self._context_shapes[0],
          dtype=(tf.float32 if self._obs_spec.dtype in
                 [tf.float32, tf.float64] else self._obs_spec.dtype),
          minimum=self.context_ranges[0],
          maximum=self.context_ranges[-1])
  ])
else:
  # [the original statement]
```

Those marked with an asterisk are probably unnecessary; they represent my attempts to make the shapes compatible with the context ranges.
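For the first item above (the `create_maze_env.py` change), here is a minimal sketch of what the added branch could look like. The body of `create_maze_env` is paraphrased rather than copied from the repository, and `MyNewEnv` / `MyNewEnvClass` are hypothetical placeholder names:

```
# Sketch only: the surrounding code of create_maze_env() is paraphrased, and
# "MyNewEnv" / MyNewEnvClass are hypothetical placeholder names.
def create_maze_env(env_name=None, **kwargs):
  if env_name == 'MyNewEnv':
    # The new environment should expose the same gym-style
    # reset()/step()/observation_space interface that the ant environments
    # provide, since the rest of the training code interacts with the
    # environment only through that interface.
    return MyNewEnvClass()
  # ... the existing Ant* branches stay unchanged ...
```

The new name would then presumably be selected from the `.gin` configuration in the same way the ant environments are.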
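Regarding the `transition_type` question in the third item, this is a sketch of the alternative check. It assumes `transition_type` is an integer tensor that takes the value 2 on an episode's final step, as described above; whether early termination inside the environment actually shows up in it would still need to be verified:

```
# Variant of the change in collect_experience(): use the end-of-episode
# marker (value 2, per the description above) instead of thresholding the
# reward. Assumes transition_type is an integer tensor.
next_reset_episode_cond = tf.logical_or(
    tf.equal(transition_type, 2),
    agent.reset_episode_cond_fn(
        state, action,
        transition_type, environment_steps, num_episodes))
```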

### Remaining problems

4. Are you willing to contribute it? Yes

ofirnachum commented 2 years ago

It is hard to say whether and how well HIRO will work out-of-the-box on a new environment, but here are my immediate reactions to your post:

sunsibar commented 2 years ago

Thank you for your comments on what is likely to go wrong or would need to be adjusted. It seems we would especially need to address the second point if we want to make this work, and that a major modification would be necessary. That's all helpful to know.