Open sunsibar opened 2 years ago
It is hard to say whether and how well HIRO will work out-of-the-box on a new environment, but here are my immediate reactions to your post:
Thank you for your comments on what will likely go wrong / would need to be adjusted. It seems we would need to address the second point, especially, if we want to try to make this work, and that a major modification would be necessary. That's all helpful to know.
1. The entire URL of the file you are using:
https://github.com/tensorflow/models/tree/master/research/efficient-hrl
2. Describe the feature you request
Similar to issue #10384 , I would like to use efficient-hrl for a new environment. I would like to start a list of steps to take here, by listing what I did so far and what still does not seem to function. I would be grateful for corrections and help with the parts that do not work yet.
3. Additional context
The environment I want to add is very different from the ant environments; in particular, I will need to use the environment reward instead of the ant-specific reward built into HIRO.
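To make this concrete, a pass-through reward of this kind could look roughly like the following. This is only a sketch: the exact signature of the reward functions in the repo may differ, and `plain_rewards` here is my simplified stand-in.

```python
import numpy as np

def plain_rewards(states, actions, rewards, next_states, contexts):
    """Hypothetical pass-through reward: hand the environment reward to
    the meta-policy unchanged, instead of computing the ant-specific
    negative-distance-to-goal reward."""
    del states, actions, next_states, contexts  # unused here
    return rewards

# Toy batch of environment rewards for three transitions.
env_rewards = np.array([0.0, 1.0, -0.5])
meta_rewards = plain_rewards(None, None, env_rewards, None, None)
```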
What helped me was the comment in issue #10384 that this parameter needs to be adjusted in the .gin file:
meta/Context.reward_fn = @task/plain_rewards
Meanwhile, the meta context range can be set to zero, since it should not be used when plain_rewards provides the meta reward:
meta_context_range = ((0, 0), (0, 0))
My changes to the .gin environment configuration:
create_maze_env.env_name = ...
context_range = (0, 256)
meta_context_range = (0, 0)
(I had to drop some assertions in the code to make context_range, context_shape, and meta_context_range compatible with each other. Why does meta_context_range consist of two tuples originally?)
SUBGOAL_DIM = <size of 1-dimensional state space>
RESET_EPISODE_PERIOD = <maximum duration of an episode in new env>
meta/Context.reward_fn = @task/plain_rewards
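On the two-tuples question: in the original ant configs the meta context is two-dimensional, and I believe a range given as two tuples is read as (per-dimension lows, per-dimension highs). A quick numpy sketch of sampling under that assumption (this is my interpretation, not code from the repo):

```python
import numpy as np

# Assumption: ((low_x, low_y), (high_x, high_y)) gives per-dimension bounds.
meta_context_range = ((0, 0), (16, 16))
low = np.asarray(meta_context_range[0], dtype=np.float32)
high = np.asarray(meta_context_range[1], dtype=np.float32)

rng = np.random.default_rng(0)
goal = rng.uniform(low, high)  # one 2-D meta-goal inside the box
```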
Config rewards
[ preserved ]
Config samplers
[ preserved ]
Also preserved, but uncertain what these mean:
eval1/ConstantSampler.value = [16, 0]
eval2/ConstantSampler.value = [16, 16]
eval3/ConstantSampler.value = [0, 16]
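For what it's worth, these look like fixed evaluation goals (presumably three corners of the 16x16 ant maze). A simplified, hypothetical stand-in for what a constant sampler does (the real class in the repo surely takes more arguments):

```python
import numpy as np

class ConstantSampler:
    """Simplified sketch: always emits the same fixed context value."""

    def __init__(self, value):
        self.value = np.asarray(value, dtype=np.float32)

    def sample(self, batch_size=1):
        # Repeat the fixed goal for every element of the batch.
        return np.tile(self.value, (batch_size, 1))

eval1 = ConstantSampler(value=[16, 0])
goals = eval1.sample(batch_size=3)
```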
In eval_utils.py, in compute_average_rewards: replace the success condition, for example:
success = (s_reward > 0)
The items marked with an asterisk are probably unnecessary; they represent my attempts to make shapes compatible with ranges.
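The success-condition change above can be sketched like this (illustrative only; compute_average_rewards in the repo operates on full evaluation rollouts, and `success_rate` is my hypothetical helper name):

```python
import numpy as np

def success_rate(final_rewards):
    """Illustrative: count an episode as a success when its reward is
    positive, replacing the ant-specific distance-to-goal check."""
    final_rewards = np.asarray(final_rewards)
    success = final_rewards > 0  # the replaced condition
    return success.mean()

# Two of four episodes end with positive reward.
rate = success_rate([1.0, -0.2, 3.5, 0.0])
```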
Remaining problems
cond.py seems to lose its validity.

4. Are you willing to contribute it? Yes