Using HIDIO with image-shaped input

I'm trying to use an environment with HIDIO where the state is an image, or at least image-shaped. It is not obvious how to combine the image with other inputs, such as the action. The best approach seems to be to process the state with a convolutional network first, and only then combine the result with the action as input to another neural network.

The question may be more generally about the ALF framework.

Is there a piece of documentation explaining how to chain inputs and networks in such a way?

Thanks for any answer!

Hi @sunsibar, this is a good question. For this old version of ALF, which uses gin for configuration, you can specify the input_preprocessors field of your ActorDistributionNetwork or CriticNetwork as follows:

# Configure the CNN encoding network
cnn/EncodingNetwork.input_tensor_spec=...
cnn/EncodingNetwork.conv_layer_params=...

actor/ActorDistributionNetwork.input_tensor_spec=[%observation_spec, @goal/TensorSpec()]
actor/ActorDistributionNetwork.input_preprocessors=[@cnn/EncodingNetwork(), None]
actor/ActorDistributionNetwork.preprocessing_combiner=@NestConcat()

Each input preprocessor can be either an ALF Network or a torch.nn.Module. For details, please see the definitions of ActorDistributionNetwork and CriticNetwork.
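To make the pattern concrete, here is a minimal framework-free sketch of "one preprocessor per input, then a combiner". It is not ALF's implementation: `encode_image`, `nest_concat`, and `preprocess_and_combine` are hypothetical names, and the "CNN" is replaced by a per-channel mean so the example runs with the standard library only. A `None` preprocessor passes its input through unchanged, mirroring the `[@cnn/EncodingNetwork(), None]` entry in the gin snippet.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def encode_image(image: Sequence[Sequence[Sequence[float]]]) -> Vector:
    """Hypothetical stand-in for a CNN encoder: one mean value per channel."""
    channels = len(image[0][0])
    pixels = [pixel for row in image for pixel in row]
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(channels)]


def nest_concat(parts: Sequence[Vector]) -> Vector:
    """Concatenate already-flat vectors into a single flat vector."""
    return [value for part in parts for value in part]


def preprocess_and_combine(
        inputs: Sequence,
        preprocessors: Sequence[Optional[Callable]],
        combiner: Callable[[Sequence[Vector]], Vector]) -> Vector:
    """Apply one preprocessor per input (None means identity), then combine."""
    if len(inputs) != len(preprocessors):
        raise ValueError("need exactly one preprocessor slot per input")
    processed = [
        x if pre is None else pre(x) for x, pre in zip(inputs, preprocessors)
    ]
    return combiner(processed)


# A 2x2 image with 3 channels, plus a 2-dimensional goal vector.
image = [[[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
         [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]]
goal = [0.5, -0.5]
features = preprocess_and_combine([image, goal], [encode_image, None],
                                  nest_concat)
```

The combined vector is what a downstream MLP would consume; its length is the encoder's output size plus the goal dimension.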

A more complex example is to fuse different types of sensors:

https://github.com/HorizonRobotics/alf/blob/pytorch/alf/examples/carla.gin

FYI: ALF does have a tutorial and API documentation, but unfortunately they cover only the latest version, not this older one.

https://alf.readthedocs.io/en/latest/

That looks like it solves the problem perfectly, thank you very much!

Edit: it doesn't seem to be so easy.

What needs to be replaced is this (among other things):

low_rl_input_specs = @get_low_rl_input_spec(
    observation_spec=%observation_spec,
    action_spec=%action_spec,
    num_steps_per_skill=%num_steps_per_skill,
    skill_spec=%skill_spec)
low_input_preprocessors = @get_low_rl_input_preprocessors(
    low_rl_input_specs=%low_rl_input_specs,
    embedding_dim=%low_hidden_dim)

[...]

low/ActorDistributionNetwork.input_tensor_spec=%low_rl_input_specs
low/ActorDistributionNetwork.input_preprocessors=%low_input_preprocessors
low/ActorDistributionNetwork.preprocessing_combiner=@NestSum(activation=@torch.relu_)

I cannot just keep %observation_spec in the second line above, since get_low_rl_input_spec asserts that the spec is one-dimensional, and then multiplies and combines action_spec and observation_spec into a trajectory.

Pre-processing the observations seems to clash with combining action and observation into a trajectory.

It seems to me that what's needed is to either:

1. Keep the action and observation spaces "raw", i.e. not trajectory-like (though I'm not sure where the algorithm would then get the previous actions/observations from), apply preprocessing, and insert another preprocessing step somewhere that combines the preprocessed observations and actions into a trajectory.
2. Or start with the existing trajectory, but write an entirely new, complex preprocessing function that splits up the trajectory, applies one preprocessing network to all the observation steps, and then puts the result back together with the actions into a 'trajectory'.

Is option 1 a possibility? Option 2 definitely seems too complex for us right now.
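For reference, the second option (split the trajectory, encode each observation step, reassemble) can be sketched as follows. This is only an illustration under a loudly stated assumption: the flat trajectory is laid out as all observations followed by all actions, which may not match what get_low_rl_input_spec actually produces. The function name `reencode_trajectory` and the toy encoder are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def reencode_trajectory(flat: Vector, num_steps: int, obs_dim: int,
                        act_dim: int,
                        encoder: Callable[[Vector], Vector]) -> Vector:
    """Encode every observation step of a flat trajectory, keep actions as-is.

    ASSUMED layout (not taken from HIDIO):
    [obs_0, ..., obs_{T-1}, act_0, ..., act_{T-1}], each block flattened.
    """
    expected = num_steps * (obs_dim + act_dim)
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    obs_block = flat[:num_steps * obs_dim]
    act_block = flat[num_steps * obs_dim:]
    encoded: Vector = []
    for t in range(num_steps):
        # The same encoder is shared across all observation steps.
        encoded.extend(encoder(obs_block[t * obs_dim:(t + 1) * obs_dim]))
    return encoded + act_block


def mean_encoder(observation: Vector) -> Vector:
    """Toy stand-in for an image encoder: one number per observation."""
    return [sum(observation) / len(observation)]


# Two steps, 4-dimensional observations, 1-dimensional actions.
trajectory = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 0.1, 0.2]
compact = reencode_trajectory(trajectory, num_steps=2, obs_dim=4, act_dim=1,
                              encoder=mean_encoder)
```

The output keeps the trajectory shape (one encoded observation and one action per step) while shrinking the observation part from obs_dim to the encoder's output size.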

To try to make more sense of this: Where is the code that "populates" the input tensors to the preprocessors? What decides what is input to the low/ActorDistributionNetwork?

Hi @sunsibar, you are right; this issue is actually more complicated. With image inputs (or other high-dimensional inputs), a preprocessing step is needed to project the image to a very low-dimensional encoding. Otherwise the skill discriminator might generate useless intrinsic rewards, because without additional regularization of the classification loss its neural network can easily be trained to classify different (observation, action) combinations. This is also true for the original DIAYN method.

For the problems we consider in the HIDIO paper, we didn't explore this more complicated case; as you've already seen in the code, we explicitly assert observation_spec.ndim == 1 and action_spec.ndim == 1 in get_low_rl_input_spec(). We believe that supporting high-dimensional inputs is an interesting direction, but it might require non-trivial effort (novel techniques and new experiments).

From an implementation perspective, it is indeed possible to write such code with the current HIDIO codebase. However, it can't be done by changing the gin file alone; HIDIO's source code has to be modified. (The solution I gave earlier was for a simple actor-critic pipeline that only requires changing the input specs and preprocessors of the actor and critic networks. But HIDIO is more complex than that.)

There is a legacy option in HierarchicalAgent called observation_transformer. It was meant for transforming/preprocessing observations but ended up unused in HIDIO. I'd suggest starting by taking a look at that. (Note that it might not work smoothly in its current state.) If you decide to delve into HIDIO's code and make the modification for your case, I'll be happy to answer further questions regarding the code.
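As a rough illustration of what an observation transformer does conceptually, the sketch below rewrites only the observation field of a time step and leaves everything else untouched. It does not use ALF: `TimeStep` here is a simplified hypothetical namedtuple, and `transform_field` and `scale_pixels` are made-up names for this example.

```python
from typing import Callable, List, NamedTuple


class TimeStep(NamedTuple):
    """Hypothetical, heavily simplified stand-in for a time step structure."""
    step_type: int
    observation: List[float]
    reward: float


def transform_field(nest: TimeStep, field: str, func: Callable) -> TimeStep:
    """Return a copy of `nest` with `func` applied to one named field."""
    return nest._replace(**{field: func(getattr(nest, field))})


def scale_pixels(observation: List[float]) -> List[float]:
    """Example transformer: map raw pixel values in [0, 255] to [0, 1]."""
    return [value / 255.0 for value in observation]


step = TimeStep(step_type=0, observation=[0.0, 127.5, 255.0], reward=1.0)
transformed = transform_field(step, "observation", scale_pixels)
```

A learned encoder could take the place of `scale_pixels`, in which case every downstream component would see the encoded observation instead of the raw image.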

To answer your questions:

To try to make more sense of this: Where is the code that "populates" the input tensors to the preprocessors?

ALF defines networks that accept arbitrary nested inputs (a nest can be a list, tuple, namedtuple, or dict). The corresponding input preprocessors first process each input separately (because the inputs might have different shapes and types), and a preprocessing_combiner then merges the results into a flattened vector before it goes through an MLP. See https://github.com/jesbu1/alf/blob/def59fe39bdbca70a6c80e9b8f2c7c785cb59ea7/alf/networks/preprocessor_networks.py

Both ActorDistributionNetwork and CriticNetwork are wrappers around EncodingNetwork, which is a subclass of PreprocessorNetwork.

What decides what is input to the low/ActorDistributionNetwork?

Take rollout_step() of HierarchicalAgent as an example:


def rollout_step(self, time_step: TimeStep, state: AgentState):
        """Rollout for one step."""
        new_state = AgentState()
        info = AgentInfo()

        time_step = transform_nest(time_step, "observation",
                                   self._observation_transformer)

        subtrajectory = self._skill_generator.update_disc_subtrajectory(
            time_step, state.skill_generator)

        skill_step = self._skill_generator.rollout_step(
            time_step, state.skill_generator)
        new_state = new_state._replace(skill_generator=skill_step.state)
        info = info._replace(skill_generator=skill_step.info)

        observation = self._make_low_level_observation(
            subtrajectory, skill_step.output, skill_step.info.switch_skill,
            skill_step.state.steps,
            skill_step.state.discriminator.first_observation)

        rl_step = self._rl_algorithm.rollout_step(
            time_step._replace(observation=observation), state.rl)
        new_state = new_state._replace(rl=rl_step.state)
        info = info._replace(rl=rl_step.info)

        skill_discount = ((
            (skill_step.state.steps == 1)
            & (time_step.step_type != StepType.LAST)).to(torch.float32) *
                          (1 - self._skill_boundary_discount))
        info = info._replace(skill_discount=1 - skill_discount)

        return AlgStep(output=rl_step.output, state=new_state, info=info)

Here self._rl_algorithm.rollout_step receives the assembled low-level observation. low/ActorDistributionNetwork is part of the lower-level RL algorithm self._rl_algorithm, and it is called inside self._rl_algorithm.rollout_step. The logic is similar for predict_step and train_step.

Oh, I didn't even see that problem: in a high-dimensional setting, distinguishing actions becomes 'too easy'. Maybe this could be offset by using only the sequence of actions instead of state-action pairs (though this isn't an option so far), or by using a weak discriminator. But I see how it would probably be best to train a separate encoder and then apply the intrinsic loss based on a low-dimensional encoding.

Our "image" is only 7x7x3 (gym-minigrid, smallest environment), so we're currently using it as a 147-dimensional feature vector. But this is probably still way too many dimensions.

One other problem I see is that the low level discriminator could actually make inferences about the option from the starting state, if the high-level policy is consistent in choosing options. In general it might be better if the discriminator does not get the same input as the high-level policy network. An option could be to down-sample the image for the discriminator. But it seems to work nonetheless for some environments...
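To put numbers on the down-sampling idea: a 7x7x3 image flattens to 147 values, and average pooling with a 2x2 window (allowing partial windows at the border) reduces it to 4x4x3 = 48 values. The sketch below is only an illustration of that arithmetic; `average_pool` and `flatten` are hypothetical helper names, and whether this would actually help the discriminator is untested.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels


def flatten(image: Image) -> List[float]:
    """Flatten an HxWxC nested list into a single feature vector."""
    return [value for row in image for pixel in row for value in pixel]


def average_pool(image: Image, window: int = 2) -> Image:
    """Average-pool with stride == window; border windows may be smaller."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    pooled: Image = []
    for top in range(0, height, window):
        out_row = []
        for left in range(0, width, window):
            cells = [
                image[r][c]
                for r in range(top, min(top + window, height))
                for c in range(left, min(left + window, width))
            ]
            out_row.append([
                sum(cell[k] for cell in cells) / len(cells)
                for k in range(channels)
            ])
        pooled.append(out_row)
    return pooled


# A constant 7x7x3 "image", the size mentioned above.
image = [[[1.0, 1.0, 1.0] for _ in range(7)] for _ in range(7)]
small = average_pool(image, window=2)
full_dim = len(flatten(image))    # 7 * 7 * 3
small_dim = len(flatten(small))   # 4 * 4 * 3
```

The high-level policy could still receive the full 147-dimensional vector while the discriminator sees only the pooled version.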

Thank you very much for your answer, for the comments on problems related to image input, and for the pointers to where the inputs are inserted. I'm not sure whether we will be able to modify HIDIO further, but if so this will be very helpful. Thank you!

jesbu1 / hidio
