opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)
https://huggingface.co/spaces/OpenDILabCommunity/ZeroPal
Apache License 2.0

Flexible Input Image Size Support for UniZero: Implementation Timeline and Contribution Opportunity #253

Closed Tiikara closed 3 months ago

Tiikara commented 3 months ago

I'm currently conducting experiments with the UniZero neural network using your library. I'm interested in the timeline for implementing flexible input image size support in UniZero, given the presence of related TODOs in the codebase. If this feature isn't planned for the near future, I'd be willing to contribute to its development.

From my understanding, adapting RepresentationNetworkUniZero, DownSample, LatentDecoder, and WorldModel would be necessary. Could you provide guidance on any other components that might need modification to achieve this functionality? Additionally, if you're open to external contributions for this feature, I'd appreciate any specific guidelines or considerations for its implementation.

Thank you for your time and continued development of this valuable library.

puyuan1996 commented 3 months ago

Hello, the current implementation of UniZero should already be compatible with different image observation shapes. You can set the observation_shape item in the configuration file to test this; if any errors occur, please report them in this issue thread and we will address them gradually.

The TODO you mentioned likely refers to the fixed output_shape=(3, 64, 64) passed when calling LatentDecoder. We will correct this in the future; however, under the default settings, which do not use the decode-related loss, it should have no impact.

We highly encourage you to further adapt the handling of observation_shape in UniZero and submit related PRs. If you have any questions, feel free to discuss them here.
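For reference, a minimal sketch of what that might look like (plain nested dicts shown here for brevity; LightZero configs actually use EasyDict, and the exact nesting of the real UniZero config files may differ):

```python
# Hypothetical config fragment: observation_shape must be set consistently
# in both the env and the model sections. Key nesting is illustrative only.
config = dict(
    env=dict(
        observation_shape=(3, 96, 96),   # (channels, height, width)
    ),
    policy=dict(
        model=dict(
            observation_shape=(3, 96, 96),  # keep in sync with the env setting
        ),
    ),
)

# The two entries must agree, or the encoder will see unexpected input sizes.
assert config["env"]["observation_shape"] == config["policy"]["model"]["observation_shape"]
```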

Tiikara commented 3 months ago

Thank you for your response. I attempted to set the observation_shape to (3, 128, 128) in both the env and model configurations. However, I encountered an error with the following stack trace:

traceback.txt

https://github.com/opendilab/LightZero/blob/fff7fdebda839c8f3b95ab46a158a8978f7b7e78/lzero/model/common.py#L259-L263

I identified the problematic lines and modified them as follows:

if self.observation_shape[1] == 64:
    output = x
elif self.observation_shape[1] == 96:
    x = self.pooling2(x)
    output = x
else:
    # Fallback for other input sizes (e.g. 128): apply the same pooling
    x = self.pooling2(x)
    output = x

This modification allowed the training to start. However, I have some questions and observations:

I appreciate your insights on these points and any guidance on how we might approach making the network more adaptable to different input sizes.

puyuan1996 commented 3 months ago

Hello. Using AvgPool2d to reduce the output feature map to a fixed size (currently 64*8*8 for a 96x96 input image) follows the model definition outlined in the MuZero paper.

However, as you rightly observed, the current implementation contains many size-specific operations and lacks generality. To make it more flexible, we could require the environment to preprocess images to a standardized size (such as 96 or 64) and then use the num_resblocks parameter in the DownSample class to regulate the network's capacity. Furthermore, at this line and this line, we could dynamically define and apply the final layer based on the specified feature-plane size.
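One generic way to remove such size-specific branches, sketched here under the assumption that only the final feature-plane size needs to be fixed (this is not the repository's actual implementation), is torch.nn.AdaptiveAvgPool2d, which always emits the requested output plane regardless of input resolution:

```python
import torch
import torch.nn as nn

# Sketch: force the downsampled feature map to a fixed 8x8 plane for any
# input resolution, replacing the hard-coded
# `if observation_shape[1] == 64 / == 96` branches with a single operation.
pool = nn.AdaptiveAvgPool2d((8, 8))

for size in (64, 96, 128):
    # e.g. a 64-channel feature map after /8 spatial downsampling
    x = torch.randn(1, 64, size // 8, size // 8)
    assert tuple(pool(x).shape) == (1, 64, 8, 8)
```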

If you are interested in optimizing the DownSample implementation in this manner and submitting a pull request, it would be greatly appreciated. You can start your development from this PR: https://github.com/opendilab/LightZero/pull/254.

Tiikara commented 3 months ago

Thank you for your detailed response. I'd like to propose a potential enhancement to the network architecture, focusing on the flexibility of channel counts in convolutional layers. While the current channel configuration may suffice for simpler environments like Pong, more complex scenarios might benefit from a more adaptable approach.

I suggest implementing configurable channel counts for each convolutional layer in the ResBlocks. This could potentially allow the network to capture more intricate structures in the input. My proposed structure would look something like this:

  1. ResBlock (out_channels=configurable)
  2. AvgPool2d
  3. ResBlock (out_channels=increased channel count)
  4. AvgPool2d
     … (repeat steps 3–4 as necessary)
  5. Final ResBlock (out_channels=final channel count)
  6. AvgPool2d
  7. Output: 1 × 1 × final channel count

Following this, the RepresentationNetworkUniZero could process this output:

  1. ResBlock (out_channels=reduced channel count) // 1x1 conv
  2. ResBlock (out_channels=further reduced channel count) // 1x1 conv
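The proposed stack could be sketched roughly as follows (all channel counts, the block definition, and the final pooling step are illustrative assumptions of mine, not the library's API):

```python
import torch
import torch.nn as nn

class SimpleResBlock(nn.Module):
    """Minimal residual block with a 1x1 projection when channels change."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return torch.relu(self.conv(x) + self.skip(x))

def make_downsample(channels=(32, 64, 128)):
    # ResBlock -> AvgPool2d repeated, with a configurable channel count per stage
    layers, in_ch = [], 3
    for out_ch in channels:
        layers += [SimpleResBlock(in_ch, out_ch), nn.AvgPool2d(2)]
        in_ch = out_ch
    layers.append(nn.AdaptiveAvgPool2d(1))  # collapse to 1 x 1 x final channels
    return nn.Sequential(*layers)

net = make_downsample()
out = net(torch.randn(1, 3, 96, 96))
print(out.shape)  # torch.Size([1, 128, 1, 1])
```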

If I understand correctly, the primary goal of RepresentationNetworkUniZero is to transform the image into a state vector for use in the transformer.

I have a few questions regarding this approach:

  1. Is my understanding of the network's purpose and structure accurate?
  2. Have you conducted experiments with similar architectures or varying channel counts?
  3. How is backpropagation handled across these networks? Is it performed simultaneously for all networks?

I appreciate your insights on these points and any feedback on the proposed modifications.

puyuan1996 commented 3 months ago
  1. Yes, the representation network currently encodes the raw image into a high-level latent state that is highly relevant to decision-making semantics.
  2. We have not previously conducted experiments on similar transformations of the representation network structure, but we believe that the structure of the representation network is crucial for final performance and stability. We welcome you to conduct more experiments and analyses in this area.
  3. Yes, the representation network is jointly trained using formula (3) from the UniZero paper.
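On question 3, "jointly trained" in PyTorch generally means summing the individual loss terms into one scalar and calling backward once, so gradients flow through all networks simultaneously. A schematic with stand-in losses (these are not UniZero's actual terms; the real objective is formula (3) of the paper):

```python
import torch

# Schematic only: stand-in losses on a shared prediction tensor.
pred = torch.randn(4, 2, requires_grad=True)
loss_a = pred.pow(2).mean()        # stand-in for one loss term
loss_b = (pred - 1).abs().mean()   # stand-in for another
total = loss_a + 0.5 * loss_b      # weighted sum builds a single graph
total.backward()                   # one backward pass covers all modules

assert pred.grad is not None       # gradients from both terms accumulated
```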

Thank you for your attention.