Hello, the current implementation of UniZero should be compatible with different image observation shapes. You can specify the observation_shape item in the configuration file for testing (see the sketch below). If any errors occur, please report them in this issue and we will address them gradually.
The TODO you mentioned likely refers to the fixed output_shape=(3, 64, 64) used when calling LatentDecoder. We will correct this in the future; however, under the default settings, which do not use the decode-related loss, it should have no impact. We highly encourage you to further adapt the handling of observation_shape in UniZero and submit related PRs. If you have any questions, feel free to discuss them.
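As a minimal sketch of where this setting lives (the key names follow the usual LightZero config layout, but the exact structure may differ between config files, so treat this as illustrative rather than definitive):

from easydict import EasyDict

# Illustrative excerpt of a UniZero-style config: the environment section and the
# model section should agree on the new observation shape.
main_config = EasyDict(dict(
    env=dict(
        observation_shape=(3, 96, 96),  # (channels, height, width) fed to the encoder
        # ... other env settings ...
    ),
    policy=dict(
        model=dict(
            observation_shape=(3, 96, 96),  # must match the env setting above
            # ... other model settings ...
        ),
        # ... other policy settings ...
    ),
))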
Thank you for your response. I attempted to set the observation_shape to (3, 128, 128) in both the env and model configurations. However, I encountered an error with the following stack trace:
I identified the problematic lines and modified them as follows:
# Tail of DownSample.forward: 64x64 inputs already have the right feature-map size,
# while 96x96 inputs get one extra average pooling. The else branch is the modification,
# so that other resolutions (e.g. 128x128) also go through the same pooling path.
if self.observation_shape[1] == 64:
    output = x
elif self.observation_shape[1] == 96:
    x = self.pooling2(x)
    output = x
else:
    x = self.pooling2(x)
    output = x
This modification allowed the training to start. However, I have some questions and observations:
It appears that the number of layers in the DownSample network is currently hardcoded. To properly extract features from higher-resolution images, wouldn't it be more effective to calculate the number of layers dynamically based on the input size?
I'm also curious about the use of AvgPool2d when self.observation_shape[1] == 96 (presumably for 96x96 images). Is this primarily for implementation simplicity and speed, or is there a specific reason for using average pooling at the output layer? Could this be handled by resizing the image in the environment instead?
If the pooling is indeed for optimization, we could potentially alternate convolutional layers with AvgPool2d to achieve the desired output size. However, it's unclear what the target output size should be. The current implementation seems quite rigid in this respect.
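For instance, one way to remove the hardcoded size branches would be to derive the number of stride-2 pooling stages from the input resolution and a chosen target feature-map size. The helper below is purely illustrative (not part of LightZero), just to make the idea concrete:

import math
import torch.nn as nn

def pooling_stages(in_size: int, target_size: int) -> int:
    # Number of halving (stride-2) stages needed to shrink in_size down to at most target_size.
    stages, size = 0, in_size
    while size > target_size:
        size = math.ceil(size / 2)
        stages += 1
    return stages

def build_tail(in_size: int, target_size: int = 8) -> nn.Sequential:
    # One AvgPool2d per required halving stage; each stage roughly halves the spatial size.
    return nn.Sequential(*[
        nn.AvgPool2d(kernel_size=3, stride=2, padding=1)
        for _ in range(pooling_stages(in_size, target_size))
    ])

# Usage: if the convolutional trunk leaves a 16x16 feature map and the target is 8x8,
# build_tail(16, 8) adds exactly one pooling stage; build_tail(32, 8) adds two.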
What are your thoughts on making the network more flexible to accommodate various input sizes? What should the desired output shape be after applying the DownSample network?
I appreciate your insights on these points and any guidance on how we might approach making the network more adaptable to different input sizes.
Hello. Regarding the use of AvgPool2d to reduce the output feature map to a fixed size, currently 64*8*8 for a 96x96 input image: this follows the model definition outlined in the MuZero paper.
However, as you rightly observed, the current implementation contains numerous size-specific operations and lacks generality. To enhance the flexibility of the implementation, we can require the environment to preprocess images to a standardized size (such as 96 or 64). Subsequently, we can leverage the num_resblocks parameter in the DownSample class to regulate the network's capacity. Furthermore, at this line and this line, we can dynamically define and utilize the final layer based on the specified feature-plane size.
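A minimal sketch of that last point, assuming the target feature-plane size is 8x8 (nn.AdaptiveAvgPool2d is a standard PyTorch layer; this is not the library's current code):

import torch.nn as nn

# Instead of a hardcoded self.pooling2 for the 96x96 case, the final layer could be
# defined once from the desired feature-plane size, independent of the input resolution.
target_hw = (8, 8)  # illustrative target spatial size for the encoder output
final_pool = nn.AdaptiveAvgPool2d(target_hw)

# The size-specific branches in DownSample.forward would then collapse to a single call:
#     output = final_pool(x)   # works for 64x64, 96x96, 128x128, ... inputs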
If you are interested in optimizing the DownSample implementation in this manner and submitting a pull request, it would be greatly appreciated. You can start your development from this PR: https://github.com/opendilab/LightZero/pull/254.
Thank you for your detailed response. I'd like to propose a potential enhancement to the network architecture, focusing on the flexibility of channel counts in convolutional layers. While the current channel configuration may suffice for simpler environments like Pong, more complex scenarios might benefit from a more adaptable approach.
I suggest implementing configurable channel counts for each convolutional layer in the ResBlocks. This could potentially allow the network to capture more intricate structures in the input. My proposed structure would look something like this:
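A rough, purely illustrative sketch of such a configurable-channel downsampling stack (channel numbers and parameter names are hypothetical, ResBlocks are replaced by plain conv stages for brevity, and none of this is LightZero's existing API):

import torch.nn as nn

class ConfigurableDownSample(nn.Module):
    # Hypothetical variant of DownSample where every stage's channel count is configurable.

    def __init__(self, in_channels: int = 3, stage_channels=(32, 64, 128), target_hw=(8, 8)):
        super().__init__()
        layers = []
        prev = in_channels
        for ch in stage_channels:
            # Each stage halves the spatial resolution and maps prev -> ch channels.
            layers += [
                nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
            ]
            prev = ch
        layers.append(nn.AdaptiveAvgPool2d(target_hw))  # fixed output size regardless of input resolution
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)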
Following this, the RepresentationNetworkUniZero could process this output:
If I understand correctly, the primary goal of RepresentationNetworkUniZero is to transform the image into a state vector for use in the transformer.
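Conceptually, that amounts to something like the following simplified sketch (the flatten-and-project step and the embed_dim value are illustrative assumptions, not the exact UniZero code):

import torch
import torch.nn as nn

class ImageToLatentSketch(nn.Module):
    # Conceptual sketch: image -> downsampled feature map -> flat latent vector for the transformer.

    def __init__(self, downsample: nn.Module, feature_dim: int = 64 * 8 * 8, embed_dim: int = 768):
        super().__init__()
        self.downsample = downsample                        # e.g. a DownSample-style stack
        self.to_latent = nn.Linear(feature_dim, embed_dim)  # project to the transformer token size

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        x = self.downsample(obs)              # (B, C, H, W) with a fixed (C, H, W)
        x = torch.flatten(x, start_dim=1)     # (B, C*H*W)
        return self.to_latent(x)              # (B, embed_dim): one latent token per observation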
I have a few questions regarding this approach:
I appreciate your insights on these points and any feedback on the proposed modifications.
Thank you for your attention.
I'm currently conducting experiments with the UniZero neural network using your library. I'm interested in the timeline for implementing flexible input image size support in UniZero, given the presence of related TODOs in the codebase. If this feature isn't planned for the near future, I'd be willing to contribute to its development.
From my understanding, adapting RepresentationNetworkUniZero, DownSample, LatentDecoder, and WorldModel would be necessary. Could you provide guidance on any other components that might need modification to achieve this functionality? Additionally, if you're open to external contributions for this feature, I'd appreciate any specific guidelines or considerations for its implementation.
Thank you for your time and continued development of this valuable library.