opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)
https://huggingface.co/spaces/OpenDILabCommunity/ZeroPal
Apache License 2.0

Discussion: Requesting assistance and guidance with implementation of RL algorithms and models in the context of Tetris #265

Open lunathanael opened 1 month ago

lunathanael commented 1 month ago

Hello! I'm trying to apply the models and algorithms in this library to Tetris, specifically multiplayer Tetris, where players compete to clear lines efficiently and send as many lines as possible to their opponent. For now, I am developing a simple bot that goes beyond placing tetrominoes randomly.

Here's what I have so far:

- An environment modeled after the atari and game_2048 environments, which models can interact with and train on successfully.
- A modified reward system that incentivizes placing more blocks, with extra emphasis on any lines cleared (a rough sketch of the idea is below).
- A config file for EfficientZero, with over 48 single-GPU hours of training.
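
Below is a minimal sketch of the reward shaping I have in mind; the weights and the helper name are placeholders for illustration, not my actual environment code.

```python
# Hypothetical reward shaping for the Tetris environment described above.
# The constants and the helper name are illustrative placeholders.
PLACEMENT_REWARD = 0.1   # small bonus for every tetromino successfully placed
LINE_CLEAR_REWARD = 1.0  # larger per-line bonus, to emphasize clears

def shaped_reward(piece_placed: bool, lines_cleared: int) -> float:
    """Combine a per-placement bonus with a per-line-clear bonus."""
    reward = PLACEMENT_REWARD if piece_placed else 0.0
    reward += LINE_CLEAR_REWARD * lines_cleared
    return reward
```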

Here's some context on the environment being trained on: the observation is a 10-column x 8-row one-hot encoded board, stacked with additional one-hot encoded information such as the current piece, the pieces in the queue, and the held piece, for a total input size of 144. Each move is encoded by the coordinates of the placement, the type of piece placed, and its rotation, giving a one-hot encoded action space of size 2560. The model currently uses the mlp model type. It is worth noting that even after many training iterations, the console still warns that a lot of illegal moves are being attempted, despite the action mask being provided for the varied action space. It seems the model may not be learning the legal actions correctly.
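
To clarify the encoding, here is a minimal sketch of how the observation and action mask are packed, assuming the {'observation': ..., 'action_mask': ..., 'to_play': ...} dict format used by the other LightZero environments; the constants and the helper are placeholders, not my actual code.

```python
import numpy as np

BOARD_CELLS = 10 * 8                 # one-hot encoded 10x8 board
EXTRA_FEATURES = 144 - BOARD_CELLS   # current piece, queue, held piece, etc.
ACTION_SPACE_SIZE = 2560             # flattened (coordinate, piece type, rotation)

def build_obs(board_onehot: np.ndarray, extra_onehot: np.ndarray,
              legal_action_ids: list) -> dict:
    """Pack the flat observation and the binary legal-action mask."""
    observation = np.concatenate([board_onehot, extra_onehot]).astype(np.float32)
    assert observation.shape == (BOARD_CELLS + EXTRA_FEATURES,)
    action_mask = np.zeros(ACTION_SPACE_SIZE, dtype=np.int8)
    action_mask[legal_action_ids] = 1  # only placements reachable from the current state
    return {'observation': observation, 'action_mask': action_mask, 'to_play': -1}
```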

I've also done a small amount of testing with an action space of 10 instead, and some further runs with reanalyze_ratio set to 0.25, and so on. I am open to trying anything, as I just want to get something working :).
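
For reference, these smaller experiments amount to overrides like the following on top of the config dump below (the key names match total_config.py; the values are just the variants I tried):

```python
# Hypothetical overrides for the smaller experiments mentioned above.
small_experiment_overrides = {
    'policy': {
        'model': {
            'action_space_size': 10,  # reduced action space experiment
        },
        'reanalyze_ratio': 0.25,      # fraction of targets recomputed with the latest model
    },
}
```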

Let me know if there is any more information, resources, or context I can provide to facilitate my learning process.

Here are some graphs from the training run:

[training metric screenshots attached in the original issue]

Here is the total_config.py file for the run:

```python
exp_config = {
    'env': {
        'manager': {
            'episode_num': float("inf"),
            'max_retry': 1,
            'step_timeout': None,
            'auto_reset': True,
            'reset_timeout': None,
            'retry_type': 'reset',
            'retry_waiting_time': 0.1,
            'shared_memory': False,
            'copy_on_get': True,
            'context': 'fork',
            'wait_num': float("inf"),
            'step_wait_timeout': None,
            'connect_timeout': 60,
            'reset_inplace': False,
            'cfg_type': 'SyncSubprocessEnvManagerDict',
            'type': 'subprocess'
        },
        'stop_value': 10000000000,
        'n_evaluator_episode': 4,
        'env_id': 'botris',
        'render_mode': None,
        'replay_format': 'gif',
        'replay_name_suffix': 'eval',
        'replay_path': None,
        'act_scale': True,
        'obs_type': 'dict_encoded_board',
        'reward_normalize': False,
        'reward_norm_scale': 100,
        'reward_type': 'raw',
        'max_score': None,
        'delay_reward_step': 0,
        'prob_random_agent': 0.0,
        'max_episode_steps': 50000000,
        'is_collect': True,
        'ignore_legal_actions': False,
        'cfg_type': 'BotrisEnvDict',
        'type': 'botris',
        'import_names': ['zoo.botris.envs.botris_lightzero_env'],
        'collector_env_num': 8,
        'evaluator_env_num': 4
    },
    'policy': {
        'model': {
            'model_type': 'mlp',
            'continuous_action_space': False,
            'observation_shape': 144,
            'self_supervised_learning_loss': True,
            'categorical_distribution': True,
            'image_channel': 1,
            'frame_stack_num': 1,
            'num_res_blocks': 1,
            'num_channels': 64,
            'support_scale': 300,
            'bias': True,
            'discrete_action_encoding_type': 'one_hot',
            'res_connection_in_dynamics': True,
            'norm_type': 'BN',
            'analysis_sim_norm': False,
            'analysis_dormant_ratio': False,
            'harmony_balance': False,
            'lstm_hidden_size': 256,
            'action_space_size': 2560,
            'latent_state_dim': 256
        },
        'learn': {
            'learner': {
                'train_iterations': 1000000000,
                'dataloader': {
                    'num_workers': 0
                },
                'log_policy': True,
                'hook': {
                    'load_ckpt_before_run': '',
                    'log_show_after_iter': 100,
                    'save_ckpt_after_iter': 10000,
                    'save_ckpt_after_run': True
                },
                'cfg_type': 'BaseLearnerDict'
            }
        },
        'collect': {
            'collector': {
                'deepcopy_obs': False,
                'transform_obs': False,
                'collect_print_freq': 100,
                'cfg_type': 'SampleSerialCollectorDict',
                'type': 'sample'
            }
        },
        'eval': {
            'evaluator': {
                'eval_freq': 1000,
                'render': {
                    'render_freq': -1,
                    'mode': 'train_iter'
                },
                'figure_path': None,
                'cfg_type': 'InteractionSerialEvaluatorDict',
                'stop_value': 10000000000,
                'n_episode': 4
            }
        },
        'other': {
            'replay_buffer': {
                'type': 'advanced',
                'replay_buffer_size': 4096,
                'max_use': float("inf"),
                'max_staleness': float("inf"),
                'alpha': 0.6,
                'beta': 0.4,
                'anneal_step': 100000,
                'enable_track_used_data': False,
                'deepcopy': False,
                'thruput_controller': {
                    'push_sample_rate_limit': {
                        'max': float("inf"),
                        'min': 0
                    },
                    'window_seconds': 30,
                    'sample_min_limit_ratio': 1
                },
                'monitor': {
                    'sampled_data_attr': {
                        'average_range': 5,
                        'print_freq': 200
                    },
                    'periodic_thruput': {
                        'seconds': 60
                    }
                },
                'cfg_type': 'AdvancedReplayBufferDict'
            },
            'commander': {
                'cfg_type': 'BaseSerialCommanderDict'
            }
        },
        'on_policy': False,
        'cuda': True,
        'multi_gpu': False,
        'bp_update_sync': True,
        'traj_len_inf': False,
        'use_rnd_model': False,
        'sampled_algo': False,
        'gumbel_algo': False,
        'mcts_ctree': True,
        'collector_env_num': 8,
        'evaluator_env_num': 4,
        'env_type': 'not_board_games',
        'action_type': 'varied_action_space',
        'battle_mode': 'play_with_bot_mode',
        'monitor_extra_statistics': True,
        'game_segment_length': 50,
        'eval_offline': False,
        'cal_dormant_ratio': False,
        'analysis_sim_norm': False,
        'analysis_dormant_ratio': False,
        'transform2string': False,
        'gray_scale': False,
        'use_augmentation': False,
        'augmentation': ['shift', 'intensity'],
        'ignore_done': False,
        'update_per_collect': None,
        'replay_ratio': 0.25,
        'batch_size': 256,
        'optim_type': 'Adam',
        'learning_rate': 0.003,
        'target_update_freq': 100,
        'target_update_freq_for_intrinsic_reward': 1000,
        'weight_decay': 0.0001,
        'momentum': 0.9,
        'grad_clip_value': 10,
        'n_episode': 8,
        'num_simulations': 50,
        'discount_factor': 0.997,
        'td_steps': 5,
        'num_unroll_steps': 5,
        'reward_loss_weight': 1,
        'value_loss_weight': 0.25,
        'policy_loss_weight': 1,
        'policy_entropy_loss_weight': 0,
        'ssl_loss_weight': 2,
        'lr_piecewise_constant_decay': True,
        'threshold_training_steps_for_final_lr': 50000,
        'manual_temperature_decay': False,
        'threshold_training_steps_for_final_temperature': 100000,
        'fixed_temperature_value': 0.25,
        'use_ture_chance_label_in_chance_encoder': False,
        'reanalyze_noise': True,
        'reuse_search': False,
        'collect_with_pure_policy': False,
        'use_priority': False,
        'priority_prob_alpha': 0.6,
        'priority_prob_beta': 0.4,
        'root_dirichlet_alpha': 0.3,
        'root_noise_weight': 0.25,
        'random_collect_episode_num': 0,
        'eps': {
            'eps_greedy_exploration_in_collect': False,
            'type': 'linear',
            'start': 1.0,
            'end': 0.05,
            'decay': 100000
        },
        'cfg_type': 'EfficientZeroPolicyDict',
        'lstm_horizon_len': 5,
        'type': 'efficientzero',
        'import_names': ['lzero.policy.efficientzero'],
        'model_path': None,
        'device': 'cuda',
        'reanalyze_ratio': 0.0,
        'eval_freq': 200,
        'replay_buffer_size': 1000000
    },
    'exp_name': 'data_ez/botris_efficientzero_ns50_upcNone_rer0.0_seed0',
    'seed': 0
}
```
puyuan1996 commented 4 weeks ago

Sorry for the late response.

If possible, please submit a PR so we can review your specific code and have a more concrete discussion. You're also welcome to raise further related questions. Thanks for your attention.

lunathanael commented 4 weeks ago

Thank you for the response!

I appreciate the notes and the helpful advice, and I'm grateful you took the time to share your insight. I understand there is only so much that can be said in a GitHub issue without any code, so I will submit a PR soon and ask for your advice again. I do wonder, however, whether MuZero-series algorithms are the right direction for this kind of application. I am always open to suggestions for improving my current setup. Thank you again!

puyuan1996 commented 3 weeks ago

Best wishes!