opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)
https://huggingface.co/spaces/OpenDILabCommunity/ZeroPal
Apache License 2.0

Discussion: Requesting assistance and guidance with implementation of RL algorithms and models in the context of Tetris #265

Open lunathanael opened 1 month ago

lunathanael commented 1 month ago

Hello! I'm trying to apply the models and algorithms in this library to Tetris, specifically multiplayer Tetris, where players compete to clear lines efficiently and send as many lines as possible to their opponent. For now, I am developing a simple bot that goes beyond placing tetrominoes randomly.

Here's what I have so far:

- An environment modeled after the atari and game_2048 environments, which models can interact with and train on successfully.
- A modified reward system that incentivizes placing more blocks, with extra emphasis on any lines cleared (a rough sketch of the idea is below).
- A config file for EfficientZero, with over 48 single-GPU hours of training.
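
Below is a minimal sketch of the reward shaping I have in mind; the weights and the helper name are placeholders for illustration, not my actual environment code.

```python
# Hypothetical reward shaping for the Tetris environment described above.
# The constants and the helper name are illustrative placeholders.
PLACEMENT_REWARD = 0.1   # small bonus for every tetromino successfully placed
LINE_CLEAR_REWARD = 1.0  # larger per-line bonus, to emphasize clears

def shaped_reward(piece_placed: bool, lines_cleared: int) -> float:
    """Combine a per-placement bonus with a per-line-clear bonus."""
    reward = PLACEMENT_REWARD if piece_placed else 0.0
    reward += LINE_CLEAR_REWARD * lines_cleared
    return reward
```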

Here's some context on the environment being trained on: the observation is a 10-column x 8-row one-hot encoded board, stacked with additional one-hot encoded information such as the current piece, the pieces in the queue, and the held piece, for a total input size of 144. Each move is encoded by the coordinates of the placement, the type of piece placed, and its rotation, giving a one-hot encoded action space of size 2560. The model currently uses the mlp model type. It is worth noting that even after many training iterations, the console still warns that a lot of illegal moves are being attempted, despite the action mask being provided for the varied action space. It seems the model may not be learning the legal actions correctly.
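
To clarify the encoding, here is a minimal sketch of how the observation and action mask are packed, assuming the {'observation': ..., 'action_mask': ..., 'to_play': ...} dict format used by the other LightZero environments; the constants and the helper are placeholders, not my actual code.

```python
import numpy as np

BOARD_CELLS = 10 * 8                 # one-hot encoded 10x8 board
EXTRA_FEATURES = 144 - BOARD_CELLS   # current piece, queue, held piece, etc.
ACTION_SPACE_SIZE = 2560             # flattened (coordinate, piece type, rotation)

def build_obs(board_onehot: np.ndarray, extra_onehot: np.ndarray,
              legal_action_ids: list) -> dict:
    """Pack the flat observation and the binary legal-action mask."""
    observation = np.concatenate([board_onehot, extra_onehot]).astype(np.float32)
    assert observation.shape == (BOARD_CELLS + EXTRA_FEATURES,)
    action_mask = np.zeros(ACTION_SPACE_SIZE, dtype=np.int8)
    action_mask[legal_action_ids] = 1  # only placements reachable from the current state
    return {'observation': observation, 'action_mask': action_mask, 'to_play': -1}
```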

I've also done a small amount of testing with an action space of 10 instead, and some further runs with reanalyze_ratio set to 0.25, and so on. I am open to trying anything, as I just want to get something working :).
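
For reference, these smaller experiments amount to overrides like the following on top of the config dump below (the key names match total_config.py; the values are just the variants I tried):

```python
# Hypothetical overrides for the smaller experiments mentioned above.
small_experiment_overrides = {
    'policy': {
        'model': {
            'action_space_size': 10,  # reduced action space experiment
        },
        'reanalyze_ratio': 0.25,      # fraction of targets recomputed with the latest model
    },
}
```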

Let me know if there is any more information, resources, or context I can provide to facilitate my learning process.

Here are some graphs from the training run:

[training metric screenshots attached in the original issue]

Here is the total_config.py file for the run:

```python
exp_config = {
    'env': {
        'manager': {
            'episode_num': float("inf"),
            'max_retry': 1,
            'step_timeout': None,
            'auto_reset': True,
            'reset_timeout': None,
            'retry_type': 'reset',
            'retry_waiting_time': 0.1,
            'shared_memory': False,
            'copy_on_get': True,
            'context': 'fork',
            'wait_num': float("inf"),
            'step_wait_timeout': None,
            'connect_timeout': 60,
            'reset_inplace': False,
            'cfg_type': 'SyncSubprocessEnvManagerDict',
            'type': 'subprocess'
        },
        'stop_value': 10000000000,
        'n_evaluator_episode': 4,
        'env_id': 'botris',
        'render_mode': None,
        'replay_format': 'gif',
        'replay_name_suffix': 'eval',
        'replay_path': None,
        'act_scale': True,
        'obs_type': 'dict_encoded_board',
        'reward_normalize': False,
        'reward_norm_scale': 100,
        'reward_type': 'raw',
        'max_score': None,
        'delay_reward_step': 0,
        'prob_random_agent': 0.0,
        'max_episode_steps': 50000000,
        'is_collect': True,
        'ignore_legal_actions': False,
        'cfg_type': 'BotrisEnvDict',
        'type': 'botris',
        'import_names': ['zoo.botris.envs.botris_lightzero_env'],
        'collector_env_num': 8,
        'evaluator_env_num': 4
    },
    'policy': {
        'model': {
            'model_type': 'mlp',
            'continuous_action_space': False,
            'observation_shape': 144,
            'self_supervised_learning_loss': True,
            'categorical_distribution': True,
            'image_channel': 1,
            'frame_stack_num': 1,
            'num_res_blocks': 1,
            'num_channels': 64,
            'support_scale': 300,
            'bias': True,
            'discrete_action_encoding_type': 'one_hot',
            'res_connection_in_dynamics': True,
            'norm_type': 'BN',
            'analysis_sim_norm': False,
            'analysis_dormant_ratio': False,
            'harmony_balance': False,
            'lstm_hidden_size': 256,
            'action_space_size': 2560,
            'latent_state_dim': 256
        },
        'learn': {
            'learner': {
                'train_iterations': 1000000000,
                'dataloader': {
                    'num_workers': 0
                },
                'log_policy': True,
                'hook': {
                    'load_ckpt_before_run': '',
                    'log_show_after_iter': 100,
                    'save_ckpt_after_iter': 10000,
                    'save_ckpt_after_run': True
                },
                'cfg_type': 'BaseLearnerDict'
            }
        },
        'collect': {
            'collector': {
                'deepcopy_obs': False,
                'transform_obs': False,
                'collect_print_freq': 100,
                'cfg_type': 'SampleSerialCollectorDict',
                'type': 'sample'
            }
        },
        'eval': {
            'evaluator': {
                'eval_freq': 1000,
                'render': {
                    'render_freq': -1,
                    'mode': 'train_iter'
                },
                'figure_path': None,
                'cfg_type': 'InteractionSerialEvaluatorDict',
                'stop_value': 10000000000,
                'n_episode': 4
            }
        },
        'other': {
            'replay_buffer': {
                'type': 'advanced',
                'replay_buffer_size': 4096,
                'max_use': float("inf"),
                'max_staleness': float("inf"),
                'alpha': 0.6,
                'beta': 0.4,
                'anneal_step': 100000,
                'enable_track_used_data': False,
                'deepcopy': False,
                'thruput_controller': {
                    'push_sample_rate_limit': {
                        'max': float("inf"),
                        'min': 0
                    },
                    'window_seconds': 30,
                    'sample_min_limit_ratio': 1
                },
                'monitor': {
                    'sampled_data_attr': {
                        'average_range': 5,
                        'print_freq': 200
                    },
                    'periodic_thruput': {
                        'seconds': 60
                    }
                },
                'cfg_type': 'AdvancedReplayBufferDict'
            },
            'commander': {
                'cfg_type': 'BaseSerialCommanderDict'
            }
        },
        'on_policy': False,
        'cuda': True,
        'multi_gpu': False,
        'bp_update_sync': True,
        'traj_len_inf': False,
        'use_rnd_model': False,
        'sampled_algo': False,
        'gumbel_algo': False,
        'mcts_ctree': True,
        'collector_env_num': 8,
        'evaluator_env_num': 4,
        'env_type': 'not_board_games',
        'action_type': 'varied_action_space',
        'battle_mode': 'play_with_bot_mode',
        'monitor_extra_statistics': True,
        'game_segment_length': 50,
        'eval_offline': False,
        'cal_dormant_ratio': False,
        'analysis_sim_norm': False,
        'analysis_dormant_ratio': False,
        'transform2string': False,
        'gray_scale': False,
        'use_augmentation': False,
        'augmentation': ['shift', 'intensity'],
        'ignore_done': False,
        'update_per_collect': None,
        'replay_ratio': 0.25,
        'batch_size': 256,
        'optim_type': 'Adam',
        'learning_rate': 0.003,
        'target_update_freq': 100,
        'target_update_freq_for_intrinsic_reward': 1000,
        'weight_decay': 0.0001,
        'momentum': 0.9,
        'grad_clip_value': 10,
        'n_episode': 8,
        'num_simulations': 50,
        'discount_factor': 0.997,
        'td_steps': 5,
        'num_unroll_steps': 5,
        'reward_loss_weight': 1,
        'value_loss_weight': 0.25,
        'policy_loss_weight': 1,
        'policy_entropy_loss_weight': 0,
        'ssl_loss_weight': 2,
        'lr_piecewise_constant_decay': True,
        'threshold_training_steps_for_final_lr': 50000,
        'manual_temperature_decay': False,
        'threshold_training_steps_for_final_temperature': 100000,
        'fixed_temperature_value': 0.25,
        'use_ture_chance_label_in_chance_encoder': False,
        'reanalyze_noise': True,
        'reuse_search': False,
        'collect_with_pure_policy': False,
        'use_priority': False,
        'priority_prob_alpha': 0.6,
        'priority_prob_beta': 0.4,
        'root_dirichlet_alpha': 0.3,
        'root_noise_weight': 0.25,
        'random_collect_episode_num': 0,
        'eps': {
            'eps_greedy_exploration_in_collect': False,
            'type': 'linear',
            'start': 1.0,
            'end': 0.05,
            'decay': 100000
        },
        'cfg_type': 'EfficientZeroPolicyDict',
        'lstm_horizon_len': 5,
        'type': 'efficientzero',
        'import_names': ['lzero.policy.efficientzero'],
        'model_path': None,
        'device': 'cuda',
        'reanalyze_ratio': 0.0,
        'eval_freq': 200,
        'replay_buffer_size': 1000000
    },
    'exp_name': 'data_ez/botris_efficientzero_ns50_upcNone_rer0.0_seed0',
    'seed': 0
}
```
puyuan1996 commented 4 weeks ago

Sorry for the late response.

If possible, please submit a PR so we can review your specific code and have a more concrete discussion. You're also welcome to raise further related questions. Thanks for your attention.

lunathanael commented 4 weeks ago

Thank you for the response!

I appreciate the notes and the helpful advice, and I'm grateful you took the time to share your insight. I understand there is only so much that can be said in a GitHub issue without any code, so I will submit a PR soon and ask for your advice again. I do wonder, however, whether MuZero-series algorithms are the right direction for this kind of application. I am always open to suggestions for improving my current setup. Thank you again!

puyuan1996 commented 3 weeks ago

Best wishes!