opendilab / LightZero

[NeurIPS 2023 Spotlight] LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios (awesome MCTS)
https://huggingface.co/spaces/OpenDILabCommunity/ZeroPal
Apache License 2.0

How to solve reward dropping after reaching super-human level #97

Closed amineoui closed 10 months ago

amineoui commented 1 year ago

How can I solve the reward dropping after the agent reaches a super-human level? Or how can I save the model at that top level, before the reward starts dropping? (screenshot attached)

puyuan1996 commented 1 year ago
amineoui commented 1 year ago

algo: sampled_efficientzero. env: I'm simulating market trading as my custom env using GAF features. It roughly works when training on one month of data, but not on a one-year dataset. I'm also wondering how to feed direct or transformed data with a shape like (7, 9) using the mlp model_type.

this is my config:

    image_channel = 7
    shape = (7, 9, 9)
    file_name = 'shape/shape_7_9_9_1month.npy'

    collector_env_num = 16
    n_episode = 16
    evaluator_env_num = 4
    continuous_action_space = False
    K = 3  # num_of_sampled_actions
    num_simulations = 10
    update_per_collect = 10
    batch_size = 256
    max_env_step = int(1e9)
    reanalyze_ratio = 0.9

    data_sampled_efficientzero_config = dict(
        exp_name=f'result/stocks_sampled_efficientzero_ns{num_simulations}_upc{update_per_collect}_rr{reanalyze_ratio}_seed0',
        env=dict(
            env_name='my_custom_env',
            env_id='my_custom_env',
            env_file_name=file_name,
            obs_shape=shape,
            collector_env_num=collector_env_num,
            evaluator_env_num=evaluator_env_num,
            n_evaluator_episode=evaluator_env_num,
            manager=dict(shared_memory=False, ),
        ),
        policy=dict(
            model=dict(
                model_type='conv',  # 'mlp' or 'conv'
                observation_shape=shape,
                frame_stack_num=1,
                image_channel=image_channel,
                action_space_size=K,
                downsample=True,
                lstm_hidden_size=512,
                latent_state_dim=512,
                continuous_action_space=continuous_action_space,
                num_of_sampled_actions=K,
                discrete_action_encoding_type='one_hot',
                norm_type='BN',
            ),
            cuda=True,
            env_type='not_board_games',
            game_segment_length=400,
            # use_augmentation=True,
            update_per_collect=update_per_collect,
            batch_size=batch_size,
            optim_type='Adam',
            lr_piecewise_constant_decay=False,
            learning_rate=0.001,
            num_simulations=num_simulations,
            reanalyze_ratio=reanalyze_ratio,
            policy_loss_type='cross_entropy',
            n_episode=n_episode,
            eval_freq=int(2e2),
            replay_buffer_size=int(1e9),  # replay buffer capacity, in transitions
            collector_env_num=collector_env_num,
            evaluator_env_num=evaluator_env_num,
        ),
    )
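(On the side question above about feeding a transformed (7, 9) matrix with the mlp model_type: here is a minimal sketch of how the observation-related fields might change, assuming the env flattens the matrix to a 63-element vector. This is an illustration under those assumptions, not a verified working config.)

    # Hypothetical sketch (not verified): switch the model head to 'mlp'
    # for a flattened (7, 9) feature matrix, i.e. 7 * 9 = 63 values per step.
    mlp_model = dict(
        model_type='mlp',           # instead of 'conv'
        observation_shape=7 * 9,    # for 'mlp', a single int: the flattened observation length
        continuous_action_space=False,
        num_of_sampled_actions=3,   # K, as above
        action_space_size=3,
        discrete_action_encoding_type='one_hot',
        norm_type='BN',
    )
    # The custom env would then return a 1-D float32 vector of length 63 as the observation,
    # e.g. obs = features.astype(np.float32).reshape(-1).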
puyuan1996 commented 12 months ago

Hello,

Here are some recommended modifications to your configuration file, mainly focusing on the following settings:

    collector_env_num = 8
    n_episode = 8
    evaluator_env_num = 5
    num_simulations = 50
    update_per_collect = 200
    replay_buffer_size = int(1e6)
    game_segment_length = 400  # TODO: adjust according to your episode length

These suggestions aim to enhance the model's performance while balancing efficiency and memory usage. I hope you find them helpful.

amineoui commented 12 months ago

(7, 9, 9) means 7 images of size 9x9. I also found a problem with that: I have to declare it as (7, 9, 9) but feed it to the model as (9, 9, 7); this is the only way I got it to work. I apply this code to change the shape without affecting the images:

    def restack(self, gaf_images):
        # Re-order a channel-first (7, 9, 9) stack of GAF images into a
        # channel-last (9, 9, 7) array without altering the individual 9x9 images.
        images = []
        for gaf_image in gaf_images:  # iterate over the 7 channel images
            images.append(gaf_image)
        image_tensor = np.stack(images, axis=-1)  # stack along a new last axis
        return image_tensor

Is this correct, or did I make a mistake?
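(For what it's worth, the restack loop above is equivalent to a single axis permutation; a minimal sketch, assuming gaf_images is a channel-first (7, 9, 9) NumPy array:)

    import numpy as np

    def restack(gaf_images: np.ndarray) -> np.ndarray:
        # Move the channel axis (7) to the end: (7, 9, 9) -> (9, 9, 7),
        # leaving each 9x9 image untouched. Same result as the loop + np.stack above.
        return np.transpose(gaf_images, (1, 2, 0))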

Also, what about the neural network size and the hidden layers? I think that is also important for handling more data, or am I wrong? If so, what do you recommend changing, e.g. fc_policy_layers, fc_value_layers, etc. in the model?
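(As a hedged illustration of the head-size question: fc_policy_layers and fc_value_layers are passed into the model dict as lists of hidden-layer sizes, so widening them could look roughly like the sketch below. The values are illustrative, not tuned, and the exact arguments accepted depend on the model class in your LightZero version.)

    # Hypothetical sketch: wider policy / value heads for the existing model=dict(...) block.
    model_overrides = dict(
        fc_policy_layers=[256, 256],  # hidden sizes of the policy head MLP
        fc_value_layers=[256, 256],   # hidden sizes of the value head MLP
        latent_state_dim=512,         # latent width, already set in the config above
    )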

thank you so much @puyuan1996

puyuan1996 commented 12 months ago

Hello,

Best wishes for your experiments.

amineoui commented 11 months ago

Hello, Mr. @puyuan1996! I want to express my sincere gratitude for your kindness, and I must say that this repository is truly an astonishing work of AI art. Your effort and dedication shine brightly in this project, and it's genuinely commendable. Great job!

I'm trying to teach the AI to only observe and take no action until an expiration time, at which point it receives the reward and is then allowed to take another action. In other words, it keeps observing and learning with no action until the reward arrives, and only then can it act again.

Is this possible?

I'm thinking about these parameters, but I'm not sure. Could you please guide me?

    to_play = -1
    action_mask = np.array([1., 1., 1.], dtype=np.float32)
    obs = {'observation': to_ndarray(obs), 'action_mask': action_mask, 'to_play': to_play}

I tried to_play=-1 with action_mask = [0., 1., 0.], but it gives me an error in child_visit_segment: it ends up as an object array like [1].

I also tried to_play=-1 for the AI and to_play=1 for the other player, with action_mask = np.array([1., 1., 1.], dtype=np.float32).
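(A rough sketch of one way to express the observe-only window with the same masking mechanism used in the snippet above: only a hold/no-op action stays legal while waiting for the expiration reward. It assumes a 3-action discrete space with index 0 as the hold action and a single-player setting with to_play=-1; this is an illustration, not a confirmed LightZero recipe.)

    import numpy as np
    from ding.torch_utils import to_ndarray  # same helper used in the snippet above

    HOLD = 0  # assumed index of the "keep observing / do nothing" action

    def build_obs(features, waiting_for_reward: bool, to_play: int = -1):
        # While waiting for the expiration reward, mask out everything except HOLD;
        # once the reward has arrived, make all three actions legal again.
        if waiting_for_reward:
            action_mask = np.zeros(3, dtype=np.float32)
            action_mask[HOLD] = 1.0
        else:
            action_mask = np.ones(3, dtype=np.float32)
        return {
            'observation': to_ndarray(features),
            'action_mask': action_mask,
            'to_play': to_play,  # -1 for a single-player environment
        }

Keeping the mask a fixed-length float32 vector with at least one legal entry at every step tends to avoid object-array issues like the child_visit_segment one above, though the exact cause would need the full traceback to confirm.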

puyuan1996 commented 11 months ago

Hello,

Regarding your question about the special environment's MDP:

Regarding your question about action_mask and to_play:

Best Wishes.

amineoui commented 11 months ago

Hello Mr. @puyuan1996, thank you so much for your help and kindness. I notice that ckpt_best.pth.tar is not saved at every new best evaluation during training. What factor decides when ckpt_best.pth.tar is saved? It seems to save only 1 to 3 times, even after reaching many better scores; sometimes it saves and sometimes it doesn't. I don't clearly understand the factors or parameters that control it.

Also, I still sometimes get spikes on my GPU and memory limitations. As long as I don't feed high-resolution data, memory is fine; there are just spikes in the GPU 3D computation, but it still works and trains, it just takes some time.

(screenshot attached)

I really wonder why ckpt_best.pth.tar is not saved; in my last training it was saved only the first time, even though learning kept improving. Is it based on reward_std? Can I change it to something else?

I also get an error on eval after training finishes; the returns are a list of None: [None, None, None, ...] (screenshot attached)

puyuan1996 commented 11 months ago

Hello,

Regarding the storage frequency of model checkpoints (ckpt), LightZero's underlying implementation is based on DI-engine, which uses a hook mechanism to save the model's checkpoints. You can refer to the test file for more details. You can adjust the following settings under the policy field in the configuration file to change the storage frequency of the model checkpoints:

 policy=dict(
    ...
    learn=dict(
        learner=dict(
            hook=dict(
                save_ckpt_after_iter=200,
                save_ckpt_after_run=True,
                log_show_after_iter=100,
            ),
        ),
    ),
    ...
 ),

In this configuration, save_ckpt_after_iter=200 saves a checkpoint every 200 training iterations, save_ckpt_after_run=True saves a final checkpoint when the run ends, and log_show_after_iter=100 controls how often training logs are printed.

Regarding the return value error of eval_muzero, this is due to a change in the muzero_evaluator API. If you pull the latest code, this issue should no longer exist.

Good luck!