Hello, in order to provide a more precise solution to your question, we need more detailed information about your task. Could you please provide specifics about the environment, the algorithm used, and the config file you are currently working with?
If you're using the LightZero framework, the best model from the historical evaluations is saved at a path like this: zoo/classic_control/cartpole/config/data_mz_ctree/cartpole_muzero_seed0/ckpt/ckpt_best.pth.tar. You can load this model from the training process by specifying this path in the model_path field of your config file.
Please note that the path provided above is merely an example, and your actual path may vary based on your project setup and configurations.
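For illustration, here is a minimal sketch of how that could look (in recent LightZero configs the model_path field usually sits under the policy section; please verify its exact location in your version, and replace the example path with your own):

policy=dict(
    # Sketch only: load the best checkpoint saved by a previous run (example path).
    model_path='zoo/classic_control/cartpole/config/data_mz_ctree/cartpole_muzero_seed0/ckpt/ckpt_best.pth.tar',
    # ... other policy settings ...
),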
algo: sampled_efficientzero
env: I'm trying to simulate market trading as my custom env using GAF features. It works when training on a 1-month dataset, but it does not work when training on a 1-year dataset. I'm also wondering how to feed raw or transformed data with a shape like (7, 9) using the mlp model_type.
this is my config:
image_channel = 7
shape = (7, 9, 9)
file_name = 'shape/shape_7_9_9_1month.npy'

collector_env_num = 16
n_episode = 16
evaluator_env_num = 4
continuous_action_space = False
K = 3  # num_of_sampled_actions
num_simulations = 10
update_per_collect = 10
batch_size = 256
max_env_step = int(1e9)
reanalyze_ratio = 0.9

data_sampled_efficientzero_config = dict(
    exp_name=f'result/stocks_sampled_efficientzero_ns{num_simulations}_upc{update_per_collect}_rr{reanalyze_ratio}_seed0',
    env=dict(
        env_name='my_custom_env',
        env_id='my_custom_env',
        env_file_name=file_name,
        obs_shape=shape,
        collector_env_num=collector_env_num,
        evaluator_env_num=evaluator_env_num,
        n_evaluator_episode=evaluator_env_num,
        manager=dict(shared_memory=False, ),
    ),
    policy=dict(
        model=dict(
            model_type='conv',  # mlp, conv
            observation_shape=shape,
            frame_stack_num=1,
            image_channel=image_channel,
            action_space_size=K,
            lstm_hidden_size=512,
            latent_state_dim=512,
            continuous_action_space=continuous_action_space,
            num_of_sampled_actions=K,
            discrete_action_encoding_type='one_hot',
            norm_type='BN',
        ),
        cuda=True,
        env_type='not_board_games',
        game_segment_length=400,
        # use_augmentation=True,
        update_per_collect=update_per_collect,
        batch_size=batch_size,
        optim_type='Adam',
        lr_piecewise_constant_decay=False,
        learning_rate=0.001,
        num_simulations=num_simulations,
        reanalyze_ratio=reanalyze_ratio,
        policy_loss_type='cross_entropy',
        n_episode=n_episode,
        eval_freq=int(2e2),
        replay_buffer_size=int(1e9),  # the size/capacity of replay_buffer, in terms of transitions
        collector_env_num=collector_env_num,
        evaluator_env_num=evaluator_env_num,
    ),
Hello,
Here are some modification recommendations for your configuration file, mainly focusing on the following aspects:

Increase update_per_collect, which allows more frequent network updates after each round of data collection. With a larger update_per_collect, the network has the opportunity to update more often and potentially adapt to the collected data more quickly. This can be beneficial in scenarios where the data distribution is non-stationary or changes rapidly over time. Suggested values (see the sketch below for where they go):

collector_env_num = 8
n_episode = 8
evaluator_env_num = 5
num_simulations = 50
update_per_collect = 200
replay_buffer_size=int(1e6),
game_segment_length=400,  # TODO: adjust according to your episode length

These optimization suggestions aim to enhance the model's performance while maintaining a balance between efficiency and memory usage. I hope you find these recommendations helpful.
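For reference, a sketch of where these values would slot into the config you posted (only the changed fields are shown; everything else stays as in your current file):

collector_env_num = 8
n_episode = 8
evaluator_env_num = 5
num_simulations = 50
update_per_collect = 200

policy=dict(
    # ... unchanged model and training settings ...
    game_segment_length=400,  # TODO: adjust according to your episode length
    update_per_collect=update_per_collect,
    num_simulations=num_simulations,
    replay_buffer_size=int(1e6),
    collector_env_num=collector_env_num,
    evaluator_env_num=evaluator_env_num,
),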
(7, 9, 9) means 7 images of size 9x9. I also found a problem with that: I have to declare the shape as (7, 9, 9) but feed it to the model as (9, 9, 7); this is the only way I got it to work. I apply this code to change the shape without affecting the images:
def restack(self, gaf_images):
    images = []
    for i, gaf_image in enumerate(gaf_images):
        images.append(gaf_image)
    image_tensor = np.stack(images, axis=-1)
    return image_tensor
Is this correct or did I make a mistake?
Also, what about the neural network size and the hidden layers? I think that's also important for being able to handle more data, or am I wrong? If yes, what would you recommend changing, e.g. fc_policy_layers, fc_value_layers, ... in the model?
thank you so much @puyuan1996
Hello,
Your method to reshape the image stack from (7, 9, 9) to (9, 9, 7) seems correct. The restack function you wrote is essentially moving the first axis (which has 7 elements) to the end. Here is the simplified version of your function using numpy's built-in transpose function:
def restack(self, gaf_images):
    """
    Restack the images along the last dimension.

    Args:
        gaf_images (np.array): array of images with shape (7, 9, 9).
    Returns:
        image_tensor (np.array): reshaped array of images with shape (9, 9, 7).
    """
    image_tensor = np.transpose(gaf_images, (1, 2, 0))
    return image_tensor
This function will transpose the tensor from shape (7, 9, 9) to (9, 9, 7). However, for our implementation of the MuZero algorithm, the input to a conv type model should indeed be in the form of images with a shape like (7,9,9). In this case, the first dimension represents the number of channels, while the following two dimensions correspond to the width and height of the image, respectively. You may refer to the existing Atari MuZero configuration as an example.
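As a small illustration of that convention (a sketch only, independent of any LightZero API): if your pipeline produces channels-last data of shape (9, 9, 7), you can move the channel axis back to the front before handing the observation to a conv-type model:

import numpy as np

# Hypothetical example: 7 stacked 9x9 GAF images in channels-last layout.
obs_channels_last = np.zeros((9, 9, 7), dtype=np.float32)
# Move the channel axis to the front so the conv model sees (C, H, W) = (7, 9, 9).
obs_channels_first = np.transpose(obs_channels_last, (2, 0, 1))
assert obs_channels_first.shape == (7, 9, 9)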
Based on our experimental experience, a default LightZero configuration like this should provide adequate network capacity for tasks with complexity on par with Atari games. The performance degradation observed in your experiments is likely due to other factors, as described here.
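If you still want to experiment with larger heads, here is a sketch of where such settings would go (the parameter names fc_policy_layers and fc_value_layers are the ones you mentioned; please check that they are accepted by the model class in your LightZero version, and treat the values as hypothetical starting points):

policy=dict(
    model=dict(
        # ... existing model settings ...
        fc_value_layers=[64, 64],   # wider value head (hypothetical values)
        fc_policy_layers=[64, 64],  # wider policy head (hypothetical values)
    ),
),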
We recommend adjusting and optimizing your configuration parameters following the guidance provided earlier, and then running the experiment again. We anticipate that these revisions will lead to improved experimental outcomes.
Best wishes for your experiments.
Hello, Mr. @puyuan1996! I want to express my sincere gratitude for your kindness, and I must say that this repository is truly an astonishing work of AI art. Your effort and dedication shine brightly in this project, and it's genuinely commendable. Great job!
I'm trying to teach the AI to observe only and take no action until an expiration time, when it gets the reward; only then should it be allowed to take another action. I mean it stays observing and learning with no action until it gets the reward, and then it is allowed to take the next action.
Is this possible?
I'm thinking about these parameters, but I'm not sure. Please can you guide me:
to_play=-1
action_mask = np.array([1., 1., 1.], dtype=np.float32)
obs = {'observation': to_ndarray(obs), 'action_mask': action_mask, 'to_play': to_play}
I tried: to_play=-1 with action_mask [0., 1., 0.], but it gives me an error on child_visit_segment; it ends up like a [1] object array.
I also tried: to_play=-1 as the AI and to_play=1 as the other player, with action_mask = np.array([1., 1., 1.], dtype=np.float32).
Hello,
Regarding your question about the special environment's MDP:
Regarding your question about action_mask and to_play:

to_play is an integer variable used in board game environments, indicating the index of the player who needs to take the next action. Its value range is {1, 2}. However, for single-player game environments such as Atari and single-player board games like 2048, to_play should be set to -1, indicating that this is a single-player environment.

action_mask is an A-dimensional numpy array, representing the valid actions in environments where the action space can vary. For example, in the tic-tac-toe environment, where the original complete discrete action space is 9-dimensional, action_mask is a 9-dimensional numpy array. A value of 1 indicates that the corresponding action is valid, while a value of 0 indicates that the action is invalid. For environments like Atari with a fixed discrete action space, action_mask should be an all-1 numpy array. For continuous action space environments like MuJoCo, action_mask should be set to None, as the size of its valid action set is infinite.

We have not yet conducted tests in multi-player (more than two players) adversarial game environments. If you plan to integrate the LightZero algorithm into a multi-player game (like Dou Di Zhu), you may need to adjust the code accordingly. For multi-agent cooperative environments, you could consider using the concepts from Multi-Agent Reinforcement Learning (MARL). One initial idea is to regard it as an independent learning process. For specific implementation, you can refer to this paper, and our example cases in pettingzoo and GoBigger environments, detailed in this PR.

Best wishes.
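Putting these conventions together for your case, a single-player environment with a fixed 3-dimensional discrete action space (a minimal sketch mirroring the snippet you posted; make_lightzero_obs is just a hypothetical helper name, not a LightZero API):

import numpy as np

def make_lightzero_obs(raw_obs):
    # Fixed discrete action space: every action is always legal, so the mask is all ones.
    action_mask = np.array([1., 1., 1.], dtype=np.float32)
    return {
        'observation': np.asarray(raw_obs, dtype=np.float32),
        'action_mask': action_mask,
        'to_play': -1,  # single-player environment
    }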
Hello, Mr. @puyuan1996, thank you so much for your help and kindness. I noticed that ckpt_best.pth.tar is not saved for every new best evaluation during training. What factors decide when ckpt_best.pth.tar is saved? It saves only 1 to 3 times and no more, even when the agent later reaches much better scores; sometimes it saves! I don't clearly understand the factors or parameters that control it.
I also still sometimes have spikes on my GPU and memory limitations. For memory, it's enough to simply not feed high-resolution data; that works. Even with the GPU compute spikes, it works and trains, it just takes some time.
I really wonder why ckpt_best.pth.tar is not saved; in my last training it was saved only the first time, even though learning kept improving. Is it based on reward_std? Can I change it to something else?
I also have an error on eval after training finishes: the returns are a list of None: [None, None, None, ...]
Hello,
Regarding the storage frequency of model checkpoints (ckpt), LightZero's underlying implementation is based on DI-engine, which uses a hook mechanism to save the model's checkpoints. You can refer to the test file for more details. You can adjust the following settings under the policy field in the configuration file to change the storage frequency of the model checkpoints:
policy=dict(
...
learn=dict(
learner=dict(
hook=dict(
save_ckpt_after_iter=200,
save_ckpt_after_run=True,
log_show_after_iter=100,
),
),
),
...
),
In this configuration:
- The save_ckpt_after_iter parameter controls how often a model checkpoint is saved, i.e. after every given number of training iterations.
- The save_ckpt_after_run parameter indicates whether to save the model again after all specified training iterations have ended.
- The log_show_after_iter parameter sets how frequently training statistics are displayed on the command line.
Regarding the return value error of eval_muzero, this is due to a change in the muzero_evaluator API. If you pull the latest code, this issue should no longer exist.
Good luck!
How do I deal with the reward dropping after the agent reaches a superhuman level, or how can I save the model at that peak level before the reward starts dropping?