LightZero on HPC and other questions

selfsim commented 6 months ago

Hello, I am considering LightZero as my workhorse for MBRL research. I have a few questions, feel free to just link to relevant file(s) and I can parse it myself.

Can LightZero be deployed in distributed computing environments? I have access to an HPC that is configured via SLURM. Would I be able to take advantage of multiple CPU/GPU resources on multiple nodes?
Is there built-in support for experiment monitoring/profiling? If so, where are the docs for this? How extensive are the logging capabilities (network weights?, replay buffer prios?, etc...)

Thank you for this contribution and I look forward to hearing from you soon.

edit: read some of paper, questions answered.

puyuan1996 commented 6 months ago

Hello, we currently support multi-GPU training on a single node through PyTorch's DistributedDataParallel (DDP). Please refer to the discussion in this issue: https://github.com/opendilab/LightZero/issues/196. Multi-node training is not supported yet; we plan to consider adding it in a future release.

Experiment monitoring and performance analysis in LightZero are currently provided through DI-engine. For detailed information, please consult this document (https://github.com/opendilab/DI-engine-docs/blob/main/source/04_best_practice/training_generated_folders_zh.rst, Chinese version). For your convenience, we have provided an English summary below and will later integrate it into our codebase documentation. Thank you for your suggestion. Best regards.

puyuan1996 commented 6 months ago

Experiment monitoring and logging system in LightZero

LightZero generates log and checkpoint folders during the training process. The file tree generated is as follows:

cartpole_muzero
├── ckpt
│   ├── ckpt_best.pth.tar
│   ├── iteration_0.pth.tar
│   └── iteration_10000.pth.tar
├── log
│   ├── buffer
│   │   └── buffer_logger.txt
│   ├── collector
│   │   └── collector_logger.txt
│   ├── evaluator
│   │   └── evaluator_logger.txt
│   ├── learner
│   │   └── learner_logger.txt
│   └── serial
│       └── events.out.tfevents.1626453528.CN0014009700M.local
├── formatted_total_config.py
└── total_config.py
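
As a rough illustration of this layout, the sketch below collects the text logs, TensorBoard event files, and checkpoints from an experiment folder. The helper name find_artifacts is hypothetical and not part of LightZero; the glob patterns simply mirror the tree above and assume other experiments follow the same structure.

```python
from pathlib import Path


def find_artifacts(exp_dir):
    """Return the log, TensorBoard, and checkpoint files found under exp_dir."""
    root = Path(exp_dir)
    return {
        # One text log per component: log/<component>/<component>_logger.txt
        "text_logs": sorted(root.glob("log/*/*_logger.txt")),
        # TensorBoard event files live in log/serial
        "tb_events": sorted(root.glob("log/serial/events.out.tfevents.*")),
        # Model checkpoints live in ckpt
        "checkpoints": sorted(root.glob("ckpt/*.pth.tar")),
    }


if __name__ == "__main__":
    # "cartpole_muzero" is the example folder name from the tree above.
    for kind, paths in find_artifacts("cartpole_muzero").items():
        print(kind, [str(p) for p in paths])
```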

log/collector

In the collector folder, there is a file named collector_logger.txt, which records statistics generated while the collector interacts with the environment, such as:

episode_count: the number of episodes collected
envstep_count: the number of envsteps collected
train_sample_count: the number of training samples collected
avg_envstep_per_episode: the average number of envsteps per episode
avg_sample_per_episode: the average number of samples per episode
avg_envstep_per_sec: the average number of envsteps per second
avg_train_sample_per_sec: the average number of training samples per second
avg_episode_per_sec: the average number of episodes per second
collect_time: the time spent collecting
reward_mean: the average reward
reward_std: the standard deviation of the reward
each_reward: the reward for each episode of the collector's interaction with the environment
reward_max: the maximum reward
reward_min: the minimum reward
total_envstep_count: the total envstep count
total_train_sample_count: the total number of training samples
total_episode_count: the total number of episodes
total_duration: the total duration
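
To make the relationship between these fields concrete, here is a minimal sketch that recomputes a few of them from per-episode data. It is not LightZero's logging code: the function name and inputs are hypothetical, and the use of the population standard deviation for reward_std is an assumption, since the thread does not say which variant the logger uses.

```python
from statistics import mean, pstdev


def summarize_collection(episode_rewards, episode_envsteps, collect_time):
    """Recompute a few collector-style fields from per-episode data."""
    episode_count = len(episode_rewards)
    envstep_count = sum(episode_envsteps)
    return {
        "episode_count": episode_count,
        "envstep_count": envstep_count,
        "avg_envstep_per_episode": envstep_count / episode_count,
        "avg_envstep_per_sec": envstep_count / collect_time,
        "avg_episode_per_sec": episode_count / collect_time,
        "reward_mean": mean(episode_rewards),
        # Assumption: population standard deviation; the real logger may differ.
        "reward_std": pstdev(episode_rewards),
        "reward_max": max(episode_rewards),
        "reward_min": min(episode_rewards),
    }
```

Here each_reward corresponds to the episode_rewards list itself, and the total_* fields would be running sums of the per-call counts.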

log/evaluator

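In the evaluator folder, there is a file named evaluator_logger.txt, which contains information about the evaluator's interaction with the environment. It contains per-episode lines in the format shown first below, together with the following summary fields: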

[INFO]: [EVALUATOR]env x completes an episode, final reward: xxx, current episode: xxx
train_iter: the number of training iterations
ckpt_name: the model path, such as iteration_0.pth.tar
episode_count: the number of episodes evaluated
envstep_count: the number of envsteps evaluated
evaluate_time: the time spent on evaluation
avg_envstep_per_episode: the average number of envsteps per episode
avg_envstep_per_sec: the average number of envsteps per second
avg_time_per_episode: the average time spent per episode
reward_mean: the average reward
reward_std: the standard deviation of the reward
each_reward: the reward for each episode of the evaluator's interaction with the environment
reward_max: the maximum reward
reward_min: the minimum reward
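
If you want to pull per-episode results out of evaluator_logger.txt, a regular expression along these lines may be enough. This is only a sketch: the pattern is derived from the template line above, and it assumes the env index is an integer and the reward is a plain decimal number, which the thread does not confirm. The function name is hypothetical.

```python
import re

# Pattern built from the template line shown above; real log lines may differ.
EPISODE_RE = re.compile(
    r"\[EVALUATOR\]\s*env (\d+) completes an episode, "
    r"final reward: (-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?), "
    r"current episode: (\d+)"
)


def parse_episode_line(line):
    """Return (env_id, final_reward, episode_index), or None if no match."""
    match = EPISODE_RE.search(line)
    if match is None:
        return None
    env_id, reward, episode = match.groups()
    return int(env_id), float(reward), int(episode)
```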

log/learner

In the learner folder, there is a file named learner_logger.txt, which contains information about the learner. The following information is generated during MuZero training:

Policy neural network architecture:

[04-08 13:12:59] INFO     [RANK0]: DI-engine DRL Policy    base_learner.py:338
                          MuZeroModelMLP(
                            (representation_network): RepresentationNetworkMLP(
                              (fc_representation): Sequential(
                                (0): Linear(in_features=4, out_features=128, bias=True)
                                (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                                (2): ReLU(inplace=True)
                                (3): Linear(in_features=128, out_features=128, bias=True)
                              )

Learner information:
    Grid table:
    | Name  | cur_lr_avg | total_loss_avg |
    |-------|------------|----------------|
    | Value | 0.001000   | 0.098996       |
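
A grid table like the one above can be turned into a dictionary with a few lines of Python. The sketch below assumes the table always has one header row, one dashed separator row, and one value row with numeric cells, as in this example; parse_grid_table is a hypothetical helper, not a LightZero function.

```python
def parse_grid_table(text):
    """Map column names to float values for a one-row grid table."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(set(cell) <= {"-"} for cell in cells):
            continue  # skip the dashed separator row
        rows.append(cells)
    header, values = rows[0], rows[1]
    # The first column holds the row labels ("Name" / "Value"), so skip it.
    return {name: float(value) for name, value in zip(header[1:], values[1:])}
```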

log/serial

Information from the buffer, collector, evaluator, and learner is saved into a single events.out.tfevents file for use with TensorBoard.

LightZero writes all four sources into one TensorBoard file in the serial folder rather than into a separate folder per component. When running a large number of experiments, say n, separate files would leave 4*n TensorBoard files that are hard to tell apart, whereas one file per experiment keeps them manageable.
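
Besides viewing the file in the TensorBoard UI, you can read the scalars programmatically. The sketch below uses the EventAccumulator class that ships with the tensorboard package; load_scalars is a hypothetical helper name. It lists whatever scalar tags the file contains rather than assuming specific tag names, since those are not given in this thread.

```python
def load_scalars(event_path):
    """Return {tag: [(step, value), ...]} for every scalar tag in an event file."""
    # Imported here so the helper can be defined without tensorboard installed.
    from tensorboard.backend.event_processing.event_accumulator import (
        EventAccumulator,
    )

    accumulator = EventAccumulator(str(event_path))
    accumulator.Reload()  # parse the event file from disk
    return {
        tag: [(event.step, event.value) for event in accumulator.Scalars(tag)]
        for tag in accumulator.Tags()["scalars"]
    }
```

For example, calling load_scalars on the events.out.tfevents file inside cartpole_muzero/log/serial and printing the dictionary keys shows which metrics were logged.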

ckpt

In the ckpt folder, there are model parameter checkpoints:

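ckpt_best.pth.tar: the model that achieved the highest evaluation score.
iteration_<iter number>.pth.tar: models saved periodically during training, named after the training iteration at which they were saved (for example, iteration_0.pth.tar and iteration_10000.pth.tar in the tree above).

You can load a checkpoint using torch.load('ckpt_best.pth.tar').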

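The internal layout of a checkpoint is not described here, so a cautious first step is to load it on CPU and look at its top-level entries before relying on any particular key. The following is a minimal sketch; summarize_checkpoint is a hypothetical helper, and it assumes only that the file can be read by torch.load.

```python
def summarize_checkpoint(path):
    """Load a checkpoint on CPU and report the type of each top-level entry."""
    import torch  # imported lazily; requires PyTorch to be installed

    # map_location="cpu" lets the file load on a machine without a GPU.
    checkpoint = torch.load(path, map_location="cpu")
    if isinstance(checkpoint, dict):
        return {key: type(value).__name__ for key, value in checkpoint.items()}
    # Not a dict: report the type of the loaded object itself.
    return {"<root>": type(checkpoint).__name__}
```

Calling summarize_checkpoint("cartpole_muzero/ckpt/ckpt_best.pth.tar") then prints what the file actually contains.
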
selfsim commented 6 months ago

Thanks for the information.

LightZero on HPC and other questions #207