Closed: selfsim closed this issue 6 months ago.
Hello, we currently support multi-GPU training on a single node via PyTorch's Distributed Data Parallel (DDP). Please see the discussion in this issue: https://github.com/opendilab/LightZero/issues/196. As for multi-node training, we plan to add this functionality in a future release.
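To make the single-node DDP setup concrete, here is a minimal, hypothetical sketch of the wrapping involved. It is not LightZero's actual entry point (see issue #196 for that); it uses a single process with the `gloo` backend so it can run on one machine, whereas a real multi-GPU launch (e.g. via `torchrun`) would set a distinct rank per process and use `nccl`:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def run_ddp_step() -> torch.Size:
    """Single-process sketch of per-node DDP wrapping (illustration only)."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # world_size=1 keeps this runnable on one machine; a real launch sets
    # rank/world_size per spawned process.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    model = torch.nn.Linear(4, 2)   # stand-in for the policy network
    ddp_model = DDP(model)          # gradients are all-reduced across ranks
    out = ddp_model(torch.randn(8, 4))
    dist.destroy_process_group()
    return out.shape


shape = run_ddp_step()
print(shape)
```

With multiple processes, each rank would build the same model and DDP would synchronize gradients during `backward()`.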
At present, we have integrated the experiment monitoring and profiling support provided by DI-engine into LightZero. For details, please consult this document (https://github.com/opendilab/DI-engine-docs/blob/main/source/04_best_practice/training_generated_folders_zh.rst, in Chinese). For your convenience, we provide the following English summary, which we will later integrate fully into the codebase documentation. Thank you for your suggestion. Best regards.
### Experimental monitoring and logging system in LightZero
LightZero generates log (`log`) and checkpoint (`ckpt`) folders during training. The generated file tree is as follows:
```
cartpole_muzero
├── ckpt
│   ├── ckpt_best.pth.tar
│   ├── iteration_0.pth.tar
│   └── iteration_10000.pth.tar
├── log
│   ├── buffer
│   │   └── buffer_logger.txt
│   ├── collector
│   │   └── collector_logger.txt
│   ├── evaluator
│   │   └── evaluator_logger.txt
│   ├── learner
│   │   └── learner_logger.txt
│   └── serial
│       └── events.out.tfevents.1626453528.CN0014009700M.local
├── formatted_total_config.py
└── total_config.py
```
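As a sketch of how one might navigate this layout programmatically, the snippet below recreates the folder skeleton with empty placeholder files (the real files are produced by training) and picks out the latest `iteration_*` checkpoint by its numeric suffix:

```python
import tempfile
from pathlib import Path


def build_skeleton(base: Path) -> Path:
    """Recreate the folder layout shown above with empty placeholder files."""
    root = base / "cartpole_muzero"
    for sub in ("ckpt", "log/buffer", "log/collector",
                "log/evaluator", "log/learner", "log/serial"):
        (root / sub).mkdir(parents=True)
    for name in ("ckpt_best.pth.tar", "iteration_0.pth.tar",
                 "iteration_10000.pth.tar"):
        (root / "ckpt" / name).touch()
    return root


def latest_iteration_ckpt(root: Path) -> str:
    """Return the iteration checkpoint with the highest numeric suffix."""
    paths = root.glob("ckpt/iteration_*.pth.tar")
    return max(paths,
               key=lambda p: int(p.name.split("_")[1].split(".")[0])).name


root = build_skeleton(Path(tempfile.mkdtemp()))
print(latest_iteration_ckpt(root))  # iteration_10000.pth.tar
```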
#### `log/collector`

The `collector` folder contains a file named `collector_logger.txt`, which records information generated when the collector interacts with the environment.
#### `log/evaluator`

The `evaluator` folder contains a file named `evaluator_logger.txt`, which records information about the evaluator's interaction with the environment.
#### `log/learner`

The `learner` folder contains a file named `learner_logger.txt`, which records information about the learner. During MuZero training it includes, for example, the following:
Policy neural network architecture (truncated):

```
[04-08 13:12:59] INFO     [RANK0]: DI-engine DRL Policy              base_learner.py:338
                          MuZeroModelMLP(
                            (representation_network): RepresentationNetworkMLP(
                              (fc_representation): Sequential(
                                (0): Linear(in_features=4, out_features=128, bias=True)
                                (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                                (2): ReLU(inplace=True)
                                (3): Linear(in_features=128, out_features=128, bias=True)
                              )
                          ...
```
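For readers unfamiliar with this printout, the `fc_representation` stack above can be rebuilt in plain PyTorch as follows. This is a hypothetical re-creation for illustration; LightZero's actual `RepresentationNetworkMLP` may add further layers and options:

```python
import torch
import torch.nn as nn

# Re-creation of the fc_representation stack from the log above:
# 4-dim CartPole observation -> 128-dim latent state.
fc_representation = nn.Sequential(
    nn.Linear(in_features=4, out_features=128),
    nn.BatchNorm1d(128),
    nn.ReLU(inplace=True),
    nn.Linear(in_features=128, out_features=128),
)

obs = torch.randn(2, 4)          # batch of CartPole observations
latent = fc_representation(obs)  # latent state consumed by MuZero's dynamics model
print(latent.shape)              # torch.Size([2, 128])
```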
Learner information, rendered as a grid table:

| Name  | cur_lr_avg | total_loss_avg |
|-------|------------|----------------|
| Value | 0.001000   | 0.098996       |
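The `*_avg` columns are running averages of per-iteration metrics. A minimal sketch of how such averages can be accumulated is shown below; this is an illustration only, as the actual logger lives in DI-engine and differs in detail:

```python
from collections import defaultdict


class AvgTracker:
    """Accumulate per-iteration metrics and report their running averages."""

    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def update(self, **metrics):
        for name, value in metrics.items():
            self._sums[name] += value
            self._counts[name] += 1

    def averages(self):
        return {f"{name}_avg": self._sums[name] / self._counts[name]
                for name in self._sums}


tracker = AvgTracker()
tracker.update(cur_lr=0.001, total_loss=0.12)
tracker.update(cur_lr=0.001, total_loss=0.08)
print(tracker.averages())
```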
#### `log/serial`

Information relevant to the buffer, collector, evaluator, and learner is saved to a single `events.out.tfevents` file for use with TensorBoard. LightZero writes all TensorBoard data into one file under the `serial` folder rather than into separate per-component folders: when running a large number of experiments, say n, it is hard to keep track of 4n separate TensorBoard files, so all of them live in the `serial` folder.
#### `ckpt`

The `ckpt` folder contains the model parameter checkpoints, which can be loaded with `torch.load('ckpt_best.pth.tar')`.
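For illustration, here is a hypothetical save/load round trip for such a checkpoint. The dictionary keys inside LightZero's real `ckpt_best.pth.tar` / `iteration_*.pth.tar` files may differ:

```python
import os
import tempfile

import torch

# Save a stand-in model's parameters under a "model" key, then load them back.
model = torch.nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "ckpt_best.pth.tar")
torch.save({"model": model.state_dict()}, path)

state = torch.load(path, map_location="cpu")  # returns the dict saved above
model.load_state_dict(state["model"])         # restore the parameters
print(sorted(state))                          # ['model']
```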
Thanks for the information.
Hello, I am considering LightZero as my workhorse for MBRL research. I have a few questions, feel free to just link to relevant file(s) and I can parse it myself.
Can LightZero be deployed in distributed computing environments? I have access to an HPC that is configured via SLURM. Would I be able to take advantage of multiple CPU/GPU resources on multiple nodes?
Is there built-in support for experiment monitoring/profiling? If so, where are the docs for this? How extensive are the logging capabilities (network weights? replay buffer priorities? etc.)
Thank you for this contribution and I look forward to hearing from you soon.
edit: read some of paper, questions answered.