I previously wrote a Python program that calls `git grep` to search for config keys that are never used elsewhere in the code: search_config2.txt (renamed to .txt because GitHub does not support uploading .py files). You can use it to clean up unused config keys after refactoring.
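The core idea fits in a few lines. The following is a hypothetical sketch, not the attached script; it assumes PyYAML and a flat, single-level config file:

```python
# Hypothetical sketch of the idea (not the attached search_config2.py):
# read every key from a YAML config, then ask `git grep` whether the key
# is referenced anywhere in the tracked Python sources.
import subprocess

import yaml


def find_unused_keys(config_path: str) -> list[str]:
    with open(config_path, 'r', encoding='utf-8') as f:
        config = yaml.safe_load(f)
    unused = []
    for key in config:
        # `git grep -q` prints nothing and exits with status 0 on a match,
        # 1 when the key is not found anywhere.
        result = subprocess.run(['git', 'grep', '-q', '-F', key, '--', '*.py'])
        if result.returncode != 0:
            unused.append(key)
    return unused


if __name__ == '__main__':
    for key in find_unused_keys('config/base.yaml'):
        print(key)
```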
This PR is ready to be merged after the final fixes and some simple tests on the incoming branch.
By the way, the license of the refactor-v2 branch was previously changed to Apache 2.0, which will be the new license of our forked DiffSinger once refactor-v2 is merged into the main branch. With your agreement, your contributions will also be licensed under Apache 2.0 in this repository.
You have my consent, thanks.
Due to some unresolved performance issues during tests, this branch will be merged into a temporary branch. It should be merged into the main branch after these issues are addressed.
Performance is tightly linked to the grid resolution used when shuffling samples and sorting them by similar lengths. When samples are fully sorted, performance does not drop compared to the original codebase.
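For illustration, here is a simplified stand-in for the kind of grid-bucketed shuffling being described (not the actual sampler code in this PR; `grid_shuffled_order` is a hypothetical helper):

```python
import random
from collections import defaultdict


def grid_shuffled_order(lengths: list[int], grid: int, seed: int) -> list[int]:
    """Return sample indices sorted by length bin, shuffled within each bin."""
    rng = random.Random(seed)
    bins: defaultdict[int, list[int]] = defaultdict(list)
    for idx, length in enumerate(lengths):
        bins[length // grid].append(idx)  # round length down to the grid
    order: list[int] = []
    for key in sorted(bins):  # coarse length order is kept across bins
        indices = bins[key]
        rng.shuffle(indices)  # randomness lives only inside each bin
        order.extend(indices)
    return order


# grid=1 degenerates to a full sort (least padding, but no shuffling);
# a coarser grid shuffles more but pads more, hence the performance link.
print(grid_shuffled_order([12, 7, 13, 8, 30], grid=6, seed=0))
```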
The performance issues are addressed, so I changed the base branch back to refactor-v2.
Summary of the changes:

- Adapted `base_task` and `acoustic_task` to PyTorch Lightning 2.0.
- `'bf16'` precision is supported.
- The sampler is reimplemented as a subclass of the `Sampler` class.
- The `rank_zero` utility is used to identify the main process.
- (In `scripts/train.py`, the environment variable `TORCH_CUDNN_V8_API_ENABLED` is set to prevent excessive slowdown when using 16-bit precision. If it causes any problems, try commenting it out.)
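To illustrate the last two points, a minimal sketch assuming the `pytorch_lightning.utilities.rank_zero_only` decorator; `log_to_console` is a hypothetical example function, not code from this PR:

```python
import os

# Set before torch/cuDNN is initialized, as scripts/train.py does;
# comment this out if it causes problems on your setup.
os.environ['TORCH_CUDNN_V8_API_ENABLED'] = '1'

from pytorch_lightning.utilities import rank_zero_only


@rank_zero_only
def log_to_console(message: str) -> None:
    # Under DDP this body runs only on the main (rank-zero) process,
    # so multi-GPU training does not print duplicate lines.
    print(message)


log_to_console('training started')
```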
New parameter explanation:

- `pl_trainer_accelerator`, `pl_trainer_devices`, `pl_trainer_num_nodes`, `pl_trainer_strategy`, and `pl_trainer_precision`: see the `Trainer` section of the PyTorch Lightning 2.0 documentation for their usage. In particular, `pl_trainer_devices` can be:
  - `pl_trainer_devices: 'auto'`: select devices automatically
  - `pl_trainer_devices: 2`: use two accelerators, selected automatically
  - `pl_trainer_devices: [2, 3]`: use accelerators number 2 and 3
- `ddp_backend`: choose from `'gloo'`, `'nccl'`, or `'nccl_no_p2p'`.
- `sampler_frame_count_grid` in `config/base.yaml`: random shuffling of samples with similar sizes is now correctly supported. First, each sample length is rounded to a multiple of `sampler_frame_count_grid` (default 6); then, within each bin, samples are shuffled every epoch.
- `dataloader_prefetch_factor` in `config/base.yaml`: a setting for PyTorch's `DataLoader` (refer to the PyTorch documentation).
- `max_tokens` and `max_sentences` always control the batch size on a single device; the effective batch size is the per-device batch sizes summed across all devices. `accumulate_grad_batches` allows you to backpropagate multiple batches before each gradient descent step, effectively increasing the batch size further. For example, with 4 devices, `max_sentences=8`, and `accumulate_grad_batches=2`, the effective batch size is `4*8*2=64`. A combined example config is sketched below.
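To tie the new keys together, here is a hypothetical excerpt of a training config; every value below is illustrative only, not a recommendation from this PR:

```yaml
# Illustrative values; key names follow the list above.
pl_trainer_accelerator: 'gpu'
pl_trainer_devices: 4          # or 'auto', or a list like [2, 3]
pl_trainer_num_nodes: 1
pl_trainer_strategy: 'auto'
pl_trainer_precision: 'bf16'
ddp_backend: 'nccl'
sampler_frame_count_grid: 6
dataloader_prefetch_factor: 2
max_sentences: 8
accumulate_grad_batches: 2
# Effective batch size with these values: 4 devices * 8 * 2 = 64.
```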