Multi-node batched validation and improvement on strategy selection

Multi-node and batched validation
1. Binarizer changes:
  1. The train/valid split now preserves the text prefix order specified in config.
  2. Record the first dimension of every ndarray in the metadata (pickle).
  3. Also print the duration of each speaker separately.
  4. The file handle is now closed in load_meta_data.
2. Task code changes:
  1. All devices participate in validation, and the diffusion process accepts a batch of inputs.
  2. Custom DsTensorBoardLogger to support logging audio and images from multiple processes.
3. Plot changes: Added figsize for curves and pitch plots, and added a title ("spk - item name").
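As a rough illustration of what "all devices participate in validation" can look like, the sketch below splits a list of validation batches across ranks in round-robin fashion. This is an assumption made for illustration, not the scheme this PR necessarily uses; `shard_batches`, `rank`, and `world_size` are hypothetical names.

```python
def shard_batches(batches, rank, world_size):
    """Return the batches that a given rank would validate.

    Hypothetical helper: rank r takes every world_size-th batch,
    starting at position r. Not taken from the PR itself.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return batches[rank::world_size]


# Five batches of sample indices shared between two devices.
val_batches = [[0, 1], [2, 3], [4, 5], [6, 7], [8]]
rank0 = shard_batches(val_batches, rank=0, world_size=2)
rank1 = shard_batches(val_batches, rank=1, world_size=2)
```

With an odd number of batches the ranks end up with unequal shares, which is one reason a sampler has to deal explicitly with an insufficient number of samples.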
Multi-device strategy selection: This is now much simpler. It no longer pokes at PL internal logic but delegates to PL's strategy registry. Using name, one can select any available strategy and pass the remaining key-value pairs as kwargs. To disable NCCL P2P, use the new option nccl_p2p (defaults to true).
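A minimal sketch of how such a config might be interpreted, assuming the strategy options arrive as a dict with a `name` key and that `nccl_p2p` is passed alongside it. The function name, the dict layout, and the `"auto"` fallback are assumptions, not the PR's actual code; `NCCL_P2P_DISABLE` is the standard NCCL environment variable.

```python
import os


def parse_strategy_config(strategy_cfg, nccl_p2p=True):
    """Split a strategy config into (name, kwargs).

    Hypothetical helper: in the real code the name and kwargs would be
    handed to PL's strategy registry rather than returned.
    """
    kwargs = dict(strategy_cfg)        # copy so the caller's dict is untouched
    name = kwargs.pop("name", "auto")  # assumed default when no name is given
    if not nccl_p2p:
        # NCCL reads this variable to turn off peer-to-peer transport.
        os.environ["NCCL_P2P_DISABLE"] = "1"
    return name, kwargs


name, kwargs = parse_strategy_config(
    {"name": "ddp", "find_unused_parameters": False},
    nccl_p2p=False,
)
```

Keeping everything except `name` as plain kwargs means new strategy options need no dedicated handling in the parsing step.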
Changes due to side-effects and future considerations:
1. Dataset now admits size_key to specify other "sizes" from metadata for sorting samples.
2. Dataset now admits preload to load all samples into memory (for TPU).
3. DsBatchSampler is now unified as the only sampler. Fixed several bugs involving unwanted randomness, and it now correctly handles an insufficient number of samples.
4. Dataset get_item and collator return the indices of the item for metadata retrieval.
5. Moved fine-tuning logic to build model.
6. Removed the dynamic creation of validation losses in the validation step. This is hoisted to the init function, and validation metrics are now created in the same place.
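The sketch below illustrates two of the items above under stated assumptions: sorting samples by a size taken from metadata via a `size_key`, and a collator that returns the item indices so metadata can be looked up later. It is not the actual DsBatchSampler; `sorted_batches`, `collate`, `max_batch_size`, and the metadata key `"lengths"` are hypothetical.

```python
def sorted_batches(sizes, max_batch_size):
    """Group sample indices into batches of similar size.

    Indices are sorted by size (ties broken by index, so the result is
    deterministic). A batch is closed once count * largest size would
    exceed max_batch_size, i.e. the padded size of the batch.
    """
    order = sorted(range(len(sizes)), key=lambda i: (sizes[i], i))
    batches, current = [], []
    for idx in order:
        # Sizes are ascending, so sizes[idx] is the largest in the batch.
        if current and (len(current) + 1) * sizes[idx] > max_batch_size:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


def collate(samples):
    """Keep each item's index in the batch for later metadata retrieval."""
    return {"indices": [s["index"] for s in samples]}


metadata = {"lengths": [120, 30, 75, 30, 200]}  # hypothetical metadata
size_key = "lengths"
batches = sorted_batches(metadata[size_key], max_batch_size=240)
batch = collate([{"index": i} for i in batches[0]])
```

Here `batches` groups the three shortest samples together and leaves the two longest in batches of their own, and `batch["indices"]` lists the original positions of the grouped samples.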

Notes

Even though DeepSpeed is now supported, the checkpoint produced is not in the usual format. Please use it at your own discretion.
FSDP does not support gradient clipping by norm.
This PR does require a re-binarization of the datasets.

openvpi / DiffSinger

Multi-node batched validation and improvement on strategy selection #148
