Batch evaluation inference scripts

This issue (and its corresponding PR) will add evaluate.py, which runs a model on every item in an existing dataset.

Motivation

Sometimes people just want to hear how one person's songs would sound in another person's voice, which is what most SVC (singing voice conversion) pipelines do.

This may also help find edge cases or labeling mistakes, so that the most severe ones can be fixed quickly, especially when combined with an SVC pipeline.

A possible workflow for fully evaluating a dataset:

1. Run evaluate.py on the target dataset with a model trained on accurately labeled datasets (the reference model) to get the reference audio samples.
2. (Optional but recommended) Train an SVC model on the target dataset and apply it to the evaluation results to convert their timbre to that of the target dataset.
3. Calculate the mel spectrogram likelihood between the reference samples and the original recordings in the target dataset.
4. Sort the likelihood values in ascending order and inspect the items with the lowest likelihood.
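
The last two steps could be sketched roughly as below. This is only an illustration, not the planned implementation: the issue does not define the likelihood, so the sketch assumes a per-bin Gaussian log-likelihood with a fixed standard deviation (equivalent to a negated, scaled MSE plus a constant). The helper names `mel_log_likelihood` and `rank_items`, the `sigma` parameter, and the dict-of-arrays layout are all hypothetical.

```python
import numpy as np


def mel_log_likelihood(ref_mel, target_mel, sigma=1.0):
    """Mean per-bin Gaussian log-likelihood of target_mel given ref_mel.

    Both inputs have shape [frames, mel_bins]. The fixed-variance Gaussian
    is an assumption made for this sketch, not something the issue specifies.
    """
    ref = np.asarray(ref_mel, dtype=np.float64)
    tgt = np.asarray(target_mel, dtype=np.float64)
    n = min(len(ref), len(tgt))  # tolerate frame counts that differ slightly
    if n == 0:
        raise ValueError("empty mel spectrogram")
    diff = tgt[:n] - ref[:n]
    log_prob = -0.5 * (diff / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return float(np.mean(log_prob))


def rank_items(ref_mels, target_mels):
    """Return (item_name, likelihood) pairs, lowest likelihood first."""
    scores = {
        name: mel_log_likelihood(ref_mels[name], target_mels[name])
        for name in target_mels
        if name in ref_mels
    }
    return sorted(scores.items(), key=lambda kv: kv[1])


# Toy data: item_b deviates far more from its reference than item_a does,
# so it should appear first in the ranking as the most suspicious item.
recording = np.zeros((100, 80))
target_mels = {"item_a": recording, "item_b": recording}
ref_mels = {"item_a": recording + 0.01, "item_b": recording + 2.0}

ranking = rank_items(ref_mels, target_mels)
for name, score in ranking:
    print(f"{name}\t{score:.4f}")
```

Items at the top of the ranking are the ones where the reference rendering and the original recording disagree most, which makes them the first candidates to check for labeling mistakes.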

TODO

[ ] Support mel likelihood metrics during acoustic model training.
[x] Make binarizers reusable, for extracting parameters without saving them.
[ ] Implement testing-related methods in task classes. They should be able to generate both audio samples and likelihood values.
[ ] Add the main CLI entry point, evaluate.py, with appropriate control options.

openvpi / DiffSinger

Batch evaluation inference scripts #155