English | 简体中文
SOFA (Singing-Oriented Forced Aligner) is a forced alignment tool designed specifically for singing voice.
On singing data, SOFA has several advantages over MFA (Montreal Forced Aligner).
Use `git clone` to download the code from this repository, then create a Python 3.8 environment and install the dependencies:

```shell
conda create -n SOFA python=3.8 -y
conda activate SOFA
pip install -r requirements.txt
```
Download the model file. You can find trained models in the pretrained-model sharing category of the discussions section; they use the file extension `.ckpt`.
Place the dictionary file in the `/dictionary` folder. The default dictionary is `opencpop-extension.txt`.
Prepare the data for forced alignment and place it in a folder (by default the `/segments` folder), in the following format:
```
- segments
    - singer1
        - segment1.lab
        - segment1.wav
        - segment2.lab
        - segment2.wav
        - ...
    - singer2
        - segment1.lab
        - segment1.wav
        - ...
```
Ensure that each `.wav` file and its corresponding `.lab` file are in the same folder. A `.lab` file is the transcription of the `.wav` file with the same name; the transcription's file extension can be changed with the `--in_format` parameter.
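Before running alignment, it can help to confirm the layout with a short script that flags any `.wav` missing its transcription. This is a hypothetical helper, not part of SOFA; the folder and extension names mirror the defaults above:

```python
from pathlib import Path

def check_pairs(root="segments", trans_ext="lab"):
    """List .wav files under `root` with no same-named transcription file."""
    missing = []
    for wav in sorted(Path(root).rglob("*.wav")):
        if not wav.with_suffix("." + trans_ext).exists():
            missing.append(str(wav))
    return missing
```

An empty result means every `.wav` has a transcription beside it; pass a different `trans_ext` if you also changed `--in_format`.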
After the transcription is converted into a phoneme sequence by the g2p module, it is fed into the model for alignment. For example, with the default `DictionaryG2P` module and the `opencpop-extension` dictionary, the transcription `gan shou ting zai wo fa duan de zhi jian` is converted, based on the dictionary, into the phoneme sequence `g an sh ou t ing z ai w o f a duan d e zh ir j ian`. For how to use other g2p modules, see the g2p module usage instructions.
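The dictionary lookup behind this conversion can be pictured as a simple word-to-phoneme table. The sketch below is illustrative only (it is not the actual `DictionaryG2P` code); the two entries imitate the `opencpop-extension` style:

```python
def load_dictionary(lines):
    """Parse dictionary lines of the form 'word<TAB>phone1 phone2 ...'."""
    table = {}
    for line in lines:
        word, phones = line.strip().split("\t")
        table[word] = phones.split()
    return table

def g2p(text, table):
    """Convert a space-separated transcription into a phoneme sequence."""
    phonemes = []
    for word in text.split():
        phonemes.extend(table[word])
    return phonemes

table = load_dictionary(["gan\tg an", "shou\tsh ou"])
print(g2p("gan shou", table))  # ['g', 'an', 'sh', 'ou']
```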
## Command-line inference

Use `python infer.py` to perform inference.
Parameters that need to be specified:

- `--ckpt`: (required) the path to the model weights;
- `--folder`: the folder containing the data to be aligned (default: `segments`);
- `--in_format`: the file extension of the transcriptions (default: `lab`);
- `--out_formats`: the annotation formats to output; multiple formats can be specified, separated by commas (default: `TextGrid,htk,trans`);
- `--save_confidence`: also output confidence scores;
- `--dictionary`: the dictionary file (default: `dictionary/opencpop-extension.txt`).

```shell
python infer.py --ckpt checkpoint_path --folder segments_path --dictionary dictionary_path --out_formats output_format1,output_format2...
```
## Retrieve the Final Annotation

The final annotation is saved in a folder named after the annotation format you chose; this folder is located in the same directory as the wav files used for inference.
You can add `-m` during inference; this finds the most probable contiguous segment within the given phoneme sequence, rather than requiring all the phonemes to be used.

## Training

Place the training data in the `data` folder in the following format:
```
- data
    - full_label
        - singer1
            - wavs
                - audio1.wav
                - audio2.wav
                - ...
            - transcriptions.csv
        - singer2
            - wavs
                - ...
            - transcriptions.csv
    - weak_label
        - singer3
            - wavs
                - ...
            - transcriptions.csv
        - singer4
            - wavs
                - ...
            - transcriptions.csv
    - no_label
        - audio1.wav
        - audio2.wav
        - ...
```
Regarding the format of `transcriptions.csv`, see: https://github.com/qiuqiao/SOFA/discussions/5

Note that:

- `transcriptions.csv` only needs a correct relative path to the `wavs` folder;
- the `transcriptions.csv` in `weak_label` does not need a `ph_dur` column.
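Before binarizing, a quick consistency check can confirm that every row of a `transcriptions.csv` refers to an existing file in `wavs`. This sketch assumes a `name` column holding the wav file stem, which is an assumption on my part; see the linked discussion for the authoritative format:

```python
import csv
from pathlib import Path

def missing_wavs(singer_dir):
    """Return names listed in transcriptions.csv whose wav file is absent.

    Assumes a `name` column containing the file stem (an assumption;
    adjust to the format described in the discussion thread if yours differs).
    """
    singer = Path(singer_dir)
    missing = []
    with open(singer / "transcriptions.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if not (singer / "wavs" / (row["name"] + ".wav")).exists():
                missing.append(row["name"])
    return missing
```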
1. Modify `binarize_config.yaml` as needed, then execute `python binarize.py`;
2. Modify `train_config.yaml` as needed, then execute `python train.py -p path_to_your_pretrained_model`;
3. Monitor training progress with `tensorboard --logdir=ckpt/`.

## Evaluation

To measure the performance of a model, it is useful to calculate objective evaluation metrics between the predictions (force-aligned labels) and the targets (manual labels), especially in a k-fold cross-validation.
Some useful metrics are:
To evaluate your model on a specific dataset, first run inference to get all the predictions. Put your predictions and targets in different folders, with the same filenames and relative paths, containing the same phone sequences except for spaces. The script currently supports only the TextGrid format.

Run the following command:

```shell
python evaluate.py <PRED_DIR> <TARGET_DIR> -r -s
```
where `PRED_DIR` is the directory containing all predictions and `TARGET_DIR` is the directory containing all targets.
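One metric commonly used for this kind of comparison is the mean boundary error: the average time offset between predicted and reference phone boundaries. The sketch below works on already-parsed intervals and is not the actual `evaluate.py` logic:

```python
def mean_boundary_error(pred, target):
    """Mean absolute offset (in seconds) between corresponding boundaries.

    `pred` and `target` are lists of (start, end, phone) tuples covering
    the same phone sequence.
    """
    assert len(pred) == len(target), "phone sequences must match in length"
    errors = []
    for (p_start, p_end, _), (t_start, t_end, _) in zip(pred, target):
        errors.append(abs(p_start - t_start))
        errors.append(abs(p_end - t_end))
    return sum(errors) / len(errors)

pred = [(0.00, 0.48, "g"), (0.48, 1.02, "an")]
target = [(0.00, 0.50, "g"), (0.50, 1.00, "an")]
print(mean_boundary_error(pred, target))  # ~0.015 s
```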
Options:

- `-r`, `--recursive`: compare files in subdirectories recursively;
- `-s`, `--strict`: use strict mode (raise errors instead of skipping when the phones are not identical);
- `--ignore`: ignore some phone marks (default: `AP,SP,<AP>,<SP>,,pau,cl`).

The script will calculate: