huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

run_parler_tts_training.py gives datasets.table.CastError error and failure #57

Open duringleaves opened 1 month ago

duringleaves commented 1 month ago

I've run through the steps to train a single voice, and everything goes well until the actual Fine-tuning Parler-TTS step, where I'm hitting a wall. It seems the previous Dataset Annotation instructions don't create all of the expected values?

File "/Users/durin/AI/Projects/.parler-env/lib/python3.12/site-packages/datasets/table.py", line 2249, in cast_table_to_schema
    raise CastError(
datasets.table.CastError: Couldn't cast
text: string
utterance_pitch_mean: float
utterance_pitch_std: float
snr: double
c50: double
speaking_rate: string
phonemes: string
noise: string
reverberation: string
speech_monotony: string
-- schema metadata --
huggingface: '{"info": {"features": {"text": {"dtype": "string", "_type":' + 502
to
{'text': Value(dtype='string', id=None), 'utterance_pitch_mean': Value(dtype='float32', id=None), 'utterance_pitch_std': Value(dtype='float32', id=None), 'snr': Value(dtype='float64', id=None), 'c50': Value(dtype='float64', id=None), 'speaking_rate': Value(dtype='string', id=None), 'phonemes': Value(dtype='string', id=None), 'noise': Value(dtype='string', id=None), 'reverberation': Value(dtype='string', id=None), 'speech_monotony': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=44100, mono=True, decode=True, id=None)}
because column names don't match

Just to confirm, when I run the previous step to view a sample from the dataset, here are the full contents:

{'text': " Tonight at 11 on Utah's Talk Radio.", 'utterance_pitch_mean': 114.15099334716797, 'utterance_pitch_std': 30.472904205322266, 'snr': 60.48295974731445, 'c50': 59.44480895996094, 'speaking_rate': 'quite slowly', 'phonemes': " tʌnaɪt æt ɑn jutɔ'ɛs tɔk ɹeɪdioʊ . .", 'noise': 'slightly clear', 'reverberation': 'very confined sounding', 'speech_monotony': 'quite monotone', 'text_description': "'Very clear recording, but the speech is very monotone and slightly muffled by the recording.'"}
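To narrow down what "column names don't match" refers to, the two column lists in the traceback above can be compared directly. This is only a minimal diagnostic sketch in plain Python: the column names are copied from the error message, and `diff_columns` is a hypothetical helper written for this illustration, not part of `datasets` or Parler-TTS.

```python
# Columns found in the data being cast (first half of the CastError message).
source_columns = {
    "text", "utterance_pitch_mean", "utterance_pitch_std", "snr", "c50",
    "speaking_rate", "phonemes", "noise", "reverberation", "speech_monotony",
}

# Columns in the target features (the dict after "to" in the CastError message).
target_columns = source_columns | {"audio"}


def diff_columns(source, target):
    """Hypothetical helper: return (missing_from_source, unexpected_in_source)."""
    source, target = set(source), set(target)
    return sorted(target - source), sorted(source - target)


missing, unexpected = diff_columns(source_columns, target_columns)
print("expected by target but missing from source:", missing)
print("present in source but not expected by target:", unexpected)
```

For the schemas in this traceback, the only difference is the `audio` column: the target features include it, while the data being cast does not. The sketch does not say why that column is absent, only where the two schemas diverge.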

Not sure what I might be doing wrong, and I won't pretend to be an expert at this, so any guidance would be appreciated.

ylacombe commented 1 month ago

Hey @duringleaves, could you share the datasets you want to fine-tune the model on? Are you using a single dataset or multiple ones?