lucidrains / naturalspeech2-pytorch

Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch
MIT License
1.26k stars, 100 forks

Trainer support for audio file, prompt pairs #20

Open: deepglugs opened this issue 1 year ago

deepglugs commented 1 year ago

Most of my data is split into file.wav and file.txt pairs, or stored in JSON files with "path/to/file.wav": "the transcription of the audio" mappings. It looks like Trainer only supports audio files. Is there a way to get prompt support?
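For reference, the two layouts described above could be normalized into (audio path, transcription) pairs along these lines. This is only a sketch using the standard library. The function names `pairs_from_sidecar_files` and `pairs_from_json` are hypothetical, not part of naturalspeech2-pytorch, and how such pairs would be fed into Trainer is left open.

```python
import json
from pathlib import Path


def pairs_from_sidecar_files(root):
    """Collect (wav_path, transcription) pairs from file.wav / file.txt siblings.

    Hypothetical helper: assumes each transcription sits next to its audio
    file and shares the same stem.
    """
    pairs = []
    for wav_path in sorted(Path(root).rglob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if not txt_path.is_file():
            continue  # skip audio files that have no transcription
        text = txt_path.read_text(encoding="utf-8").strip()
        pairs.append((wav_path, text))
    return pairs


def pairs_from_json(json_path):
    """Collect pairs from a JSON object shaped like
    {"path/to/file.wav": "the transcription of the audio"}.

    Hypothetical helper: paths are returned exactly as written in the file.
    """
    with open(json_path, encoding="utf-8") as f:
        mapping = json.load(f)
    return [(Path(path), text) for path, text in mapping.items()]
```

Both helpers return the same list-of-tuples shape, so a dataset wrapper would only need to handle one format regardless of how the data is stored on disk.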

deepglugs commented 1 year ago

I'm trying to implement this feature myself, but I'm confused about the prompt input. process_prompt expects prompt.ndim == 2 if it's a "raw prompt". In my mind, a prompt is a single dimension of text: ["this is the prompt"]. What is the other dimension for?

lexkoro commented 1 year ago

(batch, embedding) would give ndim == 2. Btw, this is still WIP; I don't think it is trainable.