rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/

Trying to format a dataset I've already prepped #447

Open · JasonBlain opened this issue 1 month ago

JasonBlain commented 1 month ago

Hey, thanks for the previous assistance re: modifying the ONNX output. Worked great.

I'm working on a custom accent module with its own base dataset, so I can't lean on any of the existing checkpoints. I've already curated all my clips and their text, and phonemized the text using a routine from the piper-onnx repo, since I was able to call it inside Unity to turn text into phonemes live.

The thing is, since it's an accented voice, I had to do custom IPA transcription for some words. There are also a lot of homographs I'm deliberately disambiguating in my dataset ("read" past vs. present tense, "lead", "conflict", "separate"), as well as US-regional pronunciations that don't match either typical US or UK pronunciation.
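
For example, these are the kinds of pairs I'm encoding explicitly in my transcriptions (standard IPA here just to illustrate the distinction; my actual dataset uses the accented variants):

read (present) ɹiːd  vs.  read (past) ɹɛd
lead (the verb) liːd  vs.  lead (the metal) lɛd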

My main question is: what's the format of the preprocess output? I can't find a good metadata.csv example anywhere in the checkpoints. Do I need to write my own .csv, and potentially a .jsonl as well?

I've already got a bunch of things preformatted in .txt files the way VITS would take them, with lines similar to the following:

bjSpeech_EnglishOnyms-0.wav|Here are some Phonemes you should know:|hˈɪɹ ɑːɹ sˌʌm fˈoʊniːmz juː ʃˈʊd nˈoʊ:

I can redo that in any format, but as far as I can tell I should write the file myself and point the training config at it, rather than preprocessing with the Python script, which may not catch all the grammar and pronunciation tricks I purposefully included to disambiguate tone/accent/inflection/word sense, and would end up destroying or overwriting my custom IPA with the wrong IPA.
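
As a rough sketch of what I mean by redoing it myself, this is the throwaway conversion I'm picturing. It just parses my existing VITS-style lines into (id, text, phonemes) so I can re-emit them in whatever layout Piper actually wants; the input file name and the LJSpeech-style id|text output at the bottom are guesses on my part, not anything I've confirmed against piper_train:

from pathlib import Path

def parse_vits_lines(txt_path):
    # Parse my existing "file.wav|text|phonemes" lines into (id, text, phonemes).
    entries = []
    for line in Path(txt_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        wav_name, text, phonemes = line.split("|", 2)
        utt_id = wav_name[:-4] if wav_name.endswith(".wav") else wav_name
        entries.append((utt_id, text.strip(), phonemes.strip()))
    return entries

# Guess: re-emit as an LJSpeech-style metadata.csv (id|text). If Piper can take
# pre-phonemized text directly, the phonemes column would go here instead.
with open("metadata.csv", "w", encoding="utf-8") as f:
    for utt_id, text, phonemes in parse_vits_lines("bjSpeech_EnglishOnyms.txt"):
        f.write(f"{utt_id}|{text}\n")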

Please advise on:

  1. The proper format for metadata.csv (my current best guess is sketched after this list)
  2. Does metadata.csv need to sit in a folder that has a /wav subfolder containing all the audio files?
  3. Should I drop the .wav suffix from the entries in metadata.csv, since it's implied in the config?
  4. What's the proper training config format? I'm coming from a recent successful mini-test with VITS, but that config.json doesn't match what you're using.

Let me know. I feel like I'm close to being able to train a whole new custom model, but since I can't really use the preprocess routine or the checkpoints, I'm high and dry on example documents: the training guide isn't clear about how the output lines should actually look in a given file. It would be nice to have an example line in the docs showing exactly what Piper is expecting.

JasonBlain commented 1 month ago

Or is it a matter of finding the right insertion point to feed in the lines and their IPA?

I see that in the checkpoint files of various models the .jsonl lines reference a spectrogram .pt type file, which I obviously don't have beforehand, but my IPA and file paths should all be ready to go.
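
For context, the lines I'm looking at are shaped roughly like this (reconstructed from memory, so the field names and values are placeholders rather than anything authoritative), and I'm assuming the .pt paths are cache files the trainer writes rather than something I have to provide up front:

{"phoneme_ids": [25, 14, ...], "audio_path": "wav/utt-0.wav", "audio_spec_path": "cache/utt-0.spec.pt", "text": "..."}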