Many Words Cannot be Correctly Read

bxclib2 commented 2 months ago

prompt = "google is a great website to let you find your niche."
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids).to(torch.float32)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

I find that the words "google", "website", and "niche" are usually not pronounced correctly. These words are not very rare.

https://github.com/user-attachments/assets/874d02dc-f4c3-432e-8731-e0dd17d442f9

ylacombe commented 1 month ago

That's an issue with:

- the dataset the model was trained on: it's not diverse enough, so some tokens were never seen
- the tokenizer, which could probably do better: because it was trained for English, many English words are tokenized as a single whole token

Let's visualize your text example.

As you can see, google, website and niche each correspond to a single token. I'd say the model is just trying to pronounce tokens it has never seen before.
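To reproduce this kind of check, here is a minimal sketch that lists which words in a prompt map to exactly one token. The helper names (`toy_tokenize`, `tokens_per_word`, `single_token_words`) and the tiny `TOY_VOCAB` are hypothetical and only there so the example runs on its own; they are not the Parler-TTS tokenizer or its vocabulary. With a real Hugging Face tokenizer you would presumably pass `tokenizer.tokenize` instead, keeping in mind that tokenizing word by word can differ slightly from tokenizing the full sentence.

```python
from typing import Callable, Dict, List

# Hypothetical toy vocabulary, for illustration only.
# It is NOT the vocabulary of the tokenizer used by Parler-TTS.
TOY_VOCAB = {"google", "website", "niche", "gr", "eat", "fin", "d"}


def toy_tokenize(word: str) -> List[str]:
    """Greedy longest-match split over TOY_VOCAB, falling back to single characters."""
    pieces: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def tokens_per_word(text: str, tokenize: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Map each lowercased word (basic punctuation stripped) to its token pieces."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return {w: tokenize(w) for w in words if w}


def single_token_words(text: str, tokenize: Callable[[str], List[str]]) -> List[str]:
    """Return the words that the given tokenizer keeps as one whole token."""
    return [w for w, pieces in tokens_per_word(text, tokenize).items() if len(pieces) == 1]


prompt = "google is a great website to let you find your niche."
for word, pieces in tokens_per_word(prompt, toy_tokenize).items():
    print(f"{word:10s} -> {pieces}")
print("single-token words:", single_token_words(prompt, toy_tokenize))
```

Words that show up as a single token are the ones whose pronunciation the model can only learn if that exact token appeared often enough in the training data, which is the failure mode described above.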

With the current model architecture, I'm afraid it's not something we can solve easily.

bxclib2 commented 1 month ago

This is a huge issue. Can we train another base model?

seanphan commented 1 month ago

Today I wanted to share some quick facts about Union Square in NYC. Did you know Union Square Park was once a cemetery? It was a potter’s field in the 1800s, but now it’s a vibrant public space hosting events, protests and the holiday market. The original Union Square Green Market started here in 1976 and kicked off the farmer’s market movement.

Same issue here: "NYC" and "1800s" are not read correctly.

Amazing model. Could you please advise how to improve this?

huggingface/parler-tts, issue #88