Open bxclib2 opened 2 months ago
That's an issue with:
As you can see, google, website and niche each correspond to a single token. I'd say the model is just trying to pronounce tokens it has never seen before.
With the current model architecture, I'm afraid it's not something we can solve easily.
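To check which words in a prompt end up as a single token, something like the sketch below could be used. It is only an illustration: `single_token_words` and `toy_tokenize` are hypothetical helpers, and `TOY_VOCAB` is a made-up vocabulary, not the model's real one. With a Hugging Face tokenizer, passing `tokenizer.tokenize` as the first argument should work the same way.

```python
from typing import Callable, List


def single_token_words(tokenize: Callable[[str], List[str]], text: str) -> List[str]:
    """Return the words in `text` that the tokenizer keeps as exactly one token."""
    words = [w.strip(".,!?\"'") for w in text.split()]
    return [w for w in words if w and len(tokenize(w)) == 1]


# Hypothetical stand-in vocabulary, NOT the vocabulary of the real model.
TOY_VOCAB = {"google", "website", "niche", "is", "a", "to", "you"}


def toy_tokenize(word: str) -> List[str]:
    """Toy tokenizer: known words stay whole, unknown words fall back to characters."""
    word = word.lower()
    return [word] if word in TOY_VOCAB else list(word)


prompt = "google is a great website to let you find your niche."
print(single_token_words(toy_tokenize, prompt))
```

Words reported by this check are the ones the model sees as one opaque unit rather than as smaller pieces.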
This is a huge issue. Can we train another base model?
Today I wanted to share some quick facts about Union Square in NYC. Did you know Union Square Park was once a cemetery? It was a potter’s field in the 1800s, but now it’s a vibrant public space hosting events, protests and the holiday market. The original Union Square Green Market started here in 1976 and kicked off the farmer’s market movement.
Same issue here: "NYC" and "1800s" can't be read correctly.
Amazing model. Please advise on how to improve these cases.
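One workaround that may help, though nobody in this thread has confirmed it for this model, is to spell out abbreviations and years before the prompt reaches the tokenizer. A minimal sketch, assuming a hand-written replacement table; `SPOKEN_FORMS`, `two_digits`, `year_to_words` and `normalize_for_tts` are all hypothetical names, and only years from 1100 to 1999 are handled:

```python
import re

# Hypothetical replacement table; extend it for your own prompts.
SPOKEN_FORMS = {"NYC": "New York City"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def year_to_words(year: int) -> str:
    """Spell a year between 1100 and 1999 the way it is usually read aloud."""
    high, low = divmod(year, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def normalize_for_tts(text: str) -> str:
    """Replace known abbreviations, century plurals like 1800s, and plain years."""
    for short, spoken in SPOKEN_FORMS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", spoken, text)
    # Century plurals such as "1800s" become "eighteen hundreds".
    text = re.sub(r"\b(1[1-9])00s\b",
                  lambda m: f"{two_digits(int(m.group(1)))} hundreds", text)
    # Plain years such as "1976"; decade forms like "1970s" are left untouched.
    text = re.sub(r"\b(1[1-9]\d{2})\b",
                  lambda m: year_to_words(int(m.group(1))), text)
    return text


print(normalize_for_tts("It was a potter's field in the 1800s in NYC."))
```

The normalized string would then be passed as `prompt` in place of the raw text. Whether this actually improves pronunciation depends on the model and is untested here.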
import torch
import soundfile as sf

# `model`, `tokenizer`, and `device` are assumed to be loaded already.
prompt = "google is a great website to let you find your niche."
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids).to(torch.float32)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
I find that the words "Google", "website", and "niche" are usually not pronounced correctly, even though these words are not very rare.
https://github.com/user-attachments/assets/874d02dc-f4c3-432e-8731-e0dd17d442f9