netease-youdao / EmotiVoice

EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine
Apache License 2.0

Command-line inference fails with a tensor dimension mismatch #138

Closed aidway closed 3 months ago

aidway commented 3 months ago

python tts.py \
    --speaker 8051 \
    --text '1个平台,1套AI应用建设流程及1个AI专业团队,并在我司典型的大模型应用场景完成孵化试点支持IT智能客服、400通话质检、文本摘要、图文转换、代码生成、会议纪要等多个场景的智能应用。显著提升了办公效率,助力企业实现数智化转型。' \
    --audio_name my_audio2

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.604 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
  File "tts.py", line 201, in <module>
    emotivoice_tts(speaker, text, audio_name + "." + audio_type)
  File "tts.py", line 184, in emotivoice_tts
    data = tts(speaker, text, prompt, text, speaker, models)
  File "tts.py", line 129, in tts
    content_embedding = get_style_embedding(content, tokenizer, style_encoder)
  File "tts.py", line 116, in get_style_embedding
    output = style_encoder(
  File "/root/anaconda3/envs/meta_human/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/aidb/code/meta_human/tts/EmotiVoice-main/models/prompt_tts_modified/simbert.py", line 49, in forward
    outputs = self.bert(
  File "/root/anaconda3/envs/meta_human/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/meta_human/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 1006, in forward
    embedding_output = self.embeddings(
  File "/root/anaconda3/envs/meta_human/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/meta_human/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 238, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (536) must match the size of tensor b (512) at non-singleton dimension 1

The code is based on demo_page.py with a few modifications:

def emotivoice_tts(speaker, text, filename, prompt='开心', lang='zh_us'):
    text = g2p_cn_en(text, g2p, lexicon)
    data = tts(speaker, text, prompt, text, speaker, models)
    sample_rate = config.sampling_rate
    write('/aidb/code/meta_human/tts/EmotiVoice-main/tts_output/' + filename, sample_rate, data.astype(np.int16))

if __name__ == '__main__':
    # 1. get params
    speaker, text, audio_name, audio_type, rate, volume = get_params()

    # 2. tts
    emotivoice_tts(speaker, text, audio_name + "." + audio_type)

aidway commented 3 months ago

Resolved. I was passing the wrong arguments.
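
For anyone who lands here with the same RuntimeError: the 536-vs-512 mismatch is raised where BERT adds its position embeddings, which suggests the tokenized input (536 ids) was longer than the position table (512 entries). The thread does not say which argument was wrong, so the following is only a minimal sketch of a length guard, not the fix applied here. `MAX_POSITIONS`, `fits_position_table` and `truncate_token_ids` are hypothetical names, not part of EmotiVoice or transformers, and the token ids are dummy placeholders.

```python
from typing import List

# Assumption: the position table has 512 entries, read off "tensor b (512)"
# in the error message above.
MAX_POSITIONS = 512


def fits_position_table(token_ids: List[int], max_positions: int = MAX_POSITIONS) -> bool:
    """Return True if the sequence is short enough for the position table."""
    return len(token_ids) <= max_positions


def truncate_token_ids(token_ids: List[int], max_positions: int = MAX_POSITIONS) -> List[int]:
    """Shorten a sequence to max_positions ids, keeping the final id.

    The final id is kept because it is usually the closing special token.
    Sequences that already fit are returned unchanged (as a copy).
    """
    if max_positions < 2:
        raise ValueError("max_positions must be at least 2")
    if len(token_ids) <= max_positions:
        return list(token_ids)
    return list(token_ids[: max_positions - 1]) + [token_ids[-1]]


# Dummy sequence of 536 ids, the length reported as "tensor a" in the error.
# 101 and 102 stand in for the opening and closing special tokens.
ids = [101] + list(range(1000, 1534)) + [102]
truncated = truncate_token_ids(ids)

if not fits_position_table(ids):
    print(f"input has {len(ids)} ids, limit is {MAX_POSITIONS}; truncated to {len(truncated)}")
```

With a Hugging Face tokenizer, passing `truncation=True, max_length=512` when encoding has a similar effect. Truncation silently drops the tail of the input, so splitting long text into shorter sentences before synthesis may be the better choice.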