jark006 / SummerTTS_VS

SummerTTS is a standalone C++ speech synthesis (TTS) project for Chinese and English. It runs locally without a network connection, has no additional dependencies, and is ready for Chinese and English speech synthesis after a one-step build.

Crashes with an error on certain Chinese characters #3

Open nanfei01055 opened 3 weeks ago

nanfei01055 commented 3 weeks ago

For example, the code below synthesizes only the single character "啊", yet it crashes. I have not been able to work out the cause.

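#include <iostream>
#include <fstream>
#include <vector>   // std::vector (named in the using-declaration below)
#include <cstdint>  // int16_t, int32_t
#include <cstring>
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <windows.h>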
#include "SynthesizerTrn.h"
#include "utils.h"
#pragma comment(lib, "Winmm.lib")

using std::cout;
using std::cerr;
using std::endl;
using std::vector;

void convertAudioToWavBuf(
    char* toBuf,
    char* fromBuf,
    int totalAudioLen)
{
    char* header = toBuf;
    int byteRate = 16 * 16000 * 1 / 8;
    int totalDataLen = totalAudioLen + 36;
    int channels = 1;
    int  longSampleRate = 16000;

    header[0] = 'R'; // RIFF/WAVE header
    header[1] = 'I';
    header[2] = 'F';
    header[3] = 'F';
    header[4] = (char)(totalDataLen & 0xff);
    header[5] = (char)((totalDataLen >> 8) & 0xff);
    header[6] = (char)((totalDataLen >> 16) & 0xff);
    header[7] = (char)((totalDataLen >> 24) & 0xff);
    header[8] = 'W';
    header[9] = 'A';
    header[10] = 'V';
    header[11] = 'E';
    header[12] = 'f'; // 'fmt ' chunk
    header[13] = 'm';
    header[14] = 't';
    header[15] = ' ';
    header[16] = 16; // 4 bytes: size of 'fmt ' chunk
    header[17] = 0;
    header[18] = 0;
    header[19] = 0;
    header[20] = 1; // format = 1
    header[21] = 0;
    header[22] = (char)channels;
    header[23] = 0;
    header[24] = (char)(longSampleRate & 0xff);
    header[25] = (char)((longSampleRate >> 8) & 0xff);
    header[26] = (char)((longSampleRate >> 16) & 0xff);
    header[27] = (char)((longSampleRate >> 24) & 0xff);
    header[28] = (char)(byteRate & 0xff);
    header[29] = (char)((byteRate >> 8) & 0xff);
    header[30] = (char)((byteRate >> 16) & 0xff);
    header[31] = (char)((byteRate >> 24) & 0xff);
    header[32] = (char)(1 * 16 / 8); // block align
    header[33] = 0;
    header[34] = 16; // bits per sample
    header[35] = 0;
    header[36] = 'd';
    header[37] = 'a';
    header[38] = 't';
    header[39] = 'a';
    header[40] = (char)(totalAudioLen & 0xff);
    header[41] = (char)((totalAudioLen >> 8) & 0xff);
    header[42] = (char)((totalAudioLen >> 16) & 0xff);
    header[43] = (char)((totalAudioLen >> 24) & 0xff);

    memcpy(toBuf + 44, fromBuf, totalAudioLen);

}

int main(int argc, char** argv) {
    float* dataW = NULL;
    int32_t modelSize = ttsLoadModel((char*)"d:\\single_speaker_fast.bin", &dataW);

    SynthesizerTrn* synthesizer = new SynthesizerTrn(dataW, modelSize);
    int32_t spkNum = synthesizer->getSpeakerNum();

    int32_t retLen = 0;
    int16_t* wavData = synthesizer->infer("啊", 0, 1.0, retLen);

    char* dataForFile = new char[retLen * sizeof(int16_t) + 44];
    convertAudioToWavBuf(dataForFile, (char*)wavData, retLen * sizeof(int16_t));

    // Call the ANSI variant explicitly so the LPCSTR cast also compiles in UNICODE builds.
    PlaySoundA((LPCSTR)dataForFile, NULL, SND_MEMORY | SND_SYNC);

    FILE* fpOut = fopen("d:\\out.wav", "wb");
    if (fpOut != NULL) {  // skip writing if the file could not be opened
        fwrite(dataForFile, retLen * sizeof(int16_t) + 44, 1, fpOut);
        fclose(fpOut);
    }

    delete[] dataForFile;  // free the WAV buffer
    tts_free_data(wavData);  // free the synthesized audio
    delete synthesizer;  // free the synthesizer
    tts_free_data(dataW);  // free the model data

    return 0;
}

nanfei01055 commented 3 weeks ago

One more detail: I wrapped it for use from C#, and the error message printed to the console is: ERROR: StringFstToOutputLabels: Invalid start state. I hope this helps in resolving the issue.

jark006 commented 3 weeks ago

> One more detail: I wrapped it for use from C#, and the error message printed to the console is: ERROR: StringFstToOutputLabels: Invalid start state. I hope this helps in resolving the issue.

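I can't find the cause either. Several "啊" characters in a row fail as well, so this character probably goes wrong somewhere in the internal processing. When other valid characters are present, however, the audio is inferred normally. As a workaround, append a full stop "。" to the end of the string to avoid the runtime error during inference.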
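The workaround above could be wrapped in a small helper, sketched below. `ensureTrailingPeriod` is a hypothetical name and not part of SummerTTS; the sketch assumes the input text is UTF-8 encoded.

```cpp
#include <string>

// Hypothetical helper (not part of SummerTTS): appends the Chinese full stop
// "。" (U+3002, UTF-8 bytes E3 80 82) unless the text already ends with it.
// Assumes the input is UTF-8 encoded.
std::string ensureTrailingPeriod(const std::string& text) {
    static const std::string kPeriod = "\xE3\x80\x82";  // "。" in UTF-8
    const bool endsWithPeriod =
        text.size() >= kPeriod.size() &&
        text.compare(text.size() - kPeriod.size(), kPeriod.size(), kPeriod) == 0;
    return endsWithPeriod ? text : text + kPeriod;
}
```

Assuming `infer` accepts a `std::string`, the repro could then call `synthesizer->infer(ensureTrailingPeriod(text), 0, 1.0, retLen)`, where `text` holds the input string.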

jark006 commented 1 week ago

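The cause is in SynthesizerTrn::infer. During text preprocessing, the "啊" character is dropped when the input text (line) is converted to tagged_text. I don't yet know whether other characters are dropped too, and half a day of debugging did not reveal why. Based on what the original preprocessing finally produces (tnString), I wrote a replacement preprocessing function, TextSet::filterChineseAndNum, which simply keeps Chinese characters and digits and filters out everything else.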
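For illustration, a filter of this kind might look like the sketch below. It is not the project's actual TextSet::filterChineseAndNum. The name `filterChineseAndDigits`, the UTF-8 input assumption, and the choice of the CJK Unified Ideographs block (U+4E00 to U+9FFF) as the definition of "Chinese character" are all assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the idea behind TextSet::filterChineseAndNum;
// the project's real implementation may differ.
// Keeps ASCII digits and code points in U+4E00..U+9FFF, drops everything else.
std::string filterChineseAndDigits(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 0;
        if (b0 < 0x80) len = 1;                   // ASCII
        else if ((b0 & 0xE0) == 0xC0) len = 2;    // 2-byte sequence
        else if ((b0 & 0xF0) == 0xE0) len = 3;    // 3-byte sequence
        else if ((b0 & 0xF8) == 0xF0) len = 4;    // 4-byte sequence
        else { ++i; continue; }                   // stray or invalid byte
        if (i + len > utf8.size()) break;         // truncated sequence at the end

        if (len == 1) {
            if (b0 >= '0' && b0 <= '9') out.push_back(static_cast<char>(b0));
        } else if (len == 3) {
            const unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
            const unsigned char b2 = static_cast<unsigned char>(utf8[i + 2]);
            const bool validTail = (b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80;
            const std::uint32_t cp = (static_cast<std::uint32_t>(b0 & 0x0F) << 12) |
                                     (static_cast<std::uint32_t>(b1 & 0x3F) << 6) |
                                     static_cast<std::uint32_t>(b2 & 0x3F);
            if (validTail && cp >= 0x4E00 && cp <= 0x9FFF) out.append(utf8, i, 3);
        }
        // 2-byte and 4-byte sequences are skipped without further validation.
        i += len;
    }
    return out;
}
```

With this sketch, an input such as "a啊1，b2。好" is reduced to "啊12好": Latin letters and punctuation are dropped, while "啊" is kept.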

The "啊" character now works. I tested a few long passages and they seem fine, though I can't be sure because I have not tested extensively.

Details: Fix some Chinese characters not being voiced (not fully tested; results to be confirmed)