schreibfaul1 / ESP32-audioI2S

Play mp3 files from SD via I2S
GNU General Public License v3.0

Can you help me write the code to call Doubao TTS? #856

Open. Explorerlowi opened this issue 3 weeks ago

Explorerlowi commented 3 weeks ago

My programming skills are too weak for me to do this myself (灬ꈍ ꈍ灬). The relevant documentation is at https://www.volcengine.com/docs/6561/79823 and I am willing to pay for the work. Thank you very much if you can help!

Explorerlowi commented 1 week ago

The response to a Doubao TTS request is not a raw binary audio stream. The audio is stored, base64-encoded, in the data field of a JSON structure, so the binary audio can only be obtained after base64 decoding. How can it be played in this case?
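To make the decoding step concrete, here is a minimal sketch in portable C++ of pulling the base64 text out of the data field and turning it into raw bytes. It assumes the whole JSON body is already in memory as a string. The helper names extractDataField and decodeBase64 are hypothetical and are not part of ESP32-audioI2S; the field lookup is a naive text search rather than a real JSON parser.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Map one base64 character to its 6-bit value; returns -1 for anything else
// (padding, whitespace, quotes, backslashes, ...).
static int b64Value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Decode standard base64 text into raw bytes. Decoding stops at the first
// '=' padding character; other non-alphabet characters are skipped.
std::vector<uint8_t> decodeBase64(const std::string& text) {
    std::vector<uint8_t> out;
    out.reserve(text.size() / 4 * 3);
    uint32_t acc = 0;  // bit accumulator, only the low bits are meaningful
    int bits = 0;      // number of not-yet-emitted bits held in acc
    for (char c : text) {
        if (c == '=') break;
        const int v = b64Value(c);
        if (v < 0) continue;
        acc = (acc << 6) | static_cast<uint32_t>(v);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<uint8_t>((acc >> bits) & 0xFF));
        }
        assert(bits >= 0 && bits < 8);  // at most 6 bits are ever carried over
    }
    return out;
}

// Hypothetical helper: return the string value that follows the first
// "data" key in the text, or "" if none is found. This is a naive search
// and assumes the value is a quoted string without embedded quotes, which
// holds for base64 text.
std::string extractDataField(const std::string& json) {
    const std::string key = "\"data\"";
    std::size_t pos = json.find(key);
    if (pos == std::string::npos) return "";
    pos = json.find(':', pos + key.size());
    if (pos == std::string::npos) return "";
    const std::size_t start = json.find('"', pos + 1);
    if (start == std::string::npos) return "";
    const std::size_t end = json.find('"', start + 1);
    if (end == std::string::npos) return "";
    return json.substr(start + 1, end - start - 1);
}
```

Calling decodeBase64(extractDataField(body)) would yield the MP3 bytes, which could then be handed to whatever plays them. Holding the full response in RAM may not be practical on an ESP32 for longer utterances, so this is mainly useful for checking that the decoding itself works.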

Explorerlowi commented 1 week ago

bool Audio2::connectToDoubaoTTS(const char *text) {
xSemaphoreTakeRecursive(mutex_audio, portMAX_DELAY);

setDefaults();

const char *host = "openspeech.bytedance.com";
const char *api_url = "/api/v1/tts";

const char *appid = "82505*****";
const char *access_token = "WZCBgLbSd-ltw5gDeKvEYX9M******";
const char *cluster = "volcano_tts";
const char *voice_type = "BV001_streaming";

// Create JSON request
DynamicJsonDocument doc(1024); // Adjust size as necessary
JsonObject app = doc.createNestedObject("app");
app["appid"] = appid;
app["token"] = access_token;
app["cluster"] = cluster;

JsonObject user = doc.createNestedObject("user");
user["uid"] = "388808087185088";

JsonObject audio = doc.createNestedObject("audio");
audio["voice_type"] = voice_type;
audio["encoding"] = "mp3";
audio["speed_ratio"] = 1.0;
audio["volume_ratio"] = 1.0;
audio["pitch_ratio"] = 1.0;

JsonObject request = doc.createNestedObject("request");
request["reqid"] = String(uuid()); // Generate UUID
request["text"] = text;
request["text_type"] = "plain";
request["operation"] = "query";
request["with_frontend"] = 1;
request["frontend_type"] = "unitTson";

// Prepare JSON payload
String json_payload;
serializeJson(doc, json_payload);

// Connect to the server
_client = static_cast<WiFiClientSecure *>(&clientsecure);
if (!_client->connect(host, 443)) { // Use 443 for HTTPS
    log_e("Connection failed");
    xSemaphoreGiveRecursive(mutex_audio);
    return false;
}

// Create and send HTTP POST request
_client->println("POST " + String(api_url) + " HTTP/1.1");
_client->println("Host: " + String(host));
_client->println("Authorization: Bearer; " + String(access_token));
_client->println("Content-Type: application/json");
_client->println("Content-Length: " + String(json_payload.length()));
_client->println(); // End of headers

// Send JSON payload
_client->print(json_payload);

Serial.println(json_payload);
// Read the response
/*String response = "";
while (_client->connected() || _client->available()) {
    if (_client->available()) {
        char c = _client->read();
        response += c;
    }
}

// Process the response
if (response.indexOf("\"data\"") != -1) {
    // Parse the JSON response to get the data
    DynamicJsonDocument responseDoc(1024); // Adjust size as needed
    deserializeJson(responseDoc, response);
    const char* data = responseDoc["data"];
    // Here you would base64 decode the data and handle the audio
    // Remember to consider the necessary libraries or methods to handle audio output
} else {
    log_e("No data in response");
}

_client->stop();*/
m_streamType = ST_WEBFILE;
Serial.print("play speech: ");
Serial.println(m_streamType);
isplaying = 1;
m_f_running = true;
m_f_ssl = false;
m_f_tts = true;
setDatamode(HTTP_RESPONSE_HEADER);
xSemaphoreGiveRecursive(mutex_audio);
return true;

}

// Method to generate UUID (simple implementation)
String Audio2::uuid() {
    uint32_t uid = esp_random(); // Random number as a placeholder for UUID generation
    return String(uid, HEX);
}

This is my current code.
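The uuid() helper above returns a single 32-bit random value in hex, which its own comment calls a placeholder. If a full UUID-formatted request ID turns out to be needed, the following sketch shows one way to build a version 4 UUID string from 128 random bits. The function name makeUuidV4 and the injected random source are hypothetical; whether the service actually requires this format is not established in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical helper: build a version 4 (random) UUID string such as
// "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx" from a caller-supplied source of
// 32-bit random numbers.
std::string makeUuidV4(const std::function<uint32_t()>& rand32) {
    assert(static_cast<bool>(rand32));  // a random source must be provided

    // Collect 16 random bytes, four per call to the random source.
    uint8_t bytes[16];
    for (int i = 0; i < 16; i += 4) {
        const uint32_t r = rand32();
        bytes[i]     = static_cast<uint8_t>(r >> 24);
        bytes[i + 1] = static_cast<uint8_t>(r >> 16);
        bytes[i + 2] = static_cast<uint8_t>(r >> 8);
        bytes[i + 3] = static_cast<uint8_t>(r);
    }
    bytes[6] = static_cast<uint8_t>((bytes[6] & 0x0F) | 0x40);  // version 4
    bytes[8] = static_cast<uint8_t>((bytes[8] & 0x3F) | 0x80);  // RFC 4122 variant

    // Format as lowercase hex in 8-4-4-4-12 groups.
    static const char hex[] = "0123456789abcdef";
    std::string out;
    out.reserve(36);
    for (int i = 0; i < 16; ++i) {
        if (i == 4 || i == 6 || i == 8 || i == 10) out.push_back('-');
        out.push_back(hex[bytes[i] >> 4]);
        out.push_back(hex[bytes[i] & 0x0F]);
    }
    return out;
}
```

On the ESP32 the random source could be a lambda that returns esp_random(), the same call the code above already uses.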

schreibfaul1 commented 1 week ago

With a "normal audio stream", the data would be written to the buffer at this point (see image). InBuff.getWritePtr() is the pointer to the position at which the data is written, and bytesAddedToBuffer contains the number of bytes actually written. The conversion from base64 would then have to be done at this point.

You don't need to worry about the rest. If it is an MP3 stream, for example, the ID3 header is read automatically once the buffer is full enough, and the file is then played.
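Because network data arrives in chunks of arbitrary size, the base64 conversion at that point has to carry leftover bits from one chunk to the next. The sketch below shows one way this could look in portable C++. Base64StreamDecoder is a hypothetical class, not part of ESP32-audioI2S. It assumes the caller has already skipped past the opening quote of the data value and passes a destination pointer in the role that InBuff.getWritePtr() plays for a normal stream; the return value corresponds to bytesAddedToBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decodes base64 text that arrives in chunks of any
// size, keeping the undecoded leftover bits between calls.
class Base64StreamDecoder {
public:
    // Decode `len` characters from `in` and write the raw bytes to `out`.
    // `out` must have room for at least (len * 3) / 4 + 1 bytes.
    // Returns the number of decoded bytes written to `out`.
    std::size_t feed(const char* in, std::size_t len, uint8_t* out) {
        assert(in != nullptr && out != nullptr);
        std::size_t written = 0;
        for (std::size_t i = 0; i < len && !m_done; ++i) {
            const char c = in[i];
            // '=' padding or the closing quote of the JSON string ends the data.
            if (c == '=' || c == '"') {
                m_done = true;
                break;
            }
            const int v = value(c);
            if (v < 0) continue;  // skip whitespace and other stray characters
            m_acc = (m_acc << 6) | static_cast<uint32_t>(v);
            m_bits += 6;
            if (m_bits >= 8) {
                m_bits -= 8;
                out[written++] = static_cast<uint8_t>((m_acc >> m_bits) & 0xFF);
            }
        }
        return written;
    }

    // True once the end of the base64 text has been seen.
    bool finished() const { return m_done; }

    // Prepare the decoder for a new response.
    void reset() {
        m_acc = 0;
        m_bits = 0;
        m_done = false;
    }

private:
    // Map one base64 character to its 6-bit value, or -1 if it is not one.
    static int value(char c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;
    }

    uint32_t m_acc = 0;   // bit accumulator, only the low bits are meaningful
    int m_bits = 0;       // number of not-yet-emitted bits held in m_acc
    bool m_done = false;  // set when padding or the closing quote is reached
};
```

In this arrangement each received chunk would be passed to feed() with the buffer's write pointer as the destination, and only the returned byte count would be reported as added to the buffer. Since four base64 characters yield three bytes, the decoded output is always smaller than the input chunk.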

Explorerlowi commented 1 week ago

Can you teach me how to parse and play the returned audio stream after sending an HTTP(S) request to a TTS service? For example, which functions run after the content is returned, and how is the returned content processed? Right now I can only play Baidu TTS. When I send a request to Doubao TTS, Ali TTS, etc., I cannot parse and play the returned content correctly.