mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more models architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

Audio-to-text support for subtitling audio for media #1071

Open Dalethium opened 11 months ago

Dalethium commented 11 months ago

Is your feature request related to a problem? Please describe.

No

Describe the solution you'd like

For the v1/audio/transcriptions endpoint to support an option to return an SRT/timed subtitle file from audio input.
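As a sketch of what this could look like: OpenAI's transcription API accepts a `response_format` parameter whose values include `srt` and `vtt`, so mirroring that shape here would keep the endpoint a drop-in replacement. The server address, model name, and file below are placeholders, and the curl call is hypothetical until the feature exists:

```shell
# Hypothetical request once the feature exists -- mirrors OpenAI's
# transcription API, where response_format may be "srt" or "vtt";
# the server address, model name and file are placeholders:
#
#   curl http://localhost:8080/v1/audio/transcriptions \
#     -F model="whisper-1" \
#     -F file="@episode.mp3" \
#     -F response_format="srt"
#
# An SRT response body would look like this (cue text is illustrative):
srt_sample='1
00:00:00,000 --> 00:00:02,500
Welcome to the show.

2
00:00:02,500 --> 00:00:05,000
Today: filling in missing subtitles.'
printf '%s\n' "$srt_sample"
```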

Describe alternatives you've considered

OpenAI, other manual programs that load whisper.cpp, etc.

Additional context

This is effectively just for a local media server; I'm hoping I can fill in subtitles for some of the more obscure files programmatically.

localai-bot commented 11 months ago

:warning::warning::warning::warning::warning:

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but... I can also be funny or helpful :smilecat: and I can generally provide good tips, or point to places in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me, I don't support (yet) replying!

:warning::warning::warning::warning::warning:

I understand that you would like the v1/audio/transcriptions endpoint to optionally return an SRT/timed subtitle file from audio input. This is a great feature request and I can help you with it. To implement it, we need to modify the transcription endpoint so it can generate a subtitle file alongside the transcript, and add an optional parameter for the output format (e.g. SRT or VTT) to the API request. Here's a sample code snippet showing how the existing code could be modified to add this feature:

// NOTE: "whisper/cpp/transcription.h" and the Model/Segment types below are
// a hypothetical wrapper sketched for this example; adapt them to the actual
// whisper.cpp bindings used by the project.
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include "whisper/cpp/transcription.h"

// Format milliseconds as an SRT ("HH:MM:SS,mmm") or VTT ("HH:MM:SS.mmm") timestamp.
static std::string format_timestamp(long ms, bool srt) {
    std::ostringstream out;
    out << std::setfill('0')
        << std::setw(2) << ms / 3600000 << ':'
        << std::setw(2) << (ms / 60000) % 60 << ':'
        << std::setw(2) << (ms / 1000) % 60
        << (srt ? ',' : '.')
        << std::setw(3) << ms % 1000;
    return out.str();
}

int main(int argc, char** argv) {
    // Parse input parameters from the API request
    std::string model_name;
    std::string output_format = "srt"; // default to SRT format
    std::string audio_file_path;
    bool generate_subtitles = true;    // default to true
    for (int i = 1; i < argc; ++i) {
        const std::string arg = argv[i];
        if (arg.rfind("--model=", 0) == 0) {
            model_name = arg.substr(8);
        } else if (arg.rfind("--output-format=", 0) == 0) {
            output_format = arg.substr(16);
        } else if (arg.rfind("--audio-file=", 0) == 0) {
            audio_file_path = arg.substr(13);
        } else if (arg == "--no-subtitles") {
            generate_subtitles = false;
        }
    }
    if (output_format != "srt" && output_format != "vtt") {
        std::cerr << "Error: invalid output format: " << output_format << std::endl;
        return 1;
    }
    // Load the model (hypothetical API)
    whisper::cpp::transcription::Model model;
    if (!model.load(model_name)) {
        std::cerr << "Error: could not load model" << std::endl;
        return 1;
    }
    // Read the audio file in binary mode
    std::ifstream audio(audio_file_path, std::ios::binary);
    if (!audio) {
        std::cerr << "Error: could not open audio file" << std::endl;
        return 1;
    }
    std::stringstream buffer;
    buffer << audio.rdbuf();
    // Transcribe; assume the wrapper returns timed segments (each with
    // .text, .start_ms and .end_ms) rather than a flat string
    std::vector<whisper::cpp::transcription::Segment> segments =
        model.transcribe(buffer.str());
    // Print the plain transcript
    std::cout << "Transcript:" << std::endl;
    for (const auto& seg : segments) {
        std::cout << seg.text << std::endl;
    }
    // Generate the subtitle file if requested
    if (generate_subtitles) {
        const bool srt = (output_format == "srt");
        const std::string subtitle_file_path = "subtitles." + output_format;
        std::ofstream file(subtitle_file_path);
        if (!file) {
            std::cerr << "Error: could not open subtitle file" << std::endl;
            return 1;
        }
        if (!srt) {
            file << "WEBVTT\n\n"; // WebVTT files start with this header; SRT has none
        }
        int cue = 1;
        for (const auto& seg : segments) {
            if (srt) {
                file << cue++ << '\n'; // SRT cues are numbered sequentially
            }
            file << format_timestamp(seg.start_ms, srt) << " --> "
                 << format_timestamp(seg.end_ms, srt) << '\n'
                 << seg.text << "\n\n";
        }
        std::cout << "Subtitles written to " << subtitle_file_path << std::endl;
    }
    return 0;
}

This code snippet assumes a hypothetical whisper/cpp/transcription.h wrapper around whisper.cpp whose model returns timed segments; you will need to adapt it to the actual bindings in your project and to your specific needs and requirements. I hope this helps! Let me know if you have any further questions or issues.
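For completeness, the two subtitle formats differ only slightly at the container level. A quick illustration of WebVTT (the cue text here is made up):

```shell
# Minimal WebVTT file: unlike SRT, it opens with a "WEBVTT" header,
# cues need no sequence numbers, and milliseconds use '.' rather than ','.
vtt_sample='WEBVTT

00:00:00.000 --> 00:00:02.500
Welcome to the show.'
printf '%s\n' "$vtt_sample"
```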
