mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more models architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

Audio-to-text support for subtitling audio for media #1071

Open Dalethium opened 11 months ago

Dalethium commented 11 months ago

Is your feature request related to a problem? Please describe.

No

Describe the solution you'd like

For the v1/audio/transcriptions endpoint to support an option to return an SRT/timed subtitle file from audio input.
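As a sketch of what this could look like: OpenAI's transcription API accepts a `response_format` parameter whose values include `srt` and `vtt`, so mirroring that shape here would keep the endpoint a drop-in replacement. The server address, model name, and file below are placeholders, and the curl call is hypothetical until the feature exists:

```shell
# Hypothetical request once the feature exists -- mirrors OpenAI's
# transcription API, where response_format may be "srt" or "vtt";
# the server address, model name and file are placeholders:
#
#   curl http://localhost:8080/v1/audio/transcriptions \
#     -F model="whisper-1" \
#     -F file="@episode.mp3" \
#     -F response_format="srt"
#
# An SRT response body would look like this (cue text is illustrative):
srt_sample='1
00:00:00,000 --> 00:00:02,500
Welcome to the show.

2
00:00:02,500 --> 00:00:05,000
Today: filling in missing subtitles.'
printf '%s\n' "$srt_sample"
```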

Describe alternatives you've considered

OpenAI, other manual programs that load whisper.cpp, etc.

Additional context

This is effectively just for a local media server; I'm hoping I can fill in subtitles for some of the more obscure files programmatically.

localai-bot commented 11 months ago

:warning::warning::warning::warning::warning:

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but... I can also be funny or helpful :smilecat: and I can generally provide good tips, or point to places in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me, I don't support (yet) replying!

:warning::warning::warning::warning::warning:

I understand that you would like the v1/audio/transcriptions endpoint to optionally return an SRT/timed subtitle file from audio input. This is a great feature request and I can help you with it. To implement it, we need to modify the transcription endpoint so it can generate a subtitle file alongside the transcript, and add an optional parameter for the output format (e.g. SRT or VTT) to the API request. Here's a sample code snippet showing how the existing code could be modified to add this feature:

// NOTE: "whisper/cpp/transcription.h" and the Model/Segment types below are
// a hypothetical wrapper sketched for this example; adapt them to the actual
// whisper.cpp bindings used by the project.
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include "whisper/cpp/transcription.h"

// Format milliseconds as an SRT ("HH:MM:SS,mmm") or VTT ("HH:MM:SS.mmm") timestamp.
static std::string format_timestamp(long ms, bool srt) {
    std::ostringstream out;
    out << std::setfill('0')
        << std::setw(2) << ms / 3600000 << ':'
        << std::setw(2) << (ms / 60000) % 60 << ':'
        << std::setw(2) << (ms / 1000) % 60
        << (srt ? ',' : '.')
        << std::setw(3) << ms % 1000;
    return out.str();
}

int main(int argc, char** argv) {
    // Parse input parameters from the API request
    std::string model_name;
    std::string output_format = "srt"; // default to SRT format
    std::string audio_file_path;
    bool generate_subtitles = true;    // default to true
    for (int i = 1; i < argc; ++i) {
        const std::string arg = argv[i];
        if (arg.rfind("--model=", 0) == 0) {
            model_name = arg.substr(8);
        } else if (arg.rfind("--output-format=", 0) == 0) {
            output_format = arg.substr(16);
        } else if (arg.rfind("--audio-file=", 0) == 0) {
            audio_file_path = arg.substr(13);
        } else if (arg == "--no-subtitles") {
            generate_subtitles = false;
        }
    }
    if (output_format != "srt" && output_format != "vtt") {
        std::cerr << "Error: invalid output format: " << output_format << std::endl;
        return 1;
    }
    // Load the model (hypothetical API)
    whisper::cpp::transcription::Model model;
    if (!model.load(model_name)) {
        std::cerr << "Error: could not load model" << std::endl;
        return 1;
    }
    // Read the audio file in binary mode
    std::ifstream audio(audio_file_path, std::ios::binary);
    if (!audio) {
        std::cerr << "Error: could not open audio file" << std::endl;
        return 1;
    }
    std::stringstream buffer;
    buffer << audio.rdbuf();
    // Transcribe; assume the wrapper returns timed segments (each with
    // .text, .start_ms and .end_ms) rather than a flat string
    std::vector<whisper::cpp::transcription::Segment> segments =
        model.transcribe(buffer.str());
    // Print the plain transcript
    std::cout << "Transcript:" << std::endl;
    for (const auto& seg : segments) {
        std::cout << seg.text << std::endl;
    }
    // Generate the subtitle file if requested
    if (generate_subtitles) {
        const bool srt = (output_format == "srt");
        const std::string subtitle_file_path = "subtitles." + output_format;
        std::ofstream file(subtitle_file_path);
        if (!file) {
            std::cerr << "Error: could not open subtitle file" << std::endl;
            return 1;
        }
        if (!srt) {
            file << "WEBVTT\n\n"; // WebVTT files start with this header; SRT has none
        }
        int cue = 1;
        for (const auto& seg : segments) {
            if (srt) {
                file << cue++ << '\n'; // SRT cues are numbered sequentially
            }
            file << format_timestamp(seg.start_ms, srt) << " --> "
                 << format_timestamp(seg.end_ms, srt) << '\n'
                 << seg.text << "\n\n";
        }
        std::cout << "Subtitles written to " << subtitle_file_path << std::endl;
    }
    return 0;
}

This code snippet assumes a hypothetical whisper/cpp/transcription.h wrapper around whisper.cpp whose model returns timed segments; you will need to adapt it to the actual bindings in your project and to your specific needs and requirements. I hope this helps! Let me know if you have any further questions or issues.
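For completeness, the two subtitle formats differ only slightly at the container level. A quick illustration of WebVTT (the cue text here is made up):

```shell
# Minimal WebVTT file: unlike SRT, it opens with a "WEBVTT" header,
# cues need no sequence numbers, and milliseconds use '.' rather than ','.
vtt_sample='WEBVTT

00:00:00.000 --> 00:00:02.500
Welcome to the show.'
printf '%s\n' "$vtt_sample"
```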
