Originally posted by **kodychik** October 16, 2024
### Checked
- [X] I searched existing ideas and did not find a similar one
- [X] I added a very descriptive title
- [X] I've clearly described the feature request and motivation for it
# Feature request
We (a team of CS students at the University of Toronto) propose adding voice input support to LangChain's Ollama models.
# Motivation
LangChain's Ollama integration gives access to a wide range of strong open models, but those models currently accept only text input. This limitation restricts their use in voice-enabled applications such as virtual assistants, voice-controlled systems, and accessibility tools. This enhancement would let developers build applications that process spoken language, expanding the ways users can interact with LangChain-powered systems.
# Proposal
## Feasibility Analysis
This is feasible; it involves:
- Speech-to-Text Conversion: Using a speech recognition engine to transcribe voice inputs into text that the language model can process.
- Integration with Existing Pipelines: Modifying or extending existing chains to include a speech-to-text (STT) component before the input reaches the LLM.
- Modular Implementation: Leveraging LangChain's modular architecture to add this functionality without significant changes to existing code.
## Outline of Changes
### Existing Architecture Overview
LangChain's architecture consists of:
- LLMs (Language Models): Interfaces to language models via Ollama.
- Chains: Sequences of components (e.g., prompt templates, LLMs) that process inputs and generate outputs.
- Agents: Systems that use LLMs to perform tasks by making decisions and possibly interacting with tools.
- Retrievers and VectorStores: Components used in Retrieval-Augmented Generation (RAG) pipelines to fetch relevant information.
## Proposed Solution
Introduce a Speech-to-Text Component that converts voice inputs into text and integrates seamlessly with existing LangChain chains and agents.
1. User Interaction: The user provides voice input via a microphone.
2. Speech-to-Text Conversion: The STT component transcribes the voice input into text.
3. Text Processing: The transcribed text is passed to existing LangChain chains or agents.
4. LLM Response: The LLM generates a response based on the input text.
5. Output Delivery: The response is delivered to the user, either as text or converted back to speech (a hedged text-to-speech sketch follows this list).
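Step 5 leaves open the option of converting the response back to speech. A minimal sketch of that optional delivery step, assuming the third-party `pyttsx3` package for offline text-to-speech (the package choice and the `speak` helper are illustrative assumptions, not part of the proposal):

```
import pyttsx3  # assumed third-party TTS package, not a LangChain dependency


def speak(text: str) -> None:
    """Read the LLM's text response aloud with the system TTS voice."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```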
## Files to Modify and Create
New Files:
- speech_to_text.py: Implements the SpeechToTextConverter class.
- voice_input_chain.py: Implements the VoiceInputChain class.
Files to Modify:
- None; existing chains and agents consume the text produced by the STT component.
## Potential for Innovation
- The transcribed speech can first be given to the language model for a prompt-engineering pass; the restructured prompt is then sent through the Ollama model chain via LangChain to generate the response. This guards against the unstructured, rambling prompts that raw speech input tends to produce. A hedged sketch of this optional step follows.
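A minimal sketch of that optional prompt-restructuring pass, assuming the `Ollama` LLM wrapper used elsewhere in this proposal and an illustrative rewrite prompt (the prompt text and the `restructure_prompt` helper are assumptions):

```
from langchain.llms import Ollama
from langchain.prompts import PromptTemplate

# Hypothetical rewrite prompt: turn rambling transcribed speech into a concise instruction.
REWRITE_TEMPLATE = PromptTemplate.from_template(
    "Rewrite the following transcribed speech as a clear, concise instruction:\n\n{transcript}"
)


def restructure_prompt(llm: Ollama, transcript: str) -> str:
    """Assumed helper: ask the model to clean up a raw speech transcript."""
    return llm.invoke(REWRITE_TEMPLATE.format(transcript=transcript))
```

The restructured instruction, rather than the raw transcript, would then be passed to the downstream Ollama model chain.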
## New Classes and Components
1. SpeechToTextConverter Class
   - Purpose: Converts voice input into text using a speech recognition engine.
   - Key Methods:
     - `__init__(engine='whisper', **kwargs)`: Initializes the speech recognition engine.
     - `convert(audio_input) -> str`: Converts audio input to text.
2. VoiceInputChain Class
   - Purpose: A chain that processes voice inputs by running the STT component and passing the resulting text to the LLM.
   - Key Methods:
     - `__init__(stt_converter, llm_chain)`: Initializes with an STT converter and an existing LLM chain.
     - `run(audio_input) -> str`: Processes the audio input through the STT converter and LLM chain.
## Pseudocode Implementation
```
# speech_to_text.py
import whisper  # OpenAI's open-source Whisper package


class SpeechToTextConverter:
    def __init__(self, engine='whisper', **kwargs):
        if engine == 'whisper':
            # Initialize the Whisper model (e.g., model_size='base')
            self.model = whisper.load_model(kwargs.get('model_size', 'base'))
        else:
            raise NotImplementedError("Only the 'whisper' engine is currently supported.")

    def convert(self, audio_input) -> str:
        # Transcribe the audio; Whisper returns a dict with a 'text' field
        result = self.model.transcribe(audio_input)
        return result["text"]


# voice_input_chain.py
from langchain.chains.base import Chain


class VoiceInputChain(Chain):
    # NOTE: a full Chain subclass also needs input_keys, output_keys and _call;
    # a conformant sketch is shown after this block.
    def __init__(self, stt_converter, llm_chain):
        self.stt_converter = stt_converter
        self.llm_chain = llm_chain

    def run(self, audio_input) -> str:
        # Step 1: Convert the voice input to text
        text_input = self.stt_converter.convert(audio_input)
        # Step 2: Pass the transcribed text to the LLM chain
        response = self.llm_chain.run(text_input)
        return response
```
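If `VoiceInputChain` is meant to plug into LangChain's `Chain` base class, as the pseudocode suggests, it also needs `input_keys`, `output_keys`, and a `_call` method. A minimal sketch of what that could look like; the field declarations and the `audio_input`/`text` key names are illustrative assumptions:

```
from typing import Any, Dict, List, Optional

from langchain.callbacks.manager import CallbackManagerForChainRun
from langchain.chains.base import Chain


class VoiceInputChain(Chain):
    # Declared as fields because Chain is a pydantic model.
    stt_converter: Any
    llm_chain: Any

    @property
    def input_keys(self) -> List[str]:
        return ["audio_input"]

    @property
    def output_keys(self) -> List[str]:
        return ["text"]

    def _call(
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[CallbackManagerForChainRun] = None,
    ) -> Dict[str, Any]:
        transcript = self.stt_converter.convert(inputs["audio_input"])
        return {"text": self.llm_chain.run(transcript)}
```

With a single input and output key, `voice_chain.run('path/to/audio.wav')` still works through the standard Chain interface.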
## Implementation Steps
1. Develop the Speech-to-Text Component
   - Implement the SpeechToTextConverter class.
   - Use OpenAI's Whisper model or another suitable STT engine.
   - Allow for future expansion to support other engines.
2. Create the Voice Input Chain
   - Implement the VoiceInputChain class.
   - Integrate the STT converter with an existing LLM chain.
3. Testing
   - Write unit tests for the new components (a minimal pytest sketch follows this list).
   - Test with various audio inputs to ensure accurate transcription and appropriate LLM responses.
4. Documentation
   - Document new classes, methods, and usage examples.
   - Provide guidelines on setting up dependencies and handling potential issues.
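A minimal pytest sketch for step 3, using stand-in stubs for the STT converter and the LLM chain (the stub classes, the fixed transcript, and the test file name are assumptions made for illustration):

```
# test_voice_input_chain.py
from voice_input_chain import VoiceInputChain


class FakeSTTConverter:
    """Stand-in STT converter that returns a fixed transcript."""

    def convert(self, audio_input) -> str:
        return "what is the capital of France"


class FakeLLMChain:
    """Stand-in LLM chain that echoes the prompt it receives."""

    def run(self, text: str) -> str:
        return f"LLM saw: {text}"


def test_voice_chain_passes_transcript_to_llm():
    chain = VoiceInputChain(stt_converter=FakeSTTConverter(), llm_chain=FakeLLMChain())
    response = chain.run("dummy.wav")
    assert "capital of France" in response
```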
## Example Usage
```
# Import the necessary modules
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

from speech_to_text import SpeechToTextConverter
from voice_input_chain import VoiceInputChain

# Initialize the speech-to-text converter
stt_converter = SpeechToTextConverter(engine='whisper', model_size='base')

# Initialize the LLM chain with Llama 3.1 via Ollama
llm = Ollama(model='llama3.1')
prompt = PromptTemplate.from_template("{question}")
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Create the voice input chain
voice_chain = VoiceInputChain(stt_converter=stt_converter, llm_chain=llm_chain)

# Use the chain with an audio file or audio stream
audio_input = 'path/to/audio.wav'  # Can be a file path or audio data
response = voice_chain.run(audio_input)

# Output the LLM's response
print(response)
```
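For live microphone input rather than a pre-recorded file, one option is to capture an utterance first, write it to a temporary WAV file, and hand that path to the chain. A minimal sketch, assuming the third-party SpeechRecognition package (`speech_recognition`) for capture; the package choice and the `capture_to_wav` helper are assumptions:

```
import tempfile

import speech_recognition as sr  # assumed third-party package for microphone capture


def capture_to_wav() -> str:
    """Record one utterance from the default microphone and return a WAV file path."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio.get_wav_data())
        return f.name


# Feed the captured audio into the voice chain from the example above
response = voice_chain.run(capture_to_wav())
print(response)
```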
# Final Remarks
By implementing this feature:
- We address the growing demand for voice-enabled applications.
- LangChain becomes more versatile, appealing to a broader developer audience.
- The modular design ensures maintainability and ease of future enhancements.

Discussed in https://github.com/langchain-ai/langchain/discussions/27404