hackzilla / SpeechRecognition

A simple yet powerful SwiftUI app for iOS that demonstrates speech recognition and text-to-speech synthesis in Swift. The app listens to spoken English, displays the transcription, lets the user copy the text, and can read it back aloud with speech synthesis.
https://www.hackzilla.org
MIT License

ISSUE: Using TextEditor for Text Display #1

Open · grabani opened this issue 2 months ago

grabani commented 2 months ago

Hi Daniel,

I really hope you can spare some time to help me. I have spent an embarrassingly large amount of time trying to hack your code so that it will work with the TextEditor view.

I am able to successfully get spoken words displayed as text in the TextEditor view. I managed this even after setting recognitionRequest.shouldReportPartialResults = true in Recorder.swift so that the text appears in near real time.

The issue I have is that, when I pause speaking (for a couple of seconds or so), the TextEditor view appears to clear all of its contents, and when I resume talking it starts to display my speech text again. I have tried many code permutations to get it to work but have failed miserably.

Can you please help?

My current ContentView.swift code is:

import SwiftUI

struct ContentView: View {
    @ObservedObject private var recorder = Recorder()
    @ObservedObject private var speechManager = SpeechManager()

    @State private var consoleText: String = ""  // Reintroduce the consoleText state variable
    @State private var circleColor: Color = .black

    @Environment(\.colorScheme) var colorScheme

    private let userDefaultsKey = "RecognizedText"

    init() {
        // Clear the RecognizedText key to ensure it's empty when the app starts
        UserDefaults.standard.removeObject(forKey: userDefaultsKey)
        // Initialize consoleText with the empty state or stored value from UserDefaults
        _consoleText = State(initialValue: UserDefaults.standard.string(forKey: userDefaultsKey) ?? "")
    }

    var body: some View {
        GeometryReader { geometry in
            VStack {
                ScrollView {
                    TextEditor(text: $consoleText)
                        .font(.system(.body, design: .monospaced))
                        .padding()
                        .background(Color(UIColor.systemBackground))
                        .cornerRadius(8)
                        .overlay(
                            RoundedRectangle(cornerRadius: 8)
                                .stroke(Color.gray, lineWidth: 1)
                        )
                        .frame(minHeight: 200, maxHeight: .infinity)
                        .frame(width: geometry.size.width * 0.9)  // Set width to 90% of the available width
                        .padding(.leading, geometry.size.width * 0.05)  // Indent from the left side
                }
                .frame(height: geometry.size.height * 0.7)

                HStack {
                    Circle()
                        .fill(circleColor)
                        .frame(width: 10, height: 10)
                        .padding()
                    Button(action: {
                        if (!recorder.isRecording) {
                            circleColor = .red
                            recorder.startRecording()
                            self.recorder.setPlayAndRecord()
                        } else {
                            circleColor = .black
                            recorder.stopRecording()
                            self.recorder.setPlayback()
                        }
                    }) {
                        Text(!recorder.isRecording ? "Start Listening" : "Stop Listening")
                            .foregroundColor(colorScheme == .light ? Color.white : Color.black)
                            .padding()
                            .background(
                                (recorder.hasMicrophoneAccess && recorder.isSpeechRecognizerAvailable) ?
                                Color.primary :
                                    Color.gray.opacity(0.6)
                            )
                            .overlay(
                                RoundedRectangle(cornerRadius: 8)
                                    .stroke(colorScheme == .dark ? Color.white.opacity(0.2) : Color.black.opacity(0.2), lineWidth: 1)
                            )
                            .cornerRadius(10)
                    }
                    .contentShape(Rectangle())
                    .disabled(
                        !recorder.hasMicrophoneAccess
                        || !recorder.isSpeechRecognizerAvailable
                    )
                    Button(action: {
                        // Clear UserDefaults when clearing the session
                        let clearText = "Session started \(formattedDate())\n\n"
                        UserDefaults.standard.set(clearText, forKey: userDefaultsKey)
                        consoleText = clearText
                    }) {
                        Text("Clear")
                            .foregroundColor(colorScheme == .light ? Color.white : Color.black)
                            .padding()
                            .background(Color.primary)
                            .overlay(
                                RoundedRectangle(cornerRadius: 8)
                                    .stroke(colorScheme == .dark ? Color.white.opacity(0.2) : Color.black.opacity(0.2), lineWidth: 1)
                            )
                            .cornerRadius(10)
                    }
                    .contentShape(Rectangle())
                }
                .frame(width: geometry.size.width * 0.9)  // Set width to 90% of the screen width
                Spacer()
            }
            .padding(.top, 75)
        }
        .onAppear {
            self.recorder.onRecognisedText = { [self] text in
                if text.isEmpty {
                    return
                }

                // Only store and display the final version of the recognized text
                // By replacing the previous content with the new one
                consoleText = text
                UserDefaults.standard.set(consoleText, forKey: userDefaultsKey)
            }

            self.speechManager.onFinishSpeaking = {
                self.recorder.setPlayAndRecord()
            }

            self.recorder.requestPermission()
        }
        .alert(isPresented: $recorder.showAlert) {
             Alert(title: Text(recorder.alertTitle), message: Text(recorder.alertMessage), dismissButton: .default(Text("OK")))
         }
    }
}

struct ContentView_Previews: PreviewProvider {
    static var previews: some View {
        ContentView()
    }
}

func formattedDate() -> String {
    let formatter = DateFormatter()
    formatter.dateStyle = .medium
    formatter.timeStyle = .short
    return formatter.string(from: Date())
}
hackzilla commented 2 months ago

@grabani The issue you have is that you are discarding the previous text. In my version, text was the current line spoken.

e.g. hello my name is Dan

By enabling partial recognition, you will get the line as it currently is.

e.g.

  1. hello
  2. hello my
  3. hello my name
  4. hello my name is
  5. hello my name is Dan

This gives the illusion that it is adding an extra word each time, but in reality it is clearing it on each call. You can see this if you partially revert back to my code:

        .onAppear {
            self.recorder.onRecognisedText = { [self] text in
                if text.isEmpty {
                    return
                }

                // Append each recognised line to the previous content
                consoleText = consoleText + "\n" + text
                UserDefaults.standard.set(consoleText, forKey: userDefaultsKey)
            }

I would suggest keeping track of whether the text is final or not and appending it to the previous final text.

if let result = result {
    self.onRecognisedText?(result.bestTranscription.formattedString, result.isFinal)
    print("Recognition: \(result.bestTranscription.formattedString)")
} else if let error = error {
    print("Error during recognition: \(error.localizedDescription)")
}
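
A minimal sketch of that idea on the ContentView side (the finalisedText property is just illustrative, and this assumes the two-argument closure above) might look something like this:

// finalisedText is a new, illustrative @State property that holds everything already confirmed as final.
@State private var finalisedText: String = ""

// ...inside .onAppear:
self.recorder.onRecognisedText = { [self] text, isFinal in
    if text.isEmpty {
        return
    }

    if isFinal {
        // Fold the finished utterance into the committed text.
        finalisedText += (finalisedText.isEmpty ? "" : "\n") + text
        consoleText = finalisedText
    } else {
        // Show the committed text plus the in-progress partial on its own line.
        consoleText = finalisedText.isEmpty ? text : finalisedText + "\n" + text
    }

    UserDefaults.standard.set(consoleText, forKey: userDefaultsKey)
}

With something along those lines, each partial result only rewrites the trailing line, while everything already marked final stays in the editor.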
grabani commented 2 months ago

Hi Daniel,

Thank you for taking the time to respond. I was unable to resolve my issue. If you could kindly continue to support me with this issue, I would be grateful.

Based on your feedback I updated the Recorder.swift file with the following:

if let result = result {
    // Pass both the recognized text and the isFinal flag to the closure
    self.onRecognisedText?(result.bestTranscription.formattedString, result.isFinal)
    print("Recognition: \(result.bestTranscription.formattedString), Final: \(result.isFinal)")
} else if let error = error {
    // Handle any errors here
    print("Error during recognition: \(error.localizedDescription)")
}

However, as before: after launching the app and tapping "Start Listening", transcribed text appears on the screen in real time (because recognitionRequest.shouldReportPartialResults = true is enabled). If I then pause for around two seconds and continue talking, the TextEditor view is cleared and only the newly spoken words are transcribed on the screen.

Please find below my latest ContentView.swift and Recorder.swift code:

ContentView.swift

import SwiftUI

struct ContentView: View {
    @ObservedObject private var recorder = Recorder()
    @ObservedObject private var speechManager = SpeechManager()

    @State private var consoleText: String = ""  // Reintroduce the consoleText state variable
    @State private var circleColor: Color = .black

    @Environment(\.colorScheme) var colorScheme

    private let userDefaultsKey = "RecognizedText"

    init() {
        // Clear the RecognizedText key to ensure it's empty when the app starts
        UserDefaults.standard.removeObject(forKey: userDefaultsKey)
        if let savedText = UserDefaults.standard.string(forKey: userDefaultsKey) {
            print("UserDefaults content for key '\(userDefaultsKey)': \(savedText)")
        } else {
            print("No content found in UserDefaults at INIT for key '\(userDefaultsKey)'.")
        }
        // Initialize consoleText with the empty state or stored value from UserDefaults
        _consoleText = State(initialValue: UserDefaults.standard.string(forKey: userDefaultsKey) ?? "")
    }

    func logUserDefaultsContents(){
        if let savedText = UserDefaults.standard.string(forKey: userDefaultsKey) {
            print("UserDefaults content for key '\(userDefaultsKey)': \(savedText)")
        } else {
            print("No content found in UserDefaults for key '\(userDefaultsKey)'.")
        }
    }

    var body: some View {
        GeometryReader { geometry in
            VStack {
                ScrollView {
                    TextEditor(text: $consoleText)
                        .font(.system(.body, design: .monospaced))
                        .padding()
                        .background(Color(UIColor.systemBackground))
                        .cornerRadius(8)
                        .overlay(
                            RoundedRectangle(cornerRadius: 8)
                                .stroke(Color.gray, lineWidth: 1)
                        )
                        .frame(minHeight: 200, maxHeight: .infinity)
                        .frame(width: geometry.size.width * 0.9)  // Set width to 90% of the available width
                        .padding(.leading, geometry.size.width * 0.05)  // Indent from the left side
                }
                .frame(height: geometry.size.height * 0.7)

                HStack {
                    Circle()
                        .fill(circleColor)
                        .frame(width: 10, height: 10)
                        .padding()
                    Button(action: {
                        if (!recorder.isRecording) {
                            circleColor = .red
                            recorder.startRecording()
                            self.recorder.setRecord() // Updated to call setRecord()
                        } else {
                            circleColor = .black
                            recorder.stopRecording()
                            // No need to call setPlayback() if we're just stopping
                        }
                    }) {
                        Text(!recorder.isRecording ? "Start Listening" : "Stop Listening")
                            .foregroundColor(colorScheme == .light ? Color.white : Color.black)
                            .padding()
                            .background(
                                (recorder.hasMicrophoneAccess && recorder.isSpeechRecognizerAvailable) ?
                                Color.primary :
                                    Color.gray.opacity(0.6)
                            )
                            .overlay(
                                RoundedRectangle(cornerRadius: 8)
                                    .stroke(colorScheme == .dark ? Color.white.opacity(0.2) : Color.black.opacity(0.2), lineWidth: 1)
                            )
                            .cornerRadius(10)
                    }
                    .contentShape(Rectangle())
                    .disabled(
                        !recorder.hasMicrophoneAccess
                        || !recorder.isSpeechRecognizerAvailable
                    )
                    Button(action: {
                        // Clear UserDefaults when clearing the session
                        let clearText = "Session started \(formattedDate())\n\n"
                        UserDefaults.standard.set(clearText, forKey: userDefaultsKey)
                        consoleText = clearText
                    }) {
                        Text("Clear")
                            .foregroundColor(colorScheme == .light ? Color.white : Color.black)
                            .padding()
                            .background(Color.primary)
                            .overlay(
                                RoundedRectangle(cornerRadius: 8)
                                    .stroke(colorScheme == .dark ? Color.white.opacity(0.2) : Color.black.opacity(0.2), lineWidth: 1)
                            )
                            .cornerRadius(10)
                    }
                    .contentShape(Rectangle())
                }
                .frame(width: geometry.size.width * 0.9)  // Set width to 90% of the screen width
                Spacer()
            }
            .padding(.top, 75)
        }
        .onAppear {
            if let savedText = UserDefaults.standard.string(forKey: userDefaultsKey) {
                print("UserDefaults content at onAppear for key '\(userDefaultsKey)': \(savedText)")
            } else {
                print("No content found in UserDefaults at onAppear for key '\(userDefaultsKey)'.")
            }

            // Update `onRecognisedText` to handle both the recognized text and the isFinal flag
            self.recorder.onRecognisedText = { [self] text, isFinal in
                print("DEBUG:Recognized text received: '\(text)', Final: \(isFinal)")  // Log the recognized text

                if text.isEmpty {
                    return
                }

                if isFinal {
                    // For final results, append them with a newline (or other separator)
                    consoleText += "\n" + text
                    print("Used isFinal")
                } else {
                    // For partial results, replace the current line in `consoleText`
                    // This version appends the text in progress (partial result)
                    consoleText = text
                    print("Used Console")
                }

                // Update UserDefaults with the latest consoleText
                UserDefaults.standard.set(consoleText, forKey: userDefaultsKey)

                // Print UserDefaults content
                if let savedText = UserDefaults.standard.string(forKey: userDefaultsKey) {
                    print("UserDefaults content for key '\(userDefaultsKey)': \(savedText)")
                } else {
                    print("No content found in UserDefaults for key '\(userDefaultsKey)'.")
                }
            }

            self.speechManager.onFinishSpeaking = {
                self.recorder.setRecord() // Updated to call setRecord()
            }

            self.recorder.requestPermission()
        }
        .alert(isPresented: $recorder.showAlert) {
             Alert(title: Text(recorder.alertTitle), message: Text(recorder.alertMessage), dismissButton: .default(Text("OK")))
         }
    }
}

struct ContentView_Previews: PreviewProvider {
    static var previews: some View {
        ContentView()
    }
}

func formattedDate() -> String {
    let formatter = DateFormatter()
    formatter.dateStyle = .medium
    formatter.timeStyle = .short
    return formatter.string(from: Date())
}

Recorder.swift

//
//  Recorder.swift
//  ChattyMarv
//
//  Created by Daniel Platt on 16/09/2023.
//

import SwiftUI
import AVFoundation
import Speech

class Recorder: ObservableObject {
    @Published var showAlert = false
    @Published var alertTitle = ""
    @Published var alertMessage = ""

    @Published var isRecording: Bool = false
    @Published var hasMicrophoneAccess: Bool = false
    @Published var alert: Alert?

    private var speechRecognizer = SFSpeechRecognizer()
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    private let audioEngine = AVAudioEngine()
    private let audioSession = AVAudioSession.sharedInstance()

    var onRecognisedText: ((String, Bool) -> Void)? // Updated to include the isFinal flag
    var onRecognisedSound: (() -> Void)?

    init() {
        // Configure the audio session
        print("recorder")
        //Ghulam:self.setRecord()
        startRecording()

    }

    func requestPermission() {
        if #available(iOS 17.0, *) {
            AVAudioApplication.requestRecordPermission { (hasPermission) in
                DispatchQueue.main.async {
                    self.hasMicrophoneAccess = hasPermission

                    if !self.isSpeechRecognizerAvailable {
                        self.alert = Alert(title: Text("Speech Recognition Unavailable"),
                                           message: Text("Please try again later."),
                                           dismissButton: .default(Text("OK")))
                    }
                }
            }
        } else {
            audioSession.requestRecordPermission { (hasPermission) in
                DispatchQueue.main.async {
                    self.hasMicrophoneAccess = hasPermission

                    if !self.isSpeechRecognizerAvailable {
                        self.alert = Alert(title: Text("Speech Recognition Unavailable"),
                                           message: Text("Please try again later."),
                                           dismissButton: .default(Text("OK")))
                    }
                }
            }
        }

        SFSpeechRecognizer.requestAuthorization { authStatus in
            OperationQueue.main.addOperation {
                switch authStatus {
                case .denied:
                    self.updateAlert(title: "Access Denied", message: "User denied access to speech recognition")

                case .restricted:
                    self.updateAlert(title: "Access Restricted", message: "Speech recognition restricted on this device")

                case .notDetermined:
                    self.updateAlert(title: "Authorization Needed", message: "Speech recognition not yet authorized")

                default:
                    break
                }
            }
        }
    }

    func setRecord() {
        print("Entered the setRecord Function")
        do {
            try self.audioSession.setCategory(.record, mode: .default, options: [])
        } catch {
            print("Failed to set audio session category: \(error)")
        }
    }

    var isSpeechRecognizerAvailable: Bool {
        return speechRecognizer?.isAvailable ?? false
    }

    func startRecording() {
        // Request microphone access
        if (!self.hasMicrophoneAccess) {
            print("Microphone access denied")
            return
        }

        print("Start recording")
        do {
            self.setRecord() // Use the record category instead
            try self.audioSession.setActive(true, options: .notifyOthersOnDeactivation)
        } catch {
            print("Failed to set audio session category: \(error)")
        }

        print("Start recording - reset")

        // Reset the audio engine and the recognition task
        DispatchQueue.main.async {
            self.audioEngine.stop()
            self.recognitionTask?.cancel()

            // Change the UI state
            self.isRecording = true
            self.recognitionTask = nil
            self.recognitionRequest = nil

            // Create and configure the recognition request
            self.recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
            guard let recognitionRequest = self.recognitionRequest else {
                fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object")
            }

            recognitionRequest.shouldReportPartialResults = true

            if self.speechRecognizer?.supportsOnDeviceRecognition == true {
                // Set requiresOnDeviceRecognition to true to enforce on-device recognition
                recognitionRequest.requiresOnDeviceRecognition = true
            } else {
                // Handle the case where on-device recognition is not supported
                print("On-device recognition not supported for the current language or device configuration.")
            }

            // Install the tap on the audio engine's input node
            print("Install the tap on the audio engine's input node")

            let recordingFormat = self.audioEngine.inputNode.outputFormat(forBus: 0)
            self.audioEngine.inputNode.installTap(onBus: 0, bufferSize: 4096, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
                DispatchQueue.main.async {
                    if !self.isRecording {
                        return
                    }

                    self.recognitionRequest?.append(buffer)
                }
            }

            // Start the audio engine
            do {
                try self.audioEngine.start()
            } catch {
                print("There was a problem starting the audio engine.")
            }

            // Start the recognition task
            self.recognitionTask = self.speechRecognizer?.recognitionTask(with: recognitionRequest, resultHandler: { (result, error) in
                DispatchQueue.main.async {
                    if (!self.isRecording) {
                        return
                    }

                    if let result = result {
                        // Pass both the recognized text and the isFinal flag to the closure
                        self.onRecognisedText?(result.bestTranscription.formattedString, result.isFinal)
                        print("Recognition: \(result.bestTranscription.formattedString), Final: \(result.isFinal)")
                    } else if let error = error {
                        // Handle any errors here
                        print("Error during recognition: \(error.localizedDescription)")
                    }
                }
            })
        }
    }

    func stopRecording() {
        print("Stop recording")

        if !self.isRecording {
            return
        }

        DispatchQueue.main.async {
            self.recognitionTask?.cancel()
            self.recognitionRequest?.endAudio()

            self.audioEngine.inputNode.removeTap(onBus: 0)
            self.audioEngine.stop()

            do {
                try self.audioSession.setActive(false)
            } catch {
                print("There was a problem stopping the audio engine.")
            }

            // Reset recognition-related properties
            self.recognitionRequest = nil
            self.recognitionTask = nil
            self.isRecording = false
        }
    }

    private func updateAlert(title: String, message: String) {
        self.showAlert = true
        self.alertTitle = title
        self.alertMessage = message
    }
}

Please also find below the output of my console log:

Speech Recognition(25104,0x1f04dfec0) malloc: Unable to set up reclaim buffer (46) - disabling large cache
recorder
Microphone access denied
No content found in UserDefaults at INIT for key 'RecognizedText'.
No content found in UserDefaults at onAppear for key 'RecognizedText'.
#FactoryInstall Unable to query results, error: 5
Unable to list voice folder
Unable to list voice folder
Unable to list voice folder
Start recording
Entered the setRecord Function
Unable to list voice folder
Unable to list voice folder
Start recording - reset
Entered the setRecord Function
Install the tap on the audio engine's input node
DEBUG:Recognized text received: 'I', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I
Recognition: I, Final: false
DEBUG:Recognized text received: 'I will', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I will
Recognition: I will, Final: false
DEBUG:Recognized text received: 'I will talk', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I will talk
Recognition: I will talk, Final: false
DEBUG:Recognized text received: 'I will talk now', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I will talk now
Recognition: I will talk now, Final: false
DEBUG:Recognized text received: 'I will talk now pause', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I will talk now pause
Recognition: I will talk now pause, Final: false
DEBUG:Recognized text received: 'I will talk now pause', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I will talk now pause
Recognition: I will talk now pause, Final: false
DEBUG:Recognized text received: 'I', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I
Recognition: I, Final: false
DEBUG:Recognized text received: 'I am', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I am
Recognition: I am, Final: false
DEBUG:Recognized text received: 'I am now', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I am now
Recognition: I am now, Final: false
DEBUG:Recognized text received: 'I am now continu', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I am now continu
Recognition: I am now continu, Final: false
DEBUG:Recognized text received: 'I am now continuing', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I am now continuing
Recognition: I am now continuing, Final: false
DEBUG:Recognized text received: 'I am now continuing my', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I am now continuing my
Recognition: I am now continuing my, Final: false
DEBUG:Recognized text received: 'I am now continuing my new', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I am now continuing my new
Recognition: I am now continuing my new, Final: false
DEBUG:Recognized text received: 'I am now continuing my new', Final: false
Used Console
UserDefaults content for key 'RecognizedText': I am now continuing my new
Recognition: I am now continuing my new, Final: false
Stop recording