I have an issue where when I am using real time transcription, when I am not talking, it seems like it parses random text.

heromanofe commented 8 months ago

I was able to set up the model and it works really well. My code is:

```kotlin
private fun testAudio() {
    // Initialize Whisper
    val mWhisper = Whisper(this) // Create Whisper instance

    // Load model and vocabulary for Whisper
    val basePath = Global.fileOperations.getOutputDirectory("/Models", this)!!.path
    val modelPath = basePath + "/whisper-tiny.tflite" // Provide model file path
    val vocabPath: String = basePath +
        "/filters_vocab_multilingual.bin" // Provide vocabulary file path
    println("PATHS: ")
    println(modelPath)
    println(vocabPath)
    mWhisper.loadModel(modelPath, vocabPath, true) // Load model and set multilingual mode

    // Set a listener for Whisper to handle updates and results
    mWhisper.setListener(object : IWhisperListener {
        override fun onUpdateReceived(message: String?) {
            Log.i("TRANSCRIBE_WHISPER", "New State: $message")
            // Handle Whisper status updates
        }

        override fun onResultReceived(result: String?) {
            Log.i("TRANSCRIBE_WHISPER", result ?: "")
            // Handle transcribed results
        }
    })

    // Initialize Recorder
    val mRecorder = Recorder(this) // Create Recorder instance

    // Set a listener for Recorder to handle updates and audio data
    mRecorder.setListener(object : IRecorderListener {
        override fun onUpdateReceived(message: String) {
            // Handle Recorder status updates
        }

        override fun onDataReceived(samples: FloatArray) {
            // Handle audio data received during recording
            // You can forward this data to Whisper for live recognition using writeBuffer()
            mWhisper.writeBuffer(samples)
        }
    })

    mRecorder.start() // Start recording
}
```

and the `onResultReceived` callback (the one that logs with the `TRANSCRIBE_WHISPER` tag) seemed to return:

[audioRecordData][fine] 5s(f:5014 m:0 s:0) : pid 8824 uid 10419 sessionId 41305 sr 16000 ch 1 fmt 1

"I'll make a hole in the hole." (2 times), then this:

[audioRecordData][fine] 10s(f:10000 m:0 s:0) : pid 8824 uid 10419 sessionId 41305 sr 16000 ch 1 fmt 1

then "I'll be back with a little ...." (repeated a lot)

Thanks for your hard work :P

vilassn commented 8 months ago

This can be fixed with VAD (voice activity detection) support, but VAD is not implemented yet.

ITHealer commented 8 months ago

> This can be fixed with VAD (voice activity detection) support, but VAD is not implemented yet.

I am trying to add VAD to the C++ source of my project, taking ideas from this file: https://github.com/vilassn/whisper_android/blob/master/app/src/main/cpp/silent_detection.cpp

I tried calculating the dB level of each input audio clip of BUFFER_SIZE samples and keeping only the clips that contain speech, which are appended to outputBuffer. I then use this vector to calculate log_mel_spectrogram(...). However, in my tests the transcription was a completely different sentence from the original one.
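For readers following along, here is a minimal Java sketch of the per-buffer dB gating described above. It is not the code from silent_detection.cpp: the class and method names (`EnergyGate`, `rmsDb`, `keepLoudBuffers`) are hypothetical, and it assumes float samples normalized to [-1, 1] so that levels come out in dBFS.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: keeps only the fixed-size buffers whose RMS level exceeds a dB threshold. */
final class EnergyGate {

    /** RMS level of samples[offset, offset + length) in dBFS, assuming samples in [-1, 1]. */
    static double rmsDb(float[] samples, int offset, int length) {
        double sumSquares = 0.0;
        for (int i = offset; i < offset + length; i++) {
            sumSquares += (double) samples[i] * samples[i];
        }
        double rms = Math.sqrt(sumSquares / length);
        // Clamp so that digital silence does not produce log10(0).
        return 20.0 * Math.log10(Math.max(rms, 1e-10));
    }

    /** Concatenates every complete buffer louder than thresholdDb; a trailing partial buffer is dropped. */
    static float[] keepLoudBuffers(float[] samples, int bufferSize, double thresholdDb) {
        List<float[]> kept = new ArrayList<>();
        for (int start = 0; start + bufferSize <= samples.length; start += bufferSize) {
            if (rmsDb(samples, start, bufferSize) > thresholdDb) {
                float[] buffer = new float[bufferSize];
                System.arraycopy(samples, start, buffer, 0, bufferSize);
                kept.add(buffer);
            }
        }
        float[] out = new float[kept.size() * bufferSize];
        for (int i = 0; i < kept.size(); i++) {
            System.arraycopy(kept.get(i), 0, out, i * bufferSize, bufferSize);
        }
        return out;
    }
}
```

One thing this sketch makes visible: gating every buffer independently also removes short pauses and quiet sounds between words, so the audio handed to log_mel_spectrogram(...) no longer has the original timing. That may be one reason the transcription changes, although it has not been verified here.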

This is the result when I choose the threshold as -45.0:

This is the result when I choose the threshold as -40.0:

This is the result when I choose the threshold as -35.0:

Can you help me assess where the problem might be?

heromanofe commented 8 months ago

Yeah, but I don't understand how VAD can fix the random text being detected. I will check what audio is recorded and report back.

vilassn commented 8 months ago

@heromanofe A window of 512 samples (32 ms at the 16 kHz sample rate used here) is used to decide whether there is silence. If a run of consecutive windows is silent, say 16 windows in a row, then consider that there is no voice activity (i.e. silence).

In short, check for about 500 ms of silence (16 consecutive windows, 16 × 32 ms = 512 ms) instead of a single 32 ms window.

I hope this works. I should check it too.
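The idea above (only report silence after 16 consecutive silent 512-sample windows) could be sketched roughly like this in Java. The class name `SilenceRunDetector`, the dB threshold parameter, and the RMS-based test for a "silent" window are assumptions for illustration, not the project's actual implementation.

```java
/** Hypothetical detector: reports silence only after N consecutive silent windows. */
final class SilenceRunDetector {
    private final int requiredSilentWindows; // e.g. 16 windows of 512 samples
    private final double thresholdDb;        // windows quieter than this count as silent
    private int silentRun = 0;

    SilenceRunDetector(int requiredSilentWindows, double thresholdDb) {
        this.requiredSilentWindows = requiredSilentWindows;
        this.thresholdDb = thresholdDb;
    }

    /** Feed one window; returns true once the current run of silent windows is long enough. */
    boolean isSilence(float[] window) {
        double sumSquares = 0.0;
        for (float s : window) {
            sumSquares += (double) s * s;
        }
        double rms = Math.sqrt(sumSquares / window.length);
        double db = 20.0 * Math.log10(Math.max(rms, 1e-10));
        if (db < thresholdDb) {
            silentRun++;
        } else {
            silentRun = 0; // any loud window resets the run
        }
        return silentRun >= requiredSilentWindows;
    }
}
```

With this shape, a single quiet window in the middle of a sentence does not cut the audio; only a sustained run of quiet windows does.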

heromanofe commented 8 months ago

I've noticed an interesting thing: I have the multi-language model and it translates my speech when I think it shouldn't.

vilassn commented 8 months ago

@heromanofe Yes. This is the default behaviour for other languages: it translates to English if the input language is not English. We need to regenerate the model with the required configuration.

heromanofe commented 8 months ago

Speaking of which, I would be interested in generating those bin and tflite files myself, or at least in having some place where I can download other models. I will check in 1-2 hours what Whisper receives from the recorder.

heromanofe commented 8 months ago

https://1drv.ms/u/s!AgXqUQNVnl-xmZ07Nq71pVUibaZUOg?e=blb6zR <-- OneDrive link to the audio; if you want, I can send the file some other way. Here is the output from my app:

2023-12-07 18:18:47.100 16170-16184 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I Compiler allocated 6018KB to compile java.lang.Object com.~MyStudio~.MyAppName~.model.XMLRPC.exeKwSafe(java.lang.String, java.lang.String, java.lang.Object, java.util.Map, com.~MyStudio~.MyAppName~.Permissions, boolean, boolean, boolean, kotlin.coroutines.Continuation)
2023-12-07 18:18:47.504 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 5s(f:5000 m:0 s:0) : pid 16170 uid 10419 sessionId 41849 sr 16000 ch 1 fmt 1
2023-12-07 18:18:47.959 16170-16351 System.out com.~MyStudio~.MyAppName~ I task refresh start
2023-12-07 18:18:47.964 16170-16463 System.out com.~MyStudio~.MyAppName~ I Already Running...!
2023-12-07 18:18:49.186 16170-16359 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I .
2023-12-07 18:18:49.217 16170-16185 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I NativeAlloc concurrent copying GC freed 96451(6415KB) AllocSpace objects, 28(668KB) LOS objects, 50% free, 12MB/25MB, paused 434us,49us total 127.362ms
2023-12-07 18:18:51.738 16170-16185 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I NativeAlloc concurrent copying GC freed 61908(2417KB) AllocSpace objects, 3(228KB) LOS objects, 50% free, 16MB/33MB, paused 1.142ms,1.206ms total 263.029ms
2023-12-07 18:18:51.848 16170-16359 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I I'll make a hole in the hole
2023-12-07 18:18:52.115 16170-16181 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I Thread[6,tid=16181,WaitingInMainSignalCatcherLoop,Thread*=0xb400007c9343d000,peer=0x13d00000,"Signal Catcher"]: reacting to signal 3
2023-12-07 18:18:52.115 16170-16181 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I
2023-12-07 18:18:52.306 16170-16181 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I Wrote stack traces to tombstoned
2023-12-07 18:18:52.504 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 10s(f:10000 m:0 s:0) : pid 16170 uid 10419 sessionId 41849 sr 16000 ch 1 fmt 1
2023-12-07 18:18:54.449 16170-16464 System.out com.~MyStudio~.MyAppName~ I widget tasks! Took: 7Seconds, 438Milliseconds
2023-12-07 18:18:54.458 16170-16464 System.out com.~MyStudio~.MyAppName~ I Tag Was Removed...!
2023-12-07 18:18:54.460 16170-16170 Choreographer com.~MyStudio~.MyAppName~ I Skipped 875 frames! The application may be doing too much work on its main thread.
2023-12-07 18:18:54.504 16170-16185 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I NativeAlloc concurrent copying GC freed 64933(2171KB) AllocSpace objects, 1(188KB) LOS objects, 50% free, 19MB/39MB, paused 2.278ms,1.462ms total 385.249ms
2023-12-07 18:18:54.569 16170-16359 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I I'll make a little more of the dough.
2023-12-07 18:18:55.591 16170-16658 ProfileInstaller com.~MyStudio~.MyAppName~ D Installing profile for com.~MyStudio~.MyAppName~
2023-12-07 18:18:55.615 16170-16199 OpenGLRenderer com.~MyStudio~.MyAppName~ I Davey! duration=8462ms; Flags=0, FrameTimelineVsyncId=368239147, IntendedVsync=1075489537039394, Vsync=1075496847566394, InputEventId=0, HandleInputStart=1075496856064781, AnimationStart=1075496856077020, PerformTraversalsStart=1075496860344781, DrawStart=1075497881552749, FrameDeadline=1075489549372727, FrameInterval=1075496855368999, FrameStartTime=8354888, SyncQueued=1075497972425353, SyncStart=1075497972721343, IssueDrawCommandsStart=1075497974028738, SwapBuffers=1075497993450405, FrameCompleted=1075497999888686, DequeueBufferDuration=31823, QueueBufferDuration=672396, GpuCompleted=1075497999888686, SwapBuffersCompleted=1075497994771238, DisplayPresentTime=0, CommandSubmissionCompleted=1075497993450405,
2023-12-07 18:18:55.654 16170-16170 Choreographer com.~MyStudio~.MyAppName~ I Skipped 142 frames! The application may be doing too much work on its main thread.
2023-12-07 18:18:55.732 16170-16199 OpenGLRenderer com.~MyStudio~.MyAppName~ I Davey! duration=1261ms; Flags=0, FrameTimelineVsyncId=368245973, IntendedVsync=1075496863178718, Vsync=1075498049451830, InputEventId=0, HandleInputStart=1075498049999468, AnimationStart=1075498050005770, PerformTraversalsStart=1075498050892228, DrawStart=1075498083264780, FrameDeadline=1075496883866087, FrameInterval=1075498049743947, FrameStartTime=8354036, SyncQueued=1075498098311811, SyncStart=1075498098412801, IssueDrawCommandsStart=1075498100124311, SwapBuffers=1075498116940457, FrameCompleted=1075498124907593, DequeueBufferDuration=76666, QueueBufferDuration=345781, GpuCompleted=1075498124907593, SwapBuffersCompleted=1075498117692801, DisplayPresentTime=0, CommandSubmissionCompleted=1075498116940457,
2023-12-07 18:18:56.889 16170-16359 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I I'll make a hole in the hole
2023-12-07 18:18:57.504 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 15s(f:15000 m:0 s:0) : pid 16170 uid 10419 sessionId 41849 sr 16000 ch 1 fmt 1
2023-12-07 18:18:59.690 16170-16359 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I I'll make a small piece of cake with a little bit of sugar.
2023-12-07 18:19:02.506 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 20s(f:20002 m:0 s:0) : pid 16170 uid 10419 sessionId 41849 sr 16000 ch 1 fmt 1
2023-12-07 18:19:02.543 16170-16359 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I I'll make a hole in the hole.
2023-12-07 18:19:05.311 16170-16359 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I you
2023-12-07 18:19:07.504 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 25s(f:25000 m:0 s:0) : pid 16170 uid 10419 sessionId 41849 sr 16000 ch 1 fmt 1
2023-12-07 18:19:08.509 16170-16359 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I I'll make a hole in the hole.
2023-12-07 18:19:11.442 16170-16359 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I I'll make a hole in the hole
2023-12-07 18:19:12.261 16318-16338 System com.~MyStudio~.MyAppName~ W A resource failed to call close.
2023-12-07 18:19:12.505 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 30s(f:30000 m:0 s:0) : pid 16170 uid 10419 sessionId 41849 sr 16000 ch 1 fmt 1
2023-12-07 18:19:12.562 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D stop mSessionID=41849
2023-12-07 18:19:12.563 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D stop(10025): mActive:1
2023-12-07 18:19:12.607 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D stop mSessionID=41849
2023-12-07 18:19:12.607 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D stop(10025): mActive:0
2023-12-07 18:19:12.607 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D stop mSessionID=41849
2023-12-07 18:19:12.607 16170-16360 AudioRecord com.~MyStudio~.MyAppName~ D stop(10025): mActive:0
2023-12-07 18:19:12.664 16170-16360 Recorder com.~MyStudio~.MyAppName~ D Recorded file: /storage/emulated/0/Android/media/com.~MyStudio~.MyAppName~/MyAppName~/Models/test.wav
2023-12-07 18:19:14.681 16170-16359 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I I'll make a small piece of cake with a little bit of sugar.

heromanofe commented 8 months ago

Okay, you were sooo right :D I remembered that I had looked into VAD before. I added https://github.com/gkonovalov/android-vad to my project using these Gradle dependencies:

implementation 'org.tensorflow:tensorflow-lite-task-audio:0.4.0'
implementation 'com.github.gkonovalov.android-vad:yamnet:2.0.4'

and in your code (Recorder), before the while loop:

```java
VadYamnet vad = Vad.builder()
        .setContext(mContext)
        .setSampleRate(SampleRate.SAMPLE_RATE_16K)
        .setFrameSize(FrameSize.FRAME_SIZE_487)
        .setMode(Mode.NORMAL)
        .setSilenceDurationMs(200)
        .setSpeechDurationMs(30)
        .build();
```

and inside the while loop:

```java
SoundCategory soundCategory = vad.classifyAudio(samples);
Log.d(TAG, soundCategory.getLabel());
Log.d(TAG, String.valueOf(soundCategory.getScore()));
// Send samples for transcription only when the VAD is confident it heard speech
if (soundCategory.getLabel().equals("Speech") && soundCategory.getScore() > 0.5)
    sendData(samples);
```

and result is this:

2023-12-07 19:27:45.835 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Silence
2023-12-07 19:27:45.835 7830-8027 Recorder com.~MyStudio~.MyAppName~ D 0.0
2023-12-07 19:27:45.999 7830-8018 System.out com.~MyStudio~.MyAppName~ I Main User Logging (Auto-Login) Took: 1Seconds, 782Milliseconds
2023-12-07 19:27:46.268 7830-7830 DecorView[] com.~MyStudio~.MyAppName~ D onWindowFocusChanged hasWindowFocus false
2023-12-07 19:27:46.330 30975-31175 ActivityManagerWrapper com.mi.android.globallauncher E getRecentTasks: mainTaskId=3824 userId=0 baseIntent=Intent { act=android.intent.action.MAIN flag=268435456 cmp=ComponentInfo{com.~MyStudio~.MyAppName~/com.~MyStudio~.MyAppName~.MainActivity} }
2023-12-07 19:27:46.349 30975-31175 ActivityManagerWrapper com.mi.android.globallauncher E getRecentTasks: mainTaskId=3824 userId=0 baseIntent=Intent { act=android.intent.action.MAIN flag=268435456 cmp=ComponentInfo{com.~MyStudio~.MyAppName~/com.~MyStudio~.MyAppName~.MainActivity} }
2023-12-07 19:27:46.456 7830-7830 com.github...orActivity com.~MyStudio~.MyAppName~ D Detect NFC state changes while previously enabled
2023-12-07 19:27:46.456 7830-7830 com.github...orActivity com.~MyStudio~.MyAppName~ D NFC state remains enabled
2023-12-07 19:27:46.458 7830-7830 System.out com.~MyStudio~.MyAppName~ I task refresh start
2023-12-07 19:27:46.478 7830-7830 DecorView[] com.~MyStudio~.MyAppName~ D onWindowFocusChanged hasWindowFocus true
2023-12-07 19:27:46.505 7830-7830 HandWritingStubImpl com.~MyStudio~.MyAppName~ I refreshLastKeyboardType: 1
2023-12-07 19:27:46.505 7830-7830 HandWritingStubImpl com.~MyStudio~.MyAppName~ I getCurrentKeyboardType: 1
2023-12-07 19:27:46.506 30975-31175 ActivityManagerWrapper com.mi.android.globallauncher E getRecentTasks: mainTaskId=3824 userId=0 baseIntent=Intent { act=android.intent.action.MAIN flag=268435456 cmp=ComponentInfo{com.~MyStudio~.MyAppName~/com.~MyStudio~.MyAppName~.MainActivity} }
2023-12-07 19:27:46.551 30975-31175 ActivityManagerWrapper com.mi.android.globallauncher E getRecentTasks: mainTaskId=3824 userId=0 baseIntent=Intent { act=android.intent.action.MAIN flag=268435456 cmp=ComponentInfo{com.~MyStudio~.MyAppName~/com.~MyStudio~.MyAppName~.MainActivity} }
2023-12-07 19:27:46.965 7830-7849 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I Compiler allocated 6018KB to compile java.lang.Object com.~MyStudio~.MyAppName~.model.NetixXMLRPC.exeKwSafe(java.lang.String, java.lang.String, java.lang.Object, java.util.Map, com.~MyStudio~.MyAppName~.Permissions, boolean, boolean, boolean, kotlin.coroutines.Continuation)
2023-12-07 19:27:47.234 7830-8093 System.out com.~MyStudio~.MyAppName~ I task refresh start
2023-12-07 19:27:47.236 7830-8095 System.out com.~MyStudio~.MyAppName~ I Already Running...!
2023-12-07 19:27:47.681 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 5s(f:5019 m:0 s:0) : pid 7830 uid 10419 sessionId 42009 sr 16000 ch 1 fmt 1
2023-12-07 19:27:48.782 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Silence
2023-12-07 19:27:48.782 7830-8027 Recorder com.~MyStudio~.MyAppName~ D 0.0
2023-12-07 19:27:49.617 7830-7845 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I Thread[6,tid=7845,WaitingInMainSignalCatcherLoop,Thread*=0xb400007c9343d000,peer=0x13c803d0,"Signal Catcher"]: reacting to signal 3
2023-12-07 19:27:49.617 7830-7845 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I
2023-12-07 19:27:49.755 7830-7845 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I Wrote stack traces to tombstoned
2023-12-07 19:27:51.730 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Speech
2023-12-07 19:27:51.730 7830-8027 Recorder com.~MyStudio~.MyAppName~ D 0.95703125
2023-12-07 19:27:52.171 7830-7845 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I Thread[6,tid=7845,WaitingInMainSignalCatcherLoop,Thread*=0xb400007c9343d000,peer=0x13c803d0,"Signal Catcher"]: reacting to signal 3
2023-12-07 19:27:52.171 7830-7845 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I
2023-12-07 19:27:52.367 7830-8017 System.out com.~MyStudio~.MyAppName~ I widget tasks! Took: 5Seconds, 518Milliseconds
2023-12-07 19:27:52.368 7830-8017 System.out com.~MyStudio~.MyAppName~ I Tag Was Removed...!
2023-12-07 19:27:52.370 7830-7830 Choreographer com.~MyStudio~.MyAppName~ I Skipped 658 frames! The application may be doing too much work on its main thread.
2023-12-07 19:27:52.418 7830-7845 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I Waiting for a blocking GC ObjectsAllocated
2023-12-07 19:27:52.551 7830-7850 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I NativeAlloc concurrent copying GC freed 29251(1167KB) AllocSpace objects, 4(412KB) LOS objects, 50% free, 19MB/39MB, paused 145us,59us total 257.088ms
2023-12-07 19:27:52.551 7830-7845 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I WaitForGcToComplete blocked ObjectsAllocated on NativeAlloc for 133.523ms
2023-12-07 19:27:52.552 7830-7845 ~MyStudio~.MyAppName~ com.~MyStudio~.MyAppName~ I Wrote stack traces to tombstoned
2023-12-07 19:27:52.681 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 10s(f:10020 m:0 s:0) : pid 7830 uid 10419 sessionId 42009 sr 16000 ch 1 fmt 1
2023-12-07 19:27:53.282 7830-7866 OpenGLRenderer com.~MyStudio~.MyAppName~ I Davey! duration=6403ms; Flags=0, FrameTimelineVsyncId=373222770, IntendedVsync=1079629264489237, Vsync=1079634761097501, InputEventId=0, HandleInputStart=1079634765932890, AnimationStart=1079634765937525, PerformTraversalsStart=1079634767145181, DrawStart=1079635577064712, FrameDeadline=1079629276822570, FrameInterval=1079634765744192, FrameStartTime=8353508, SyncQueued=1079635649329816, SyncStart=1079635649478150, IssueDrawCommandsStart=1079635650830285, SwapBuffers=1079635662302577, FrameCompleted=1079635668211848, DequeueBufferDuration=43594, QueueBufferDuration=365625, GpuCompleted=1079635668211848, SwapBuffersCompleted=1079635663156587, DisplayPresentTime=0, CommandSubmissionCompleted=1079635662302577,
2023-12-07 19:27:53.307 7830-8167 ProfileInstaller com.~MyStudio~.MyAppName~ D Installing profile for com.~MyStudio~.MyAppName~
2023-12-07 19:27:53.369 7830-7830 Choreographer com.~MyStudio~.MyAppName~ I Skipped 119 frames! The application may be doing too much work on its main thread.
2023-12-07 19:27:53.470 7830-7866 OpenGLRenderer com.~MyStudio~.MyAppName~ I Davey! duration=1089ms; Flags=0, FrameTimelineVsyncId=373233395, IntendedVsync=1079634769368090, Vsync=1079635763385681, InputEventId=0, HandleInputStart=1079635764478931, AnimationStart=1079635764496691, PerformTraversalsStart=1079635765899764, DrawStart=1079635814580702, FrameDeadline=1079634790054512, FrameInterval=1079635763763462, FrameStartTime=8353089, SyncQueued=1079635834791327, SyncStart=1079635834967316, IssueDrawCommandsStart=1079635837195545, SwapBuffers=1079635852484764, FrameCompleted=1079635859409816, DequeueBufferDuration=51198, QueueBufferDuration=531875, GpuCompleted=1079635859409816, SwapBuffersCompleted=1079635854016743, DisplayPresentTime=0, CommandSubmissionCompleted=1079635852484764,
2023-12-07 19:27:54.352 7830-8026 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I hello this is test to use
2023-12-07 19:27:54.714 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Speech
2023-12-07 19:27:54.714 7830-8027 Recorder com.~MyStudio~.MyAppName~ D 0.95703125
2023-12-07 19:27:56.644 7830-8026 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I 16k sample rate and
2023-12-07 19:27:57.698 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Speech
2023-12-07 19:27:57.699 7830-8027 Recorder com.~MyStudio~.MyAppName~ D 0.98046875
2023-12-07 19:27:57.699 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 15s(f:15038 m:0 s:0) : pid 7830 uid 10419 sessionId 42009 sr 16000 ch 1 fmt 1
2023-12-07 19:27:59.592 7830-8026 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I and the frame size 487.
2023-12-07 19:28:00.706 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Speech
2023-12-07 19:28:00.706 7830-8027 Recorder com.~MyStudio~.MyAppName~ D 0.96875
2023-12-07 19:28:02.532 7830-8026 TRANSCRIBE_WHISPER com.~MyStudio~.MyAppName~ I with normal mode. Speech detection.
2023-12-07 19:28:02.682 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 20s(f:20020 m:0 s:0) : pid 7830 uid 10419 sessionId 42009 sr 16000 ch 1 fmt 1
2023-12-07 19:28:03.696 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Silence
2023-12-07 19:28:03.696 7830-8027 Recorder com.~MyStudio~.MyAppName~ D 0.0
2023-12-07 19:28:06.685 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Silence
2023-12-07 19:28:06.685 7830-8027 Recorder com.~MyStudio~.MyAppName~ D 0.0
2023-12-07 19:28:07.681 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D [audioRecordData][fine] 25s(f:25020 m:0 s:0) : pid 7830 uid 10419 sessionId 42009 sr 16000 ch 1 fmt 1
2023-12-07 19:28:09.681 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Animal
2023-12-07 19:28:09.681 7830-8027 Recorder com.~MyStudio~.MyAppName~ D 0.4140625
2023-12-07 19:28:12.526 7987-8007 System com.~MyStudio~.MyAppName~ W A resource failed to call close.
2023-12-07 19:28:12.670 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Silence
2023-12-07 19:28:12.671 7830-8027 Recorder com.~MyStudio~.MyAppName~ D 0.0
2023-12-07 19:28:12.680 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D stop mSessionID=42009
2023-12-07 19:28:12.680 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D stop(10055): mActive:1
2023-12-07 19:28:12.740 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D stop mSessionID=42009
2023-12-07 19:28:12.740 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D stop(10055): mActive:0
2023-12-07 19:28:12.741 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D stop mSessionID=42009
2023-12-07 19:28:12.741 7830-8027 AudioRecord com.~MyStudio~.MyAppName~ D stop(10055): mActive:0
2023-12-07 19:28:12.753 7830-8027 Recorder com.~MyStudio~.MyAppName~ D Recorded file: /storage/emulated/0/Android/media/com.~MyStudio~.MyAppName~/MyAppName~/Models/test.wav

Here is the OneDrive link to the file: https://1drv.ms/u/s!AgXqUQNVnl-xmZ086HH_M3ekp8XeUQ?e=dQyePD

vilassn commented 8 months ago

Has your problem been solved?

heromanofe commented 8 months ago

> Has your problem been solved?

It was a VAD problem, though I wouldn't celebrate just yet. I noticed that some speech gets detected as silence instead :D I need to fine-tune it, but then it's working 100% :P Thanks for your work.

ITHealer commented 8 months ago

> > Has your problem been solved?
>
> It was a VAD problem, though I wouldn't celebrate just yet. I noticed that some speech gets detected as silence instead :D I need to fine-tune it, but then it's working 100% :P Thanks for your work.

Can you guide me on how to run the project from the repo https://github.com/gkonovalov/android-vad? Would that be okay?

I ran it, but when I clicked record the result was "Noise detected" even though I was still talking. I don't understand how it works.

heromanofe commented 8 months ago

I don't know about the app; all I did was this (in the Recorder.java file):

heromanofe commented 8 months ago

Quick update on my situation: I decided to write Kotlin code for real-time recognition. It works very simply: I take your recording system and just leave out the 1-second chunks part. Then in my code I have a system for tracking timeouts. There are 2 timeouts: the first fires if I don't talk for 5 seconds after activating, and the second fires when I stop talking (< 2 second timeout). When the 2nd timeout happens, I gather all the FloatArrays I've collected and push them to Whisper for recognition. The result is this:

2023-12-11 19:53:26.300 30867-31012 WHISPER: New State com.ERPStudio.ErpDroid W READY
2023-12-11 19:53:26.309 30867-31012 WHISPER: New State com.ERPStudio.ErpDroid W LISTENING_WAITING
2023-12-11 19:53:29.591 30867-31039 WHISPER: New State com.ERPStudio.ErpDroid W LISTENING_RECORDING
2023-12-11 19:53:34.509 30867-31039 System.out com.ERPStudio.ErpDroid I Whisper: recognizing text....
2023-12-11 19:53:34.509 30867-31039 WHISPER: New State com.ERPStudio.ErpDroid W READY
2023-12-11 19:53:37.363 30867-31038 TRANSCRIBE_WHISPER com.ERPStudio.ErpDroid I Test Test 1,2,3, Test Test

So Whisper took about 3 seconds to recognise that text.
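For anyone who wants to try the same approach, the two-timeout scheme described above might look roughly like this in Java. This is only a sketch, not the code from my app: the class `UtteranceCollector`, its state names, and the idea of passing timestamps and a per-chunk speech flag in from outside are all assumptions made so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical collector for the two-timeout scheme; timestamps are passed in so it is easy to test. */
final class UtteranceCollector {
    enum State { WAITING, RECORDING, DONE, TIMED_OUT }

    private final long activatedAtMs;
    private final long startTimeoutMs;   // e.g. 5000: give up if nobody speaks after activation
    private final long silenceTimeoutMs; // e.g. 2000: utterance ends after this much silence
    private final List<float[]> chunks = new ArrayList<>();
    private State state = State.WAITING;
    private long lastSpeechAtMs;

    UtteranceCollector(long nowMs, long startTimeoutMs, long silenceTimeoutMs) {
        this.activatedAtMs = nowMs;
        this.lastSpeechAtMs = nowMs;
        this.startTimeoutMs = startTimeoutMs;
        this.silenceTimeoutMs = silenceTimeoutMs;
    }

    /** Feed one chunk plus the speech/silence decision for it; returns the state after this chunk. */
    State onChunk(float[] chunk, boolean isSpeech, long nowMs) {
        if (state == State.DONE || state == State.TIMED_OUT) {
            return state;
        }
        if (isSpeech) {
            state = State.RECORDING;
            lastSpeechAtMs = nowMs;
        }
        if (state == State.RECORDING) {
            chunks.add(chunk); // trailing silence is kept too, so word endings are not cut off
            if (nowMs - lastSpeechAtMs >= silenceTimeoutMs) {
                state = State.DONE;
            }
        } else if (nowMs - activatedAtMs >= startTimeoutMs) {
            state = State.TIMED_OUT;
        }
        return state;
    }

    /** Concatenates everything recorded so far into one array for a single recognition pass. */
    float[] collected() {
        int total = 0;
        for (float[] c : chunks) {
            total += c.length;
        }
        float[] out = new float[total];
        int pos = 0;
        for (float[] c : chunks) {
            System.arraycopy(c, 0, out, pos, c.length);
            pos += c.length;
        }
        return out;
    }
}
```

Once `onChunk` returns `DONE`, the array from `collected()` is what would be handed to Whisper in one go (for example through the `writeBuffer` call from the first post).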

I am making a 2bl app and I need both: TTS, which like here can be slow, and commands (like start X, do Y), which ideally should be very quick. This 3-second delay is too much for me. What can you suggest for speed optimization? Keep in mind that right now I am using whisper-tiny.tflite, i.e. the multi-language model. Would using the English model speed things up?

vilassn commented 8 months ago

Transcription time varies from device to device. On a high-end device, transcription time will be lower.

You can debug which step takes more time: the mel spectrogram calculation or the inference.
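As a rough illustration of that kind of measurement, a small timing wrapper in Java could look like the sketch below. `StageTimer` and `Timed` are hypothetical names, and the stages you pass in (mel spectrogram computation, model inference) would be whatever functions your own build exposes; none of them are defined here.

```java
import java.util.function.Supplier;

/** Hypothetical helper that times one pipeline stage and prints how long it took. */
final class StageTimer {

    /** Holds a stage's result together with its wall-clock duration. */
    static final class Timed<T> {
        final T value;
        final long elapsedMs;

        Timed(T value, long elapsedMs) {
            this.value = value;
            this.elapsedMs = elapsedMs;
        }
    }

    /** Runs the stage once, prints "<label> took N ms", and returns the result with its duration. */
    static <T> Timed<T> time(String label, Supplier<T> stage) {
        long start = System.nanoTime();
        T value = stage.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + " took " + elapsedMs + " ms");
        return new Timed<>(value, elapsedMs);
    }
}
```

Wrapping the spectrogram step and the inference step separately, and comparing the two printed durations over a few runs, should show which one dominates on a given device.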

matanel-6over6 commented 7 months ago

Hi, first of all, thanks for the hard work. Is there a solution to the quiet mode issue? Even when I don't speak and there is complete silence, words still come back to me.

heromanofe commented 7 months ago

@matanel-6over6 Scroll up for the screenshots. Here is the library: https://github.com/gkonovalov/android-vad You need VAD, and that was a pretty good solution for me.

matanel-6over6 commented 7 months ago

@heromanofe Thanks for the quick reply. What should I take from the project you mentioned and add to vilassn's project?

heromanofe commented 7 months ago

You add that library in Gradle:

implementation 'org.tensorflow:tensorflow-lite-task-audio:0.4.0'
implementation 'com.github.gkonovalov.android-vad:yamnet:2.0.4'

and for this project, Recorder.java is the file where you add the VAD.

matanel-6over6 commented 7 months ago

Do I need to add what you marked to the Recorder class?

heromanofe commented 7 months ago

The stuff in the screenshot goes there; the implementation lines go in Gradle (app/build.gradle).

matanel-6over6 commented 7 months ago

@heromanofe Yes, I understand, thank you very much.

matanel-6over6 commented 7 months ago

@heromanofe Working great. Thank you very much.

vilassn / whisper_android

I have an issue where when I am using real time transcription, when I am not talking, it seems like it parses random text. #4