gtreshchev / RuntimeSpeechRecognizer

Cross-platform, real-time, offline speech recognition plugin for Unreal Engine. Based on OpenAI's Whisper technology via whisper.cpp.
MIT License

[Blank Audio], I can't capture sound wave #31

Closed liangzhenbin1992 closed 2 months ago

liangzhenbin1992 commented 3 months ago

When I set up the BP like this and compile it, everything seems fine, but when I start it, it only shows [BLANK_AUDIO], and there's no issue with the microphone, so I'm confused and don't know what's happening. Please help me, thanks a lot. Here is the BP and the output log: 1719117809541 output_LOG.txt

[part of output log]:

……
LogRuntimeSpeechRecognizer: Pending audio data instead of enqueuing it since it is not enough to fill the step size (pending: 14879, num of samples per step: 80000)
LogRuntimeAudioImporter: No need to resample or mix audio data
LogRuntimeAudioImporter: Reallocating buffer to append data (new capacity: 566400)
LogRuntimeSpeechRecognizer: Pending audio data instead of enqueuing it since it is not enough to fill the step size (pending: 15039, num of samples per step: 80000)
LogBlueprintUserMessages: [BP_ThirdPersonCharacter_C_0] 完成声音捕获 (sound capture finished)
LogRuntimeSpeechRecognizer: Enqueued audio data from the pending audio to the queue of the speech recognizer as the last data (num of samples: 15039)
LogRuntimeSpeechRecognizer: Processed audio data with the size of 79360 samples to the whisper recognizer
LogRuntimeSpeechRecognizer: Recognized text segment: " [BLANK_AUDIO]"
LogBlueprintUserMessages: [BP_ThirdPersonCharacter_C_0] [BLANK_AUDIO]
LogRuntimeSpeechRecognizer: Speech recognition progress: 100
LogRuntimeSpeechRecognizer: Speech recognition progress: 0
LogRuntimeSpeechRecognizer: Processed audio data with the size of 17600 samples to the whisper recognizer
LogRuntimeSpeechRecognizer: Recognized text segment: " [BLANK_AUDIO]"
LogBlueprintUserMessages: [BP_ThirdPersonCharacter_C_0] [BLANK_AUDIO]
LogRuntimeSpeechRecognizer: Speech recognition progress: 100
LogBlueprintUserMessages: [BP_ThirdPersonCharacter_C_0] Can't capture sound wave
LogRuntimeSpeechRecognizer: Speech recognition finished
LogRuntimeSpeechRecognizer: Stopping the speech recognizer thread
LogCore: Display: Tracing Screenshot "ScreenShot00007" taken with size: 2578 x 1408
LogCore: Display: Tracing Screenshot "ScreenShot00008" taken with size: 2578 x 1408
LogSlate: Updating window title bar state: overlay mode, drag disabled, window buttons hidden, title bar hidden
LogWorld: BeginTearingDown for /Game/ThirdPerson/Maps/UEDPIE_0_ThirdPersonMap
LogWorld: UWorld::CleanupWorld for ThirdPersonMap, bSessionEnded=true, bCleanupResources=true
LogSlate: InvalidateAllWidgets triggered. All widgets were invalidated
LogWorldPartition: UWorldPartition::Uninitialize : World = /Game/ThirdPerson/Maps/UEDPIE_0_ThirdPersonMap.ThirdPersonMap
LogContentBundle: [ThirdPersonMap(Standalone)] Deleting container.
LogWorldMetrics: [UWorldMetricsSubsystem::Deinitialize]
LogWorldMetrics: [UWorldMetricsSubsystem::Clear]
LogPlayLevel: Display: Shutting down PIE online subsystems
LogSlate: InvalidateAllWidgets triggered. All widgets were invalidated
LogRuntimeAudioImporter: Warning: Imported sound wave ('CapturableSoundWave_1') data will be cleared because it is being unloaded
LogSlate: Updating window title bar state: overlay mode, drag disabled, window buttons hidden, title bar hidden
LogAudioMixer: Deinitializing Audio Bus Subsystem for audio device with ID 3
LogAudioMixer: FMixerPlatformXAudio2::StopAudioStream() called. InstanceID=3
LogAudioMixer: FMixerPlatformXAudio2::StopAudioStream() called. InstanceID=3
LogUObjectHash: Compacting FUObjectHashTables data took 0.79ms
LogPlayLevel: Display: Destroying online subsystem :Context_8

gtreshchev commented 3 months ago

Based on your logs, it seems your speech segment went through only two analysis passes in total. Two things caught my attention in your logs:

  1. When [BLANK_AUDIO] is recognized for a text segment, the processed audio data size is quite small. In the first case, it's 79360 samples, which could be less than a second at a sample rate of 44100 Hz with 2 channels. In the second case, it's 17600 samples, which is even smaller, roughly 0.2 seconds under the same sample rate and channel count. These durations are too brief to assume there's any speech data present, particularly in the second case. I suggest confirming that your voice is actually being captured in these recognitions by playing back the capturable sound wave (you can use the PlaySound2D function, for example; see the playback sketch after this list).
  2. Towards the end of your logs, there's a message stating Can't capture sound wave, likely caused by attempts to start capture on multiple capturable sound waves simultaneously, which isn't supported by the hardware or by UE's per-platform microphone handling. Please make sure you don't run multiple captures at the same time, and start a new capture only after the previous one has been stopped (i.e., after StopCapture has been called).
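
For reference, the duration estimate in point 1 works out to 79360 / (44100 × 2) ≈ 0.9 s and 17600 / (44100 × 2) ≈ 0.2 s under the assumed 44.1 kHz stereo capture. Below is a minimal C++ sketch of the playback check from point 1 (the same thing can be done in Blueprints with a PlaySound2D node). UGameplayStatics::PlaySound2D is a standard UE function and UCapturableSoundWave comes from RuntimeAudioImporter, but the wrapper class AMyCharacter, the header path, and the call site are illustrative assumptions rather than code from the plugin:

```cpp
// Hedged sketch: verify that the capturable sound wave actually contains audio
// by playing it back locally. AMyCharacter and the UCapturableSoundWave header
// path are assumptions for illustration.
#include "Kismet/GameplayStatics.h"
#include "Sound/CapturableSoundWave.h" // assumed header location; adjust to the plugin's actual path

void AMyCharacter::VerifyCapturedAudio(UCapturableSoundWave* CapturedWave)
{
	if (!CapturedWave)
	{
		UE_LOG(LogTemp, Warning, TEXT("No capturable sound wave to verify"));
		return;
	}

	// If this playback is silent, the recognizer is being fed blank audio
	// and the "[BLANK_AUDIO]" segments are expected.
	UGameplayStatics::PlaySound2D(this, CapturedWave);
}
```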

Also, the Blueprint implementation you created appears to trigger recognition on clicks, and you derived these nodes from the basic streaming example. However, that example isn't intended for a continuous stop/start workflow; it's purely for demonstration purposes. I suggest exploring the demo project, which handles microphone recognition with more attention to continuous recognition and interruption handling, as it generally offers a better design for more complex solutions: Demo Project.
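
As a rough illustration of the sequential-capture idea (not code taken from the demo project), here is a hedged C++ sketch that gates capture behind a flag so a second capture can't start while one is still running, which is what typically triggers the Can't capture sound wave message. StopCapture is referenced above; the StartCapture name, its DeviceId parameter, and the assumed member variables are illustrative and may differ from the actual RuntimeAudioImporter API:

```cpp
// Hedged sketch: allow only one microphone capture at a time.
// bCaptureInProgress (bool) and CapturableSoundWave (UCapturableSoundWave*)
// are assumed members of AMyCharacter; StartCapture's signature is assumed.
bool AMyCharacter::TryStartCapture()
{
	if (bCaptureInProgress || !CapturableSoundWave)
	{
		// Either a capture is already running or the sound wave isn't created yet;
		// starting a second capture here is what can fail on the device side.
		return false;
	}

	CapturableSoundWave->StartCapture(/*DeviceId (assumed parameter)*/ 0);
	bCaptureInProgress = true;
	return true;
}

void AMyCharacter::FinishCapture()
{
	if (bCaptureInProgress && CapturableSoundWave)
	{
		CapturableSoundWave->StopCapture();
		bCaptureInProgress = false;
	}
}
```

The same gating can be done in Blueprints with a simple boolean guard around the capture-start node.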

And by the way, if you prefer to suppress [BLANK_AUDIO] messages, you can do so by setting bSuppressBlank to true (SetSuppressBlank function).
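
A one-line sketch of that call, assuming the recognizer object is held as a C++ member pointer; SetSuppressBlank is the function named above, while the member name and call site are illustrative:

```cpp
// Hedged sketch: stop [BLANK_AUDIO] segments from being reported.
// SpeechRecognizer is an assumed member pointer to the recognizer object.
if (SpeechRecognizer)
{
	SpeechRecognizer->SetSuppressBlank(true);
}
```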

arbertrary commented 3 months ago

@liangzhenbin1992 I had the same problem just today (which is how I found this issue). For me, the problem was actually solved after updating both the RuntimeSpeechRecognizer and RuntimeAudioImporter plugins.

Would be interesting to know what had caused it, though.

gtreshchev commented 2 months ago

> @liangzhenbin1992 I had the same problem just today (which is how I found this issue). For me, the problem was actually solved after updating both the RuntimeSpeechRecognizer and RuntimeAudioImporter plugins.
>
> Would be interesting to know what had caused it, though.

It could be due to issues in earlier versions of the underlying whisper.cpp library, which have since been fixed :)

gtreshchev commented 2 months ago

Please reopen if it's still relevant.