langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

When using Tencent Cloud ASR on an Android phone, recognition failure often occurs and no content is recognized. #7418

Open lvxinliang opened 3 weeks ago

lvxinliang commented 3 weeks ago

Dify version

0.6.16

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Call Tencent Cloud ASR through the API.
  2. The first ASR request usually succeeds, but later requests fail with high probability.
  3. The result is as shown in the attached screenshot (d070ff5538b189d8a4859fc721fb4c19).

✔️ Expected Behavior

ASR is executed successfully and returns a string

❌ Actual Behavior

ASR processing result is empty

dosubot[bot] commented 3 weeks ago

@lvxinliang Hello there! I'm here to help you with bug fixes, answer questions, and guide you on becoming a contributor. Let's work together to resolve any issues you encounter.

The issue with Tencent Cloud ASR failing to recognize any content on an Android phone when using version 0.6.16 of the Dify app could be due to the first_channel_only parameter being set to 1 by default in the FlashRecognitionRequest class. This setting means that only the first audio channel is being processed, which might not contain any data if the audio is recorded in a different channel.

To resolve this issue, you can try setting first_channel_only to 0 to ensure that all audio channels are processed:

class FlashRecognitionRequest:
    def __init__(self, voice_format="mp3", engine_type="16k_zh"):
        self.engine_type = engine_type
        self.speaker_diarization = 0
        self.hotword_id = ""
        self.customization_id = ""
        self.filter_dirty = 0
        self.filter_modal = 0
        self.filter_punc = 0
        self.convert_num_mode = 1
        self.word_info = 0
        self.voice_format = voice_format
        self.first_channel_only = 1  # This might be causing the issue
        self.reinforce_hotword = 0
        self.sentence_max_length = 0

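# The excerpt above shows only __init__, so set the attribute directly here.
flash_recognition_request = FlashRecognitionRequest()
flash_recognition_request.first_channel_only = 0  # 0 = process all channels, not only the first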

This change should help in processing the audio content correctly, regardless of which channel the audio is recorded in [1].

To continue talking to Dosu, mention @dosu.

lvxinliang commented 3 weeks ago

I changed it to 0 as you suggested and confirmed that the change took effect, but recognition still fails on Android. It works on the iPhone, so I suspect the audio is not being captured on Android.

        self.voice_format = voice_format
        self.first_channel_only = 0
        self.reinforce_hotword = 0
        self.sentence_max_length = 0
        logger.info("engine_type: %s, voice_format: %s, speaker_diarization: %s, hotword_id: %s, "
                    "customization_id: %s, filter_dirty: %s, filter_modal: %s, filter_punc: %s, convert_num_mode: %s, "
                    "word_info: %s, first_channel_only: %s, reinforce_hotword: %s, sentence_max_length: %s",
                    self.engine_type, self.voice_format, self.speaker_diarization, self.hotword_id,
                    self.customization_id, self.filter_dirty, self.filter_modal, self.filter_punc,
                    self.convert_num_mode, self.word_info, self.first_channel_only, self.reinforce_hotword,
                    self.sentence_max_length)

log:

2024-08-20 15:30:31,981.981 INFO [Dummy-2] [flash_recognizer.py:32] - engine_type: 16k_zh, voice_format: mp3, speaker_diarization: 0, hotword_id: , customization_id: , filter_dirty: 0, filter_modal: 0, filter_punc: 0, convert_num_mode: 1, word_info: 0, first_channel_only: 0, reinforce_hotword: 0, sentence_max_length: 0
2024-08-20 15:30:53,897.897 INFO [Dummy-3] [flash_recognizer.py:32] - engine_type: 16k_zh, voice_format: mp3, speaker_diarization: 0, hotword_id: , customization_id: , filter_dirty: 0, filter_modal: 0, filter_punc: 0, convert_num_mode: 1, word_info: 0, first_channel_only: 0, reinforce_hotword: 0, sentence_max_length: 0
2024-08-20 15:31:07,469.469 INFO [Dummy-4] [flash_recognizer.py:32] - engine_type: 16k_zh, voice_format: mp3, speaker_diarization: 0, hotword_id: , customization_id: , filter_dirty: 0, filter_modal: 0, filter_punc: 0, convert_num_mode: 1, word_info: 0, first_channel_only: 0, reinforce_hotword: 0, sentence_max_length: 0
lvxinliang commented 3 weeks ago

When recording on an Android phone, I added code to dump the uploaded MP3 and found that the saved file is sometimes 0 bytes long:

        # file: audio_service.py
        file_content = file.read()
        file_size = len(file_content)
        # dump mp3 file
        dump_file_name = f"/tmp/{int(time.time())}.mp3"
        with open(dump_file_name, 'wb') as f:
            f.write(file_content)
        logger.info("Audio file saved to %s, file size: %s", dump_file_name, file_size)
        if file_size > FILE_SIZE_LIMIT:
            message = f"Audio size larger than {FILE_SIZE} mb"
            raise AudioTooLargeServiceError(message)

        model_manager = ModelManager()
        model_instance = model_manager.get_default_model_instance(
            tenant_id=app_model.tenant_id,
            model_type=ModelType.SPEECH2TEXT
        )
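
Since the dump shows that the upload is sometimes 0 bytes, one option is to reject empty audio before it reaches the speech-to-text model, so the failure surfaces as a clear error instead of an empty ASR result. The following is only a minimal sketch, not Dify's actual code: `read_audio`, `EmptyAudioError`, and the 30 MB value of `FILE_SIZE_LIMIT` are hypothetical names and values chosen for illustration.

```python
import io
import logging

logger = logging.getLogger(__name__)

# Assumed limit (30 MB) for illustration only; use the project's real constant.
FILE_SIZE_LIMIT = 30 * 1024 * 1024


class EmptyAudioError(ValueError):
    """Hypothetical error for uploads that contain no audio bytes."""


def read_audio(file) -> bytes:
    """Read an uploaded file-like object and reject empty or oversized content."""
    content = file.read()
    size = len(content)
    logger.info("Received audio upload, size: %s bytes", size)
    if size == 0:
        # Nothing was captured on the client, so ASR could only return an empty result.
        raise EmptyAudioError("Audio upload is empty (0 bytes)")
    if size > FILE_SIZE_LIMIT:
        raise ValueError("Audio upload exceeds the size limit")
    return content


if __name__ == "__main__":
    # A non-empty upload passes through unchanged; an empty one is rejected.
    print(len(read_audio(io.BytesIO(b"\xff\xfb\x90\x00"))))
    try:
        read_audio(io.BytesIO(b""))
    except EmptyAudioError as exc:
        print(exc)
```

A check like this does not explain why the Android client sends no data, but it separates "nothing was recorded" from "ASR failed to recognize the audio" in the logs.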