Proper Examples of Diarization

aleksandr-smechov commented 1 year ago

I managed to get diarization working with the Python client using the code below.

import riva.client

# Set up an offline/batch recognition request with word-level timestamps
config = riva.client.RecognitionConfig(enable_word_time_offsets=True)
riva.client.add_speaker_diarization_to_config(config, True)  # Enable speaker diarization

# config.encoding = riva.client.AudioEncoding.LINEAR_PCM  # Audio encoding can be detected from wav
# config.sample_rate_hertz = 0                            # Sample rate can be detected from wav and resampled if needed
config.language_code = "en-US"                    # Language code of the audio clip
config.max_alternatives = 1                       # How many top-N hypotheses to return
config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
config.audio_channel_count = 1                    # Mono channel

However, some of the items in the words output don't contain a speaker label. Furthermore, the words output has no punctuation or capitalization. Here's an example:

    words {
      start_time: 5160
      end_time: 5320
      word: "hey"
      confidence: -1.50252652
    }
    words {
      start_time: 5400
      end_time: 5440
      word: "good"
      confidence: -1.88693285
    }
    words {
      start_time: 5560
      end_time: 5760
      word: "morning"
      confidence: 0.549859285
      speaker_tag: 1
    }
    words {
      start_time: 5800
      end_time: 6120
      word: "everybody"
      confidence: -1.12615502
      speaker_tag: 1
    }
    words {
      start_time: 6800
      end_time: 7120
      word: "welcome"
      confidence: -1.75561452
      speaker_tag: 1
    }
    words {
      start_time: 7200
      end_time: 7240
      word: "to"
      confidence: -0.27308622
      speaker_tag: 1
    }
    ...

Can we get a proper notebook/example of how to use diarization, including how to get a diarized, punctuated, and capitalized transcript back?

messiaen commented 1 year ago

Yes, we will be adding a tutorial / example for diarization in the future.

I'm not sure why the first couple of words in your example audio aren't getting assigned a speaker. The speaker_tag field is missing from the output because its value is the default (0), and protobuf does not print default values. A zero indicates that diarization did not assign the word a speaker.
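
Because an unset speaker_tag reads back as 0, grouping the word list into speaker turns could look roughly like the sketch below. The Word dataclass is a hypothetical stand-in for the words entries shown above (only the field names come from that output), and group_by_speaker is an illustrative helper, not part of the Riva client.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    # Hypothetical stand-in for one `words` entry; field names mirror the output above.
    word: str
    start_time: int
    end_time: int
    speaker_tag: int = 0  # protobuf default: 0 means no speaker was assigned


def group_by_speaker(words: List[Word]) -> List[Tuple[int, str]]:
    """Merge consecutive words that share a speaker_tag into (tag, text) turns."""
    turns: List[Tuple[int, List[str]]] = []
    for w in words:
        if turns and turns[-1][0] == w.speaker_tag:
            turns[-1][1].append(w.word)
        else:
            turns.append((w.speaker_tag, [w.word]))
    return [(tag, " ".join(tokens)) for tag, tokens in turns]


sample = [
    Word("hey", 5160, 5320),
    Word("good", 5400, 5440),
    Word("morning", 5560, 5760, speaker_tag=1),
    Word("everybody", 5800, 6120, speaker_tag=1),
]

for tag, text in group_by_speaker(sample):
    label = f"speaker {tag}" if tag else "unassigned"
    print(f"{label}: {text}")
```

Words with tag 0 end up in their own "unassigned" turn rather than being dropped, so nothing from the transcript is lost.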

Currently Riva does not punctuate or capitalize the individual word output. We plan to fix that in the future.
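
In the meantime, one possible workaround is to copy punctuation and casing from the full transcript onto the word list. This is only a rough sketch, not how Riva does it: it assumes the punctuated transcript is available as a plain string and splits on whitespace into the same number of tokens as the word list. The function name restore_punctuation is hypothetical.

```python
import string
from typing import List


def restore_punctuation(transcript: str, raw_words: List[str]) -> List[str]:
    """Return raw_words with punctuation/casing taken from the transcript where tokens match."""
    tokens = transcript.split()
    if len(tokens) != len(raw_words):
        # Token counts differ, so positional alignment is unsafe; keep the raw words.
        return list(raw_words)
    restored = []
    for token, raw in zip(tokens, raw_words):
        # Compare after stripping surrounding punctuation and lowercasing.
        normalized = token.strip(string.punctuation).lower()
        restored.append(token if normalized == raw.lower() else raw)
    return restored


print(restore_punctuation(
    "Hey, good morning everybody.",
    ["hey", "good", "morning", "everybody"],
))
```

If the transcript has been rewritten in ways that change the token count (numbers, for instance), the function falls back to the unpunctuated words instead of guessing.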

virajkarandikar commented 1 year ago

This is in progress.

virajkarandikar commented 1 year ago

Tutorial has been added at https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/asr-speaker-diarization.html

nvidia-riva / python-clients

Proper Examples of Diarization #40