pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Create Annotation object from bazinga transcript #1262

Closed · Sammi-Smith closed this 1 year ago

Sammi-Smith commented 1 year ago

Hi pyannote team,

First of all, this suite of tools is amazing - kudos to the team for putting these awesome tools together and for continuing to improve them.

I am looking to use some of the tools within pyannote-audio, along with the bazinga dataset, to generate embeddings for a particular speaker and then identify that same speaker within other audio files (i.e., generate an embedding from a clip where only Sheldon is speaking, then find segments within other audio files that are likely to also be Sheldon).
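
For the second half of that plan, the matching step I have in mind looks roughly like the sketch below (untested; it assumes pyannote.audio 2.x's Inference class with the pyannote/embedding model, which may require a Hugging Face access token; the file names, segment boundaries, and 0.5 distance threshold are placeholders):

import numpy as np
from scipy.spatial.distance import cdist
from pyannote.audio import Inference
from pyannote.core import Segment

# Pre-trained speaker embedding model; window="whole" yields one
# embedding per excerpt.
inference = Inference("pyannote/embedding", window="whole")

# Reference embedding from a clip where only Sheldon speaks
# (file name and times are placeholders).
sheldon = inference.crop("sheldon_only.wav", Segment(1.49, 12.61))

# Embedding for a candidate segment from another audio file.
candidate = inference.crop("other_episode.wav", Segment(42.0, 45.0))

# Cosine distance between the two embeddings; smaller means more similar.
distance = cdist(np.atleast_2d(sheldon), np.atleast_2d(candidate),
                 metric="cosine")[0, 0]

# The 0.5 threshold is purely illustrative.
if distance < 0.5:
    print("likely sheldon_cooper")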

My question is this: the bazinga dataset contains a "transcript" of each audio file with details about the speaker, start and end times, etc., for each word spoken. How can I convert this word-by-word transcript, which is structured as a list of dictionaries (one dictionary per word), into an Annotation object that is summarized at the speaker level?

For example, for "TheBigBangTheory.Season01.Episode01", the first few items in the transcript list look like this (some items replaced with ... below, for brevity):

[{'token': 'So',
  'speaker': 'sheldon_cooper',
  'forced_alignment': {'start_time': 1.4900000095367432,
   'end_time': 1.6100000143051147,
   'confidence': 0.9900000095367432},
  'addressee': 'leonard_hofstadter',
  'named_entity': None,
  'entity_linking': None},
 {'token': 'if',
  'speaker': 'sheldon_cooper',
  'forced_alignment': {'start_time': 1.659999966621399,
   'end_time': 1.7300000190734863,
   'confidence': 0.9900000095367432},
  'addressee': 'leonard_hofstadter',
  'named_entity': None,
  'entity_linking': None},
 {'token': 'a',
  'speaker': 'sheldon_cooper',
  'forced_alignment': {'start_time': 1.7400000095367432,
   'end_time': 1.7899999618530273,
   'confidence': 0.9900000095367432},
  'addressee': 'leonard_hofstadter',
  'named_entity': None,
  'entity_linking': None},
 {'token': 'photon',
  'speaker': 'sheldon_cooper',
  'forced_alignment': {'start_time': 1.7999999523162842,
   'end_time': 2.190000057220459,
   'confidence': 0.9900000095367432},
  'addressee': 'leonard_hofstadter',
  'named_entity': None,
  'entity_linking': None},
...
 {'token': 'slits',
  'speaker': 'sheldon_cooper',
  'forced_alignment': {'start_time': 12.119999885559082,
   'end_time': 12.609999656677246,
   'confidence': 0.9900000095367432},
  'addressee': 'leonard_hofstadter',
  'named_entity': None,
  'entity_linking': None},
 {'token': '.',
  'speaker': 'sheldon_cooper',
  'forced_alignment': {'start_time': 12.609999656677246,
   'end_time': 12.609999656677246,
   'confidence': 0.949999988079071},
  'addressee': 'leonard_hofstadter',
  'named_entity': None,
  'entity_linking': None},
 {'token': 'Agreed',
  'speaker': 'leonard_hofstadter',
  'forced_alignment': {'start_time': 13.0,
   'end_time': 13.34000015258789,
   'confidence': 0.9900000095367432},
  'addressee': 'sheldon_cooper',
  'named_entity': None,
  'entity_linking': None},
 {'token': ',',
  'speaker': 'leonard_hofstadter',
  'forced_alignment': {'start_time': 13.34000015258789,
   'end_time': 13.34000015258789,
   'confidence': 0.10000000149011612},
  'addressee': 'sheldon_cooper',
  'named_entity': None,
  'entity_linking': None},
 ...
 {'token': 'point',
  'speaker': 'leonard_hofstadter',
  'forced_alignment': {'start_time': 14.390000343322754,
   'end_time': 14.710000038146973,
   'confidence': 0.9900000095367432},
  'addressee': 'sheldon_cooper',
  'named_entity': None,
  'entity_linking': None},
 {'token': '?',
  'speaker': 'leonard_hofstadter',
  'forced_alignment': {'start_time': 14.710000038146973,
   'end_time': 14.710000038146973,
   'confidence': 0.949999988079071},
  'addressee': 'sheldon_cooper',
  'named_entity': None,
  'entity_linking': None},
 ...]

How would we convert that into an Annotation object, let's call it bazinga_annotation, such that the output of bazinga_annotation.for_json()["content"] would look something like this? (A rough sketch of the conversion I have in mind follows the example.)

[{'segment': {'start': 1.4900000095367432,
   'end': 12.609999656677246},
  'track': 0,
  'label': 'sheldon_cooper'},
 {'segment': {'start': 13.0,
   'end': 14.710000038146973},
  'track': 1,
  'label': 'leonard_hofstadter'},
 ...]
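
Here is the kind of conversion I am picturing, as a rough, untested sketch: merge consecutive words from the same speaker into one segment spanning from the first word's start_time to the last word's end_time. The helper name transcript_to_annotation is mine, and it assumes every token carries a forced_alignment with start_time and end_time:

from pyannote.core import Annotation, Segment

def transcript_to_annotation(transcript, uri=None):
    # Collapse a word-level bazinga transcript into a speaker-level
    # Annotation: consecutive words from the same speaker become one
    # segment from the first word's start_time to the last word's end_time.
    annotation = Annotation(uri=uri)
    speaker, start, end, track = None, None, None, 0
    for word in transcript:
        alignment = word['forced_alignment']
        if word['speaker'] == speaker:
            # Same speaker keeps talking: extend the current segment.
            end = alignment['end_time']
        else:
            # Speaker changed: flush the previous turn, start a new one.
            if speaker is not None:
                annotation[Segment(start, end), track] = speaker
                track += 1
            speaker = word['speaker']
            start = alignment['start_time']
            end = alignment['end_time']
    # Flush the final turn.
    if speaker is not None:
        annotation[Segment(start, end), track] = speaker
    return annotation

# `transcript` is the list of word dictionaries shown above.
bazinga_annotation = transcript_to_annotation(transcript)
print(bazinga_annotation.for_json()['content'])

Is something along these lines the intended approach, or is there a built-in way to do this conversion?
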
stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.