glut23 / webvtt-py

Read, write, convert and segment WebVTT caption files in Python.
MIT License

Transcript file metadata missing #55

Closed: peter-boucher closed this issue 3 months ago

peter-boucher commented 7 months ago

When a Teams conversation transcript is exported as a .vtt file, each caption carries 'voice' metadata containing the speaker's screen name.

e.g.

WEBVTT

00:00:00.000 --> 00:00:00.800
<v Lisa Simpson>Knock knock</v>

00:00:02.100 --> 00:00:06.500
<v Homer Simpson>Who's there?</v>

00:00:10.530 --> 00:00:11.090
<v Lisa Simpson>Atish</v>

When I use webvtt to convert these captions to JSONL for analysis, I'd like to preserve this metadata for context.

current output:

{"start": "00:00:00.000", "end": "00:00:00.800", "text": "Knock knock"}
{"start": "00:00:02.100", "end": "00:00:06.500", "text": "Who's there?"}
{"start": "00:00:10.530", "end": "00:00:11.090", "text": "Atish"}

desired output:

{"start": "00:00:00.000", "end": "00:00:00.800", "text": "Knock knock", "sender_name": "Lisa Simpson"}
{"start": "00:00:02.100", "end": "00:00:06.500", "text": "Who's there?", "sender_name": "Homer Simpson"}
{"start": "00:00:10.530", "end": "00:00:11.090", "text": "Atish", "sender_name": "Lisa Simpson"}

Sample code:

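import json

import webvtt


def vtt_to_jsonl(vtt_file, jsonl_file):
    # Parse the WebVTT file into caption objects.
    captions = webvtt.read(vtt_file)

    # Write one JSON object per line (JSONL).
    with open(jsonl_file, 'w') as f:
        for caption in captions:
            caption_json = {
                'start': caption.start,
                'end': caption.end,
                'text': caption.text,
                # 'sender_name': caption.voice,  # desired: speaker from the <v ...> span
            }
            json.dump(caption_json, f)
            f.write('\n')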
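
The commented-out `sender_name` line is the missing piece. As a stopgap, the speaker can be pulled out of the raw cue payload with the standard library alone. This is a minimal sketch, not part of webvtt-py: `split_voice` and `cue_to_json` are hypothetical helper names, and it assumes the raw payload (with its `<v ...>` tag still present) is available from somewhere, for example by reading the file directly.

```python
import json
import re

# Matches a WebVTT voice span such as "<v Lisa Simpson>Knock knock</v>".
# Optional classes ("<v.loud Lisa>") are skipped; the closing tag is optional.
VOICE_SPAN = re.compile(r"<v(?:\.[^\s>]+)?\s+([^>]+)>(.*?)(?:</v>|$)", re.DOTALL)


def split_voice(raw_text):
    """Return (speaker, text) for a cue payload; speaker is None without a voice span."""
    match = VOICE_SPAN.search(raw_text)
    if match is None:
        return None, raw_text.strip()
    return match.group(1).strip(), match.group(2).strip()


def cue_to_json(start, end, raw_text):
    """Build one JSONL line in the shape of the desired output (hypothetical helper)."""
    speaker, text = split_voice(raw_text)
    record = {"start": start, "end": end, "text": text}
    if speaker is not None:
        record["sender_name"] = speaker
    return json.dumps(record)
```

Cues without a voice span simply omit `sender_name`, so mixed files still produce valid JSONL. The pattern only handles a single voice span per cue, which matches the Teams sample above.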

glut23 commented 3 months ago

Hi @peter-boucher, version 0.5.1 adds support for this. Closing the issue. Thanks for raising it.
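
The reply does not say which attribute carries the speaker name. The sketch below assumes it is exposed as `caption.voice`, mirroring the commented-out line in the sample code; check the 0.5.1 release notes before relying on that. `caption_to_record` and `write_jsonl` are hypothetical helpers, and `getattr` keeps the code working on versions where the attribute does not exist.

```python
import json


def caption_to_record(caption):
    """Map a caption-like object to the JSONL record requested in this issue."""
    record = {
        "start": caption.start,
        "end": caption.end,
        "text": caption.text,
    }
    # Assumption: the speaker name lives in `caption.voice`.
    # Falls back to no sender_name when the attribute is missing or empty.
    voice = getattr(caption, "voice", None)
    if voice:
        record["sender_name"] = voice
    return record


def write_jsonl(captions, jsonl_file):
    """Write one JSON object per caption, one per line."""
    with open(jsonl_file, "w", encoding="utf-8") as f:
        for caption in captions:
            f.write(json.dumps(caption_to_record(caption)) + "\n")
```

With the library installed, this would be called as `write_jsonl(webvtt.read(vtt_file), jsonl_file)`.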