tl-its-umich-edu / annoto-gai

GitHub project for the Annoto GAI work

Addresses Issue #1 #30

Closed takposha closed 4 months ago

takposha commented 4 months ago

Adds code to segment the transcript into sentences using the spaCy library, fixing issue #1. The spaCy model attempts to detect sentence boundaries in the transcript, and each detected boundary is matched to the end of a caption in the SRT file. This helps ensure that whole sentences are captured, so a generated question inserted at some point in a video will not interrupt a sentence midway.
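The boundary matching described above could look roughly like the sketch below. It is not the code from this PR: the `Caption` tuple layout and the function names `join_captions` and `sentence_end_times` are hypothetical, and the sentence end offsets are assumed to come from a sentence segmenter run on the joined transcript (with spaCy, for example, the `end_char` of each span in `doc.sents`). Each sentence end is snapped forward to the end of the caption that contains it.

```python
from bisect import bisect_left
from typing import List, Tuple

# Hypothetical caption layout: (start_seconds, end_seconds, text)
Caption = Tuple[float, float, str]


def join_captions(captions: List[Caption]) -> Tuple[str, List[int]]:
    """Join caption texts with single spaces.

    Returns the joined transcript and, for each caption, the character
    offset in the transcript at which that caption's text ends.
    """
    parts: List[str] = []
    caption_ends: List[int] = []
    pos = 0
    for _, _, text in captions:
        text = text.strip()
        if parts:
            pos += 1  # account for the joining space
        pos += len(text)
        parts.append(text)
        caption_ends.append(pos)
    return " ".join(parts), caption_ends


def sentence_end_times(
    captions: List[Caption], sentence_end_chars: List[int]
) -> List[float]:
    """Map each sentence's end offset to the end time of the caption holding it."""
    if not captions:
        return []
    _, caption_ends = join_captions(captions)
    times: List[float] = []
    for end_char in sentence_end_chars:
        # First caption whose text ends at or after the sentence end.
        idx = min(bisect_left(caption_ends, end_char), len(captions) - 1)
        times.append(captions[idx][1])
    return times
```

With this approach a sentence that ends partway through a caption is extended to that caption's end time, so a question placed at the returned timestamp falls after the sentence has been fully spoken.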

The script accounts for the possibility that sentence segmentation might fail, which can occur when transcription quality is poor. Failure is detected by checking whether any segmented sentence is over 2 minutes long. If so, the script falls back to a simpler segmentation process, where the SRT file is used directly to create 30-second segments of text. This should allow videos to be processed even when the transcription quality is not ideal.
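A minimal sketch of that fallback logic follows, assuming segments and captions are both represented as `(start_seconds, end_seconds, text)` tuples. The function names and the tuple layout are hypothetical and not taken from the PR; only the 2-minute threshold and the 30-second window come from the description above.

```python
from typing import List, Tuple

# Hypothetical layout shared by captions and segments:
# (start_seconds, end_seconds, text)
Segment = Tuple[float, float, str]


def segmentation_failed(sentences: List[Segment], max_seconds: float = 120.0) -> bool:
    """Treat segmentation as failed if any sentence is longer than max_seconds."""
    return any(end - start > max_seconds for start, end, _ in sentences)


def fixed_window_segments(
    captions: List[Segment], window_seconds: float = 30.0
) -> List[Segment]:
    """Group consecutive captions into segments of roughly window_seconds each."""
    segments: List[Segment] = []
    texts: List[str] = []
    seg_start = None
    for start, end, text in captions:
        if seg_start is None:
            seg_start = start
        texts.append(text.strip())
        # Close the segment once it spans at least the window length.
        if end - seg_start >= window_seconds:
            segments.append((seg_start, end, " ".join(texts)))
            texts, seg_start = [], None
    if texts:
        # Flush any captions left over after the last full window.
        segments.append((seg_start, captions[-1][1], " ".join(texts)))
    return segments


def choose_segments(sentences: List[Segment], captions: List[Segment]) -> List[Segment]:
    """Use sentence segments when they look sane, else fall back to fixed windows."""
    if segmentation_failed(sentences):
        return fixed_window_segments(captions)
    return sentences
```

Because the fallback closes a segment only at a caption boundary, segments can run slightly longer than 30 seconds when a caption straddles the window edge.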