jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike other Selenium-based solutions!
MIT License
2.55k stars, 280 forks

duration miscalculated? #189

Closed: BazookamanPH closed this issue 1 year ago

BazookamanPH commented 1 year ago

To Reproduce

Steps to reproduce the behavior: Extract SRT data per Jason's code, as below. The returned values of text and start are correct (and match values screen-scraped from YouTube later); the duration value covers both the current line and the immediately following line.

I'm pulling videos featuring remote participants, created by my group (using, I believe, OBS, Open Broadcaster Software) and posted directly to YouTube as MKV (Matroska). (Video is instantly available for playing in YouTube upon completion of the livestream, albeit without Google's auto-CC for a day or two. Could this indicate that the timestamp file is incorrectly capturing double-line timing info on the fly?)

Which Python version are you using?

Python 3.11.1

Which version of youtube-transcript-api are you using?

v0.05.0, Oct 26 2022

What code / cli command are you executing?

    import sys
    from requests_html import HTMLSession
    from youtube_transcript_api import YouTubeTranscriptApi

    videoTag = "RO3XVXqvAII"
    video_url = "https://www.youtube.com/watch?v=" + videoTag
    session = HTMLSession()
    response = session.get(video_url)
    srt = YouTubeTranscriptApi.get_transcript(videoTag)
    for srtPkt in srt:
        print(srtPkt)

Expected behavior

key 'duration' should return length of only this SRT line.
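
A minimal sketch of one way to get the per-line length described here, assuming it is acceptable to cap each entry's duration at the gap to the next entry's start. This is post-processing on the returned list, not something the library provides; the helper name `clamp_durations` is hypothetical, and the sample entries are copied from the output reported in this issue.

```python
def clamp_durations(entries):
    """Return copies of entries whose duration never runs past the next start."""
    clamped = []
    for current, following in zip(entries, entries[1:]):
        # Time between this entry's start and the next entry's start.
        gap = round(following["start"] - current["start"], 3)
        clamped.append({**current, "duration": min(current["duration"], gap)})
    if entries:
        # The last entry has no successor, so it is copied unchanged.
        clamped.append(dict(entries[-1]))
    return clamped


# Entries copied from the output reported in this issue.
sample = [
    {"text": "live stream 52.2 on March 9th 2023.", "start": 9.38, "duration": 9.52},
    {"text": "welcome to the octave Institute we're a", "start": 16.02, "duration": 5.16},
    {"text": "participatory online Institute that is", "start": 18.9, "duration": 3.84},
]

for entry in clamp_durations(sample):
    print(entry["start"], entry["duration"], entry["text"])
```

The original list is left untouched, so the reported and the clamped durations can be compared side by side.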

Actual behaviour

'duration' includes the length of this AND THE FOLLOWING line in the original SRT, e.g. 9.38+9.52 == 18.90; 16.02+5.16 == 21.18. (The first 'duration' value MAY be correct; I haven't researched this.)

    {'text': 'foreign', 'start': 1.5, 'duration': 3.0}
    {'text': 'live stream 52.2 on March 9th 2023.', 'start': 9.38, 'duration': 9.52}
    {'text': "welcome to the octave Institute we're a", 'start': 16.02, 'duration': 5.16}
    {'text': 'participatory online Institute that is', 'start': 18.9, 'duration': 3.84}
    {'text': 'communicating learning and practicing', 'start': 21.18, 'duration': 4.259}
    ...
    {'text': 'farewell', 'start': 6941.1, 'duration': 4.92}
    {'text': "sealance there's", 'start': 6942.96, 'duration': 5.6}
    {'text': 'thank you', 'start': 6946.02, 'duration': 2.54}
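
The arithmetic above can be checked with a short sketch. It computes each entry's end time as start + duration and lists the later entries that begin before that end. The helper name `overlaps` and the small float tolerance are assumptions for illustration; the sample entries are the first five lines of the reported output.

```python
def overlaps(entries, tolerance=1e-9):
    """Map each entry's index to the indices of later entries that start before it ends."""
    result = {}
    for i, entry in enumerate(entries):
        end = entry["start"] + entry["duration"]
        result[i] = [
            j
            for j in range(i + 1, len(entries))
            # The tolerance keeps float rounding from counting an entry that
            # starts exactly at this entry's end as an overlap.
            if entries[j]["start"] < end - tolerance
        ]
    return result


# First five entries of the output reported in this issue.
sample = [
    {"text": "foreign", "start": 1.5, "duration": 3.0},
    {"text": "live stream 52.2 on March 9th 2023.", "start": 9.38, "duration": 9.52},
    {"text": "welcome to the octave Institute we're a", "start": 16.02, "duration": 5.16},
    {"text": "participatory online Institute that is", "start": 18.9, "duration": 3.84},
    {"text": "communicating learning and practicing", "start": 21.18, "duration": 4.259},
]

for index, later in overlaps(sample).items():
    print(index, later)
```

On this sample the first entry overlaps nothing, and each of the next three overlaps exactly the one entry that follows it, which is consistent with a duration that spans the current line and the line after it.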

jdepoix commented 1 year ago

Hi @BazookamanPH, please have a look at #21.

Closing this now as it is a duplicate.