pytube / pytube

A lightweight, dependency-free Python library (and command-line utility) for downloading YouTube Videos.
https://pytube.io
The Unlicense

[BUG] KeyError 'start' when getting captions from a video #1085

Open gabrielziegler3 opened 3 years ago

gabrielziegler3 commented 3 years ago

I keep getting KeyError: 'start' when I try to get a caption from a video in a playlist.

To reproduce, here is the code I am trying to test:

import pytube
from pytube import Playlist, YouTube

url = "https://www.youtube.com/watch?v=vKA4w2O61Xo&list=PLkahZjV5wKe8WFEwvs69V7JO-Cx57rZ8W"
p = Playlist(url)

for v in p.videos[:3]:
    print("trying to get captions for:", v.title)
    print(v.captions["a.en"].generate_srt_captions())

This code used to print the caption before updating pytube, but now it breaks with the following trace:

KeyError                                  Traceback (most recent call last)
~\test_pytube.py in <module>
     10 for v in p.videos[:3]:
     11     print("trying to get captions for:", v.title)
---> 12     print(v.captions["a.en"].generate_srt_captions())

~\AppData\Roaming\Python\Python38\site-packages\pytube\captions.py in generate_srt_captions(self)
     49         recompiles them into the "SubRip Subtitle" format.
     50         """
---> 51         return self.xml_caption_to_srt(self.xml_captions)
     52
     53     @staticmethod

~\AppData\Roaming\Python\Python38\site-packages\pytube\captions.py in xml_caption_to_srt(self, xml_captions)
     81             except KeyError:
     82                 duration = 0.0
---> 83             start = float(child.attrib["start"])
     84             end = start + duration
     85             sequence_number = i + 1  # convert from 0-indexed to 1.

KeyError: 'start'

github-actions[bot] commented 3 years ago

Thank you for contributing to PyTube. Please remember to reference Contributing.md

gabrielziegler3 commented 3 years ago

Is there any update regarding this issue? Has the way to download captions changed compared to versions before 11.0.0?

glubsy commented 3 years ago

I have not had a close look at the code, but the start to a possible solution would be to change this:

start = float(child.attrib["start"])

to this:

start = float(child.attrib.get("start", 0))

This will get rid of the exception, but might produce unwanted consequences. I don't know.
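To illustrate that concern, here is a minimal sketch. It assumes the newer caption XML uses `<p>` elements with `t` and `d` attributes in milliseconds, as later comments in this thread suggest; the sample XML and both helper names are made up for the example. With the `.get` default the KeyError goes away, but every cue would silently start at 0:

```python
from xml.etree import ElementTree

# Hypothetical sample in the newer format (t/d given in milliseconds).
SAMPLE = (
    '<timedtext><body>'
    '<p t="1200" d="800">hello</p>'
    '<p t="2500" d="900">world</p>'
    '</body></timedtext>'
)


def starts_with_default(xml_captions):
    """Read 'start' with a default of 0, as in the one-line change above."""
    body = ElementTree.fromstring(xml_captions).find("body")
    return [float(p.attrib.get("start", 0)) for p in body]


def starts_from_t(xml_captions):
    """Read the 't' attribute (milliseconds) and convert it to seconds."""
    body = ElementTree.fromstring(xml_captions).find("body")
    return [float(p.attrib["t"]) / 1000.0 for p in body]


print(starts_with_default(SAMPLE))  # every start collapses to 0.0
print(starts_from_t(SAMPLE))
```

So the default hides the exception rather than reading the timing from wherever the new format stores it.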

maksimbolonkin commented 2 years ago

Apparently YouTube changed their captions format.

Here's my version of the xml_caption_to_srt function in captions.py:

def xml_caption_to_srt(self, xml_captions: str) -> str:
      """Convert xml caption tracks to "SubRip Subtitle (srt)".

      :param str xml_captions:
          XML formatted caption tracks.
      """
      segments = []
      root = ElementTree.fromstring(xml_captions)[1]
      i=0
      for child in list(root):
          if child.tag == 'p':
              caption = ''
              if len(list(child))==0:
                  continue
              for s in list(child):
                  if s.tag == 's':
                      caption += ' ' + (s.text or '')  # s.text is None for an empty <s>
              caption = unescape(caption.replace("\n", " ").replace("  ", " "),)
              try:
                  duration = float(child.attrib["d"])/1000.0
              except KeyError:
                  duration = 0.0
              start = float(child.attrib["t"])/1000.0
              end = start + duration
              sequence_number = i + 1  # convert from 0-indexed to 1.
              line = "{seq}\n{start} --> {end}\n{text}\n".format(
                  seq=sequence_number,
                  start=self.float_to_srt_time_format(start),
                  end=self.float_to_srt_time_format(end),
                  text=caption,
              )
              segments.append(line)
              i += 1
      return "\n".join(segments).strip()

Rashad-j commented 2 years ago

I have the same problem. Any solutions yet?

@maksimbolonkin thanks for the suggestion, I am using it.

geomags3 commented 2 years ago

@maksimbolonkin I changed your code a bit, so now it works for me as well 👌 The issue I found was caused by the fact that some captions are located directly inside the <p> tag rather than in nested <s> tags

    def xml_caption_to_srt(self, xml_captions: str) -> str:
        """Convert xml caption tracks to "SubRip Subtitle (srt)".

        :param str xml_captions:
        XML formatted caption tracks.
        """
        segments = []
        root = ElementTree.fromstring(xml_captions)
        i=0
        for child in list(root.iter("body"))[0]:
            if child.tag == 'p':
                caption = ''
                if len(list(child))==0:
                    # instead of 'continue'; child.text is None for an empty <p>
                    caption = child.text or ''
                for s in list(child):
                    if s.tag == 's':
                        caption += ' ' + (s.text or '')
                caption = unescape(caption.replace("\n", " ").replace("  ", " "),)
                try:
                    duration = float(child.attrib["d"])/1000.0
                except KeyError:
                    duration = 0.0
                start = float(child.attrib["t"])/1000.0
                end = start + duration
                sequence_number = i + 1  # convert from 0-indexed to 1.
                line = "{seq}\n{start} --> {end}\n{text}\n".format(
                    seq=sequence_number,
                    start=self.float_to_srt_time_format(start),
                    end=self.float_to_srt_time_format(end),
                    text=caption,
                )
                segments.append(line)
                i += 1
        return "\n".join(segments).strip()

So to fix this bug we can just replace xml_caption_to_srt in the Caption class (pytube/captions.py) with the code above. Hope it's gonna work for everyone 👍
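For anyone who wants to try the idea outside of pytube first, here is a standalone sketch using only the standard library. It assumes the same `<p t=".." d="..">` layout in milliseconds discussed above; `srt_time`, `timedtext_to_srt`, and the sample XML are names and data invented for this example, not part of pytube. It uses `itertext()` so that text sitting directly in `<p>` and text in nested `<s>` tags are both picked up:

```python
from html import unescape
from xml.etree import ElementTree


def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def timedtext_to_srt(xml_captions):
    """Convert <p t=".." d=".."> cues (milliseconds) to SRT text."""
    root = ElementTree.fromstring(xml_captions)
    segments = []
    for p in root.iter("p"):
        # itertext() yields text directly inside <p> and inside nested <s> tags;
        # split/join collapses newlines and repeated spaces.
        text = " ".join(" ".join(p.itertext()).split())
        if not text:
            continue  # skip cues that carry no text
        start = float(p.attrib["t"]) / 1000.0
        end = start + float(p.attrib.get("d", 0)) / 1000.0
        segments.append(
            f"{len(segments) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{unescape(text)}\n"
        )
    return "\n".join(segments).strip()


# Hypothetical sample: one plain cue, one cue with nested <s> tags, one empty cue.
SAMPLE = (
    '<timedtext><body>'
    '<p t="0" d="1500">plain text cue</p>'
    '<p t="1500" d="2000"><s>nested</s><s> words</s></p>'
    '<p t="4000" d="500"></p>'
    '</body></timedtext>'
)
print(timedtext_to_srt(SAMPLE))
```

Skipping empty cues is a design choice of this sketch; the function above keeps them with empty text instead.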

urna commented 2 years ago

it works for me too. thanks

MaggieKuo commented 2 years ago

It worked for me. Thanks.

urna commented 2 years ago

Email received; I usually reply within 1~2 days. If it is urgent, please contact me directly by phone. ---- My WeChat: haijun-data

Joezhouzmz commented 2 years ago

@maksimbolonkin Your version of the function works for me as well. Thanks!

Mhmd-Hisham commented 2 years ago

I was trying to download videos from here: https://www.youtube.com/watch?v=gqaHkPEZAew&list=PLoROMvodv4rOSH4v6133s9LFPRHjEmbmJ

The code snippet above didn't work out for me. So I had to modify it a little. Here's the code that worked for me just in case anyone needs it:

def xml_caption_to_srt(self, xml_captions: str) -> str:
    """Convert xml caption tracks to "SubRip Subtitle (srt)".

    :param str xml_captions:
        XML formatted caption tracks.
    """
    segments = []
    root = ElementTree.fromstring(xml_captions)[0]
    i=0
    for child in list(root):
        if child.tag == 'p':
            caption = child.text or ""  # child.text is None for an empty <p>
            caption = unescape(caption.replace("\n", " ").replace("  ", " "),)
            try:
                duration = float(child.attrib["d"])/1000.0
            except KeyError:
                duration = 0.0
            start = float(child.attrib["t"])/1000.0
            end = start + duration
            sequence_number = i + 1  # convert from 0-indexed to 1.
            line = "{seq}\n{start} --> {end}\n{text}\n".format(
                seq=sequence_number,
                start=self.float_to_srt_time_format(start),
                end=self.float_to_srt_time_format(end),
                text=caption,
            )
            segments.append(line)
            i += 1
    return "\n".join(segments).strip()

victoriano commented 2 years ago

It would be great if someone created a pull request with these changes, so that the original library works without having to patch it :P

bigbear22941 commented 1 year ago

@geomags3 Thx... your version works for me

YoungXu06 commented 1 year ago

@geomags3 Your fix produces wrong subtitle time frames, like: bug_1

which should be: bug_2

I managed to correct this mismatch with a post-processing script:

def subtitle_clean(subtitle_text, save_path):
    subtitle_text = subtitle_text.split('\n\n')
    sent_num = 1

    with open(save_path, 'w') as f:
        for i in range((len(subtitle_text)-1) // 2):
            _, temp_time, temp_text = subtitle_text[i*2].split('\n')
            temp_start = temp_time.strip().split(' --> ')[0]
            _, next_time, _ = subtitle_text[(i+1)*2].split('\n')
            temp_end = next_time.strip().split(' --> ')[0]
            f.write(str(sent_num) + '\n')
            f.write(temp_start + ' --> ' + temp_end + '\n')
            f.write(temp_text.strip() + '\n')
            f.write('\n')
            sent_num += 1

        _, temp_time, temp_text = subtitle_text[-1].split('\n')
        f.write(str(sent_num) + '\n')
        f.write(temp_time + '\n')
        f.write(temp_text.strip() + '\n')

Does anyone know how to fix this mismatch in the source code?
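One possible direction, sketched under an assumption: the screenshots are not reproduced here, so this guesses that the mismatch is cues whose end time runs past the start of the next cue. In that case the end could be clamped before the SRT lines are formatted. `clamp_overlaps` and the sample cues are hypothetical and not part of pytube:

```python
def clamp_overlaps(cues):
    """Trim each cue's end so it never runs past the next cue's start.

    cues: list of (start_seconds, end_seconds, text), sorted by start.
    """
    fixed = []
    for index, (start, end, text) in enumerate(cues):
        if index + 1 < len(cues):
            next_start = cues[index + 1][0]
            end = min(end, next_start)
        fixed.append((start, end, text))
    return fixed


# Made-up cues: the first one overlaps the second by 1.5 seconds.
cues = [(0.0, 4.0, "first"), (2.5, 6.0, "second"), (6.0, 7.0, "third")]
print(clamp_overlaps(cues))
```

Inside xml_caption_to_srt this would mean collecting start, end, and text for all cues first, clamping, and only then building the output lines.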

oleh-sorokin commented 1 year ago

Could anyone explain how to use this new function @maksimbolonkin wrote? I take it we need to replace the function in the library. How could I do that?
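One way to do this without editing the installed files is to swap the method at runtime (monkey-patching). The sketch below assumes the `Caption` class in `pytube/captions.py` and its `float_to_srt_time_format` helper, both of which appear in the code posted in this thread; `patched_xml_caption_to_srt` and `apply_patch` are names made up for the example, and the function body is a compact variant of the versions above rather than the library's own code:

```python
from html import unescape
from xml.etree import ElementTree


def patched_xml_caption_to_srt(self, xml_captions):
    """Replacement modelled on the versions posted in this thread."""
    segments = []
    for p in ElementTree.fromstring(xml_captions).iter("p"):
        # Collect text from <p> itself and from nested <s> tags.
        text = " ".join(" ".join(p.itertext()).split())
        start = float(p.attrib["t"]) / 1000.0  # milliseconds -> seconds
        end = start + float(p.attrib.get("d", 0)) / 1000.0
        segments.append("{}\n{} --> {}\n{}\n".format(
            len(segments) + 1,
            self.float_to_srt_time_format(start),
            self.float_to_srt_time_format(end),
            unescape(text),
        ))
    return "\n".join(segments).strip()


def apply_patch():
    """Replace the method on pytube's Caption class; return True on success."""
    try:
        from pytube.captions import Caption
    except ImportError:
        return False  # pytube is not installed in this environment
    Caption.xml_caption_to_srt = patched_xml_caption_to_srt
    return True


print("patched:", apply_patch())
```

Call `apply_patch()` once before using `generate_srt_captions()`. The alternative is to open the `captions.py` file shown in the traceback (under `site-packages\pytube`) and paste the new function over the old one, but that edit is lost whenever pytube is reinstalled or upgraded.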

justSam13 commented 8 months ago

I edited the code very minimally to make it work. This seems to work for me. Please check for any issues.
-> Added .find("body")
-> Changed the keys from "dur" to "d" and "start" to "t"
-> Divided the time by 1000 (to get seconds from milliseconds).

def xml_caption_to_srt(self, xml_captions: str) -> str:
    """Convert xml caption tracks to "SubRip Subtitle (srt)".

    :param str xml_captions:
        XML formatted caption tracks.
    """
    segments = []
    root = ElementTree.fromstring(xml_captions).find('body')
    for i, child in enumerate(list(root)):
        text = child.text or ""
        caption = unescape(text.replace("\n", " ").replace("  ", " "),)
        try:
            duration = float(child.attrib["d"])/1000
        except KeyError:
            duration = 0.0
        start = float(child.attrib["t"])/1000
        end = start + duration
        sequence_number = i + 1  # convert from 0-indexed to 1.
        line = "{seq}\n{start} --> {end}\n{text}\n".format(
            seq=sequence_number,
            start=self.float_to_srt_time_format(start),
            end=self.float_to_srt_time_format(end),
            text=caption,
        )
        segments.append(line)
    return "\n".join(segments).strip()