jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike other Selenium-based solutions!
MIT License

get_transcript() is not working properly. #222

Closed hashimotogigantes closed 8 months ago

hashimotogigantes commented 10 months ago

video id

x1u8Ppvq13E

What code / cli command are you executing?

I am running YouTubeTranscriptApi.get_transcript

Which Python version are you using?

Python 3.11.4

Which version of youtube-transcript-api are you using?

youtube-transcript-api 0.6.1

Expected behavior

I expected to get the Japanese transcript.

Actual behaviour

Instead I received the following error message:

error message

Error: Could not retrieve a transcript for the video https://www.youtube.com/watch?v=x1u8Ppvq13E! This is most likely caused by:

No transcripts were found for any of the requested language codes: ja

For this video (x1u8Ppvq13E) transcripts are available in the following languages:

(MANUALLY CREATED) None

(GENERATED)

(TRANSLATION LANGUAGES)

Obviously, this error message is contradictory. Note that the same video is successfully summarized by YT Summarizer plugin in ChatGPT.

loonip commented 10 months ago

@hashimotogigantes Hi, I was having the same problem but I just solved it. I assume you are passing a str to get_transcript; passing a str gives you this error. Try passing the language code inside a list instead.

For example, instead of transcript_list = YouTubeTranscriptApi.get_transcript(video_id, language_code), use transcript_list = YouTubeTranscriptApi.get_transcript(video_id, [language_code])

hope it helps ;)
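
A minimal sketch of the suggestion above. The helper name `normalize_languages` is hypothetical and not part of youtube-transcript-api; it only illustrates wrapping a bare language code in a list before it is handed to `get_transcript`.

```python
def normalize_languages(languages):
    """Return language codes as a list, wrapping a bare string such as "ja".

    Hypothetical helper for illustration; not part of youtube-transcript-api.
    """
    if isinstance(languages, str):
        # A bare string is a single code, not a sequence of codes.
        return [languages]
    return list(languages)


print(normalize_languages("ja"))          # ['ja']
print(normalize_languages(["ja", "en"]))  # ['ja', 'en']
```

The result could then be passed as the second argument, for example `YouTubeTranscriptApi.get_transcript(video_id, normalize_languages(language_code))`.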

hashimotogigantes commented 10 months ago

@loonip Thank you for your comment, but in that case the error message "Error: can only concatenate str (not "list") to str" is printed.

loonip commented 10 months ago

@hashimotogigantes

These work, so I don't think it's a bug...

transcript_list = YouTubeTranscriptApi.get_transcript("x1u8Ppvq13E", {"ja"})
transcript_list = YouTubeTranscriptApi.get_transcript("x1u8Ppvq13E", ["ja"])

maybe you're doing something wrong with your language list

hashimotogigantes commented 10 months ago

@loonip In my environment, only the following code avoids the "can only concatenate str (not "list") to str" error, but does not avoid the "No transcripts were found for any of the requested language codes: ja" error.

transcript = YouTubeTranscriptApi.get_transcript(video_id,languages='ja')

hashimotogigantes commented 10 months ago

transcript = YouTubeTranscriptApi.get_transcript(video_id,'ja')

This code also avoids the "can only concatenate str (not "list") to str" error, but does not avoid the "No transcripts were found for any of the requested language codes: ja" error.
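
One plausible explanation for this behaviour, assuming the library loops over the `languages` argument and tries each code in turn: a bare string is itself iterable, so "ja" would be tried as "j" and then "a", and neither matches. The function below is a simplified stand-in written for this illustration, not the library's actual code, and the `available` set is made up.

```python
def find_first_match(available, language_codes):
    """Return the first requested code that is available, else None.

    Simplified stand-in for a lookup that tries each requested code in order.
    """
    for code in language_codes:
        if code in available:
            return code
    return None


available = {"ja"}  # assumed: the video has a Japanese transcript

print(find_first_match(available, "ja"))    # None: the loop sees "j", then "a"
print(find_first_match(available, ["ja"]))  # ja
```

Under that assumption, the lookup fails even though a Japanese transcript exists, which would be consistent with an error message that names `ja` as the requested code.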

jdepoix commented 10 months ago

@hashimotogigantes Could you please provide the exact code you are executing? I cannot replicate your error. As it does for @loonip, YouTubeTranscriptApi.get_transcript("x1u8Ppvq13E", ["ja"]) works as expected for me.

hashimotogigantes commented 10 months ago

Thank you. Here is my code.

import re
import math
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound, TranscriptsDisabled

from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.summarize import load_summarize_chain

from pydantic import BaseModel

class Summarizer:
    def __init__(self, openai_api_key=None, vectorstore=None):
        # Define the language model
        self.llm4 = ChatOpenAI(openai_api_key=openai_api_key, temperature=0, model='gpt-4')
        self.llm35 = ChatOpenAI(openai_api_key=openai_api_key, temperature=0, model='gpt-3.5-turbo')
        self.llm3 = OpenAI(openai_api_key=openai_api_key, temperature=0)
        self.vectorstore = vectorstore or self.init_vectorstore(openai_api_key)
        self.qa = ConversationalRetrievalChain.from_llm(self.llm4, self.vectorstore.as_retriever(), get_chat_history=self.get_chat_history, return_source_documents=True, condense_question_llm = self.llm35)

    def init_vectorstore(self, openai_api_key):
        embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
        return Chroma("langchain_store", embeddings)

    @staticmethod
    def get_chat_history(messages) -> str:
        """
        Custom function for ConversationalRetrievalChain.from_llm.
        It converts chat history to a string format.
        """
        chat_hist = [f"{m['role'].capitalize()}:{m['content']}" for m in messages if m['role'] in ('assistant', 'user')]

        return "\n".join(chat_hist)

    @staticmethod
    def extract_youtube_ids(s):
        """
        Extracts youtube video ids from a string using regex.
        """
        youtube_regex = (
            r'(https?://)?(www\.)?'
            '(youtube\.com/watch\?v=|youtu\.be/)'
            '([^&=%\?]{11})'
        )
        return [match[3] for match in re.findall(youtube_regex, s)]

    def retrieve_video(self, video_id):
        transcript = YouTubeTranscriptApi.get_transcript(video_id, 'ja')
        return {'transcript': transcript, 'video_id': video_id}

    def chunkify_transcript(self, video, chunk_size=50, overlap=5):
        input_transcript = video['transcript']
        print ("input_transcript:" + input_transcript)

        transcript_len = len(input_transcript)
        splits = range(0, transcript_len, chunk_size - overlap)

        new_transcript = [
            {
                'text': ' '.join([input_transcript[i]['text'] for i in range(index, min(index + chunk_size, transcript_len))]),
                'start': input_transcript[index]['start'],
                'video_id': video['video_id']
            } for index in splits
        ]

        return new_transcript

    def append_vectorstore(self, transcript):

        texts = [t['text'] for t in transcript]
        metadatas = [{'start': math.floor(t['start']), 'video_id': t['video_id']} for t in transcript]

        self.vectorstore.add_texts(texts, metadatas=metadatas)

    def add_video(self, video_id):

        video = self.retrieve_video(video_id)
        transcript = self.chunkify_transcript(video)

        print ("transcript3:" + transcript)

        self.append_vectorstore(transcript)

        return self.summarize_video(transcript)

    def summarize_video(self, transcript_pieces):

        docs = [Document(page_content=t["text"].strip(" ")) for t in transcript_pieces]
        chain = load_summarize_chain(self.llm3, chain_type="map_reduce")
        summary = chain.run(docs)

        metadatas = [{'start': 'TEST', 'video_id': transcript_pieces[0]['video_id']}]

        self.vectorstore.add_texts([summary], metadatas=metadatas) # Add summary to vectorstore

        return summary

    def new_query(self, messages):

        if messages[-1]['role'] != 'user':
            raise ValueError('Last message must be by the user.')

        query = messages[-1]['content']
        chat_history = messages[:-1]

        video_ids = self.extract_youtube_ids(query)

        if video_ids:
            try:
                result = {'answer': f'I just watched that video. Feel free to ask me questions about it. Here is a summary:\n\n{self.add_video(video_ids[0])}'}
            except (NoTranscriptFound, TranscriptsDisabled):
                result = {'answer': f'I cannot find a transcript for {video_ids[0]}. Try another video.'}
        else:
            result = self.qa({"question": query, "chat_history": chat_history})

        return result

import sys

def main():
    # Initialize the summarizer
    summarizer = Summarizer(openai_api_key="sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")  #You need to replace this with your OpenAI API Key

    # Get YouTube video URL from user
    video_url = input("Please enter the YouTube video URL: ")

    # Extract video IDs from the URL
    video_ids = summarizer.extract_youtube_ids(video_url)

    if not video_ids:
        print("Invalid YouTube video URL.")
        sys.exit()

    # Summarize the video
    try:
        summary = summarizer.add_video(video_ids[0])
        print("\nSummary:\n", summary)
    except Exception as error:
        print(f"Error: {error}")

if __name__ == "__main__":
    main()

loonip commented 10 months ago

@hashimotogigantes Your retrieve_video() worked just fine after I applied the change suggested previously, YouTubeTranscriptApi.get_transcript(video_id, ["ja"]), so I don't think the problem comes from get_transcript().

Note: you should carefully double-check where can only concatenate str (not "list") to str is coming from. It seems you are mixing lists and strings in the outputs on the way to putting them into the vectorstore, so if I were you I'd check that part again. (It's off topic here, so I'll stop.)

good luck!

hashimotogigantes commented 10 months ago

I have tried the code you suggested at least 10 times. Today I tried again and got the same error. Perhaps this is caused by environmental differences. I am running in Visual Studio 1.81.1. Here are the versions of the modules.

openai==0.27.9 youtube-transcript-api==0.6.1 pytube==15.0.0 tiktoken==0.4.0

jdepoix commented 9 months ago

@hashimotogigantes The code you provided still contains YouTubeTranscriptApi.get_transcript(video_id, 'ja'). If you change it to YouTubeTranscriptApi.get_transcript(video_id, ['ja']) I am pretty sure that it will work. As @loonip pointed out I am also very sure that can only concatenate str (not "list") to str is not caused by this module. Please check the stacktrace to see where and why the error is actually thrown.
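
The posted code contains `print ("input_transcript:" + input_transcript)` in `chunkify_transcript`, where `input_transcript` is the list returned by `get_transcript`. The sketch below shows how that line on its own could produce the reported message once a transcript is actually fetched. The sample data and both function names are made up for illustration; only the concatenation pattern is taken from the posted code.

```python
import traceback

# Made-up sample in the shape the posted code indexes: a list of dicts.
sample_transcript = [{"text": "こんにちは", "start": 0.0, "duration": 1.5}]


def debug_print_broken(transcript):
    # Same pattern as the posted code: str + list raises TypeError.
    print("input_transcript:" + transcript)


def debug_print_fixed(transcript):
    # An f-string formats the list instead of concatenating it.
    print(f"input_transcript: {transcript}")


try:
    debug_print_broken(sample_transcript)
except TypeError as error:
    message = str(error)
    traceback.print_exc()  # the stack trace points at the print line

debug_print_fixed(sample_transcript)
```

Because `main()` in the posted code catches every exception and prints only `Error: {error}`, the stack trace is hidden; printing it, as above, shows which line raised.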

jdepoix commented 8 months ago

I will close this now due to inactivity.

hashimotogigantes commented 8 months ago

I tried each of the following patterns many times, and all of them failed, so I have abandoned development. I kept the pattern you mentioned in the code I posted the other day because it was the least problematic of them. I appreciate the information you have all provided, but I have given up on developing with this module.

    #transcript = YouTubeTranscriptApi.get_transcript(video_id,languages=["ja"])
    #transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['ja','en'])
    #transcript = YouTubeTranscriptApi.get_transcript(video_id, languages='ja')
    transcript = YouTubeTranscriptApi.get_transcript(video_id, "ja")
    #transcript = YouTubeTranscriptApi.get_transcript(video_id,["ja"])
    #transcript = YouTubeTranscriptApi.get_transcript("x1u8Ppvq13E",["ja"])
    #transcript = YouTubeTranscriptApi.get_transcript("x1u8Ppvq13E",{"ja"})
    #transcript = transcript_list.find_transcript("ja").fetch()