hashimotogigantes closed this issue 8 months ago
@hashimotogigantes hi, i was having the same problem but i just solved it. i'm assuming you are passing a str to get_transcript, and passing a str gives you an error. try passing the language code as a list.
so instead of using
transcript_list = YouTubeTranscriptApi.get_transcript(video_id, language_code)
use
transcript_list = YouTubeTranscriptApi.get_transcript(video_id, [language_code])
hope it helps ;)
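A possible reason the bare string fails, sketched under the assumption that the library simply loops over whatever is passed as the languages argument: iterating a str yields single characters, so "ja" would be looked up as "j" and then "a". The function find_first_match below is a hypothetical stand-in for that lookup, not the library's own code.

```python
def find_first_match(available, languages):
    """Hypothetical stand-in for a language lookup: return the first
    requested code that exists in `available`, else None."""
    for code in languages:  # a str is iterated character by character
        if code in available:
            return code
    return None


available = {"ja", "en"}

as_list = find_first_match(available, ["ja"])  # looks up "ja"
as_string = find_first_match(available, "ja")  # looks up "j", then "a"
```

Under that assumption the list form finds the transcript, while the string form never asks for "ja" as a whole.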
@loonip Thank you for your comment, but in that case the error message "Error: can only concatenate str (not "list") to str" is printed.
@hashimotogigantes
these work, so i don't think it's a bug...
transcript_list = YouTubeTranscriptApi.get_transcript("x1u8Ppvq13E", {"ja"})
transcript_list = YouTubeTranscriptApi.get_transcript("x1u8Ppvq13E", ["ja"])
maybe you're doing something wrong with your language list
@loonip In my environment, only the following two calls avoid the "can only concatenate str (not "list") to str" error, but neither of them avoids the "No transcripts were found for any of the requested language codes: ja" error.
transcript = YouTubeTranscriptApi.get_transcript(video_id,languages='ja')
transcript = YouTubeTranscriptApi.get_transcript(video_id,'ja')
@hashimotogigantes could you please provide the exact code you are executing? I cannot replicate your error. Like @loonip, I find that YouTubeTranscriptApi.get_transcript("x1u8Ppvq13E", ["ja"]) works as expected for me.
Thank you. Here is my code.
import re
import math
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound, TranscriptsDisabled

from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.summarize import load_summarize_chain

from pydantic import BaseModel


class Summarizer:
    def __init__(self, openai_api_key=None, vectorstore=None):
        self.llm4 = ChatOpenAI(openai_api_key=openai_api_key, temperature=0, model='gpt-4')
        self.llm35 = ChatOpenAI(openai_api_key=openai_api_key, temperature=0, model='gpt-3.5-turbo')
        self.llm3 = OpenAI(openai_api_key=openai_api_key, temperature=0)
        self.vectorstore = vectorstore or self.init_vectorstore(openai_api_key)
        self.qa = ConversationalRetrievalChain.from_llm(self.llm4, self.vectorstore.as_retriever(), get_chat_history=self.get_chat_history, return_source_documents=True, condense_question_llm = self.llm35)

    def init_vectorstore(self, openai_api_key):
        embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
        return Chroma("langchain_store", embeddings)

    @staticmethod
    def get_chat_history(messages) -> str:
        """
        Custom function for ConversationalRetrievalChain.from_llm.
        It converts chat history to a string format.
        """
        chat_hist = [f"{m['role'].capitalize()}:{m['content']}" for m in messages if m['role'] in ('assistant', 'user')]
        return "\n".join(chat_hist)

    @staticmethod
    def extract_youtube_ids(s):
        """
        Extracts youtube video ids from a string using regex.
        """
        youtube_regex = (
            r'(https?://)?(www\.)?'
            '(youtube\.com/watch\?v=|youtu\.be/)'
            '([^&=%\?]{11})'
        )
        return [match[3] for match in re.findall(youtube_regex, s)]

    def retrieve_video(self, video_id):
        transcript = YouTubeTranscriptApi.get_transcript(video_id, 'ja')
        return {'transcript': transcript, 'video_id': video_id}

    def chunkify_transcript(self, video, chunk_size=50, overlap=5):
        input_transcript = video['transcript']
        print ("input_transcript:" + input_transcript)
        transcript_len = len(input_transcript)
        splits = range(0, transcript_len, chunk_size - overlap)
        new_transcript = [
            {
                'text': ' '.join([input_transcript[i]['text'] for i in range(index, min(index + chunk_size, transcript_len))]),
                'start': input_transcript[index]['start'],
                'video_id': video['video_id']
            } for index in splits
        ]
        return new_transcript

    def append_vectorstore(self, transcript):
        texts = [t['text'] for t in transcript]
        metadatas = [{'start': math.floor(t['start']), 'video_id': t['video_id']} for t in transcript]
        self.vectorstore.add_texts(texts, metadatas=metadatas)

    def add_video(self, video_id):
        video = self.retrieve_video(video_id)
        transcript = self.chunkify_transcript(video)
        print ("transcript3:" + transcript)
        self.append_vectorstore(transcript)
        return self.summarize_video(transcript)

    def summarize_video(self, transcript_pieces):
        docs = [Document(page_content=t["text"].strip(" ")) for t in transcript_pieces]
        chain = load_summarize_chain(self.llm3, chain_type="map_reduce")
        summary = chain.run(docs)
        metadatas = [{'start': 'TEST', 'video_id': transcript_pieces[0]['video_id']}]
        self.vectorstore.add_texts([summary], metadatas=metadatas)  # Add summary to vectorstore
        return summary

    def new_query(self, messages):
        if messages[-1]['role'] != 'user':
            raise ValueError('Last message must be by the user.')
        query = messages[-1]['content']
        chat_history = messages[:-1]
        video_ids = self.extract_youtube_ids(query)
        if video_ids:
            try:
                result = {'answer': f'I just watched that video. Feel free to ask me questions about it. Here is a summary:\n\n{self.add_video(video_ids[0])}'}
            except (NoTranscriptFound, TranscriptsDisabled):
                result = {'answer': f'I cannot find a transcript for {video_ids[0]}. Try another video.'}
        else:
            result = self.qa({"question": query, "chat_history": chat_history})
        return result


import sys


def main():
    summarizer = Summarizer(openai_api_key="sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")  # You need to replace this with your OpenAI API Key
    # Get YouTube video URL from user
    video_url = input("Please enter the YouTube video URL: ")
    # Extract video IDs from the URL
    video_ids = summarizer.extract_youtube_ids(video_url)
    if not video_ids:
        print("Invalid YouTube video URL.")
        sys.exit()
    # Summarize the video
    try:
        summary = summarizer.add_video(video_ids[0])
        print("\nSummary:\n", summary)
    except Exception as error:
        print(f"Error: {error}")


if __name__ == "__main__":
    main()
@hashimotogigantes
your retrieve_video() worked just fine after i applied the change previously suggested, YouTubeTranscriptApi.get_transcript(video_id, ["ja"]), so i don't think it's from get_transcript().
note:
you should carefully double check where can only concatenate str (not "list") to str is coming from.
seems like you are mixing up strings and lists in the outputs when you try to put them into the vectorstore, so if i were you i'd check that part again. (it's getting off topic so i'll stop here)
good luck!
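for what it's worth, here is a minimal sketch of how that message can come up without get_transcript being involved at all. it assumes the transcript is a list of dicts, which is what the print calls in chunkify_transcript and add_video receive. the sample data is made up.

```python
# Made-up sample in the shape of a transcript: a list of dicts.
transcript = [{"text": "こんにちは", "start": 0.0, "duration": 1.5}]

try:
    # str + list is not allowed, so this raises TypeError.
    print("input_transcript:" + transcript)
except TypeError as error:
    message = str(error)

# An f-string (or print with two arguments) formats the list instead.
safe_line = f"input_transcript: {transcript}"
```

if that is the cause, switching the debug prints to f-strings should make the error go away once the language list is passed correctly.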
I have tried the code you suggested at least 10 times. Today I tried again and got the same error. Perhaps this is caused by environmental differences. I am running in Visual Studio 1.81.1. Here are the versions of the modules.
openai==0.27.9
youtube-transcript-api==0.6.1
pytube==15.0.0
tiktoken==0.4.0
@hashimotogigantes The code you provided still contains YouTubeTranscriptApi.get_transcript(video_id, 'ja'). If you change it to YouTubeTranscriptApi.get_transcript(video_id, ['ja']) I am pretty sure that it will work. As @loonip pointed out, I am also very sure that can only concatenate str (not "list") to str is not caused by this module. Please check the stacktrace to see where and why the error is actually thrown.
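One way to get at the stack trace, sketched on the assumption that the broad except Exception in main() is what hides it: format the full traceback instead of printing only the message. The function failing_step is a hypothetical placeholder for the real call.

```python
import traceback


def failing_step():
    """Hypothetical placeholder that fails the same way str + list does."""
    return "transcript3:" + [{"text": "..."}]


def run_with_traceback(step):
    """Run `step`; on failure return the full traceback text instead of
    only the error message."""
    try:
        step()
        return None
    except Exception:
        return traceback.format_exc()


report = run_with_traceback(failing_step)
print(report)
```

The printed report names the file, line and function that raised, which should show whether the error originates inside the library or in the calling code.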
I will close this now due to inactivity.
I tried each of the following patterns many times and all of them failed, so I have abandoned the development. The code I posted the other day contained the pattern you mentioned because it was the least problematic of them. I appreciate the information you all have provided, but I have given up on developing with this module.
#transcript = YouTubeTranscriptApi.get_transcript(video_id,languages=["ja"])
#transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['ja','en'])
#transcript = YouTubeTranscriptApi.get_transcript(video_id, languages='ja')
transcript = YouTubeTranscriptApi.get_transcript(video_id, "ja")
#transcript = YouTubeTranscriptApi.get_transcript(video_id,["ja"])
#transcript = YouTubeTranscriptApi.get_transcript("x1u8Ppvq13E",["ja"])
#transcript = YouTubeTranscriptApi.get_transcript("x1u8Ppvq13E",{"ja"})
#transcript = transcript_list.find_transcript("ja").fetch()
video id
x1u8Ppvq13E
What code / cli command are you executing?
I am running YouTubeTranscriptApi.get_transcript
Which Python version are you using?
Python 3.11.4
Which version of youtube-transcript-api are you using?
youtube-transcript-api 0.6.1
Expected behavior
I expected to get the Japanese transcript.
Actual behaviour
Instead I received the following error message:
error message
Error: Could not retrieve a transcript for the video https://www.youtube.com/watch?v=x1u8Ppvq13E! This is most likely caused by:
No transcripts were found for any of the requested language codes: ja
For this video (x1u8Ppvq13E) transcripts are available in the following languages:
(MANUALLY CREATED) None
(GENERATED)
(TRANSLATION LANGUAGES)
Obviously, this error message is contradictory. Note that the same video is successfully summarized by YT Summarizer plugin in ChatGPT.
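To check what the library itself sees for a video, a sketch along these lines might help. It assumes youtube-transcript-api 0.6.x, where YouTubeTranscriptApi.list_transcripts returns an iterable of transcript objects with language_code and is_generated attributes. The helper names describe and list_available are hypothetical, and the import is deferred so the formatting helper can be tried offline.

```python
def describe(transcripts):
    """Format one line per transcript: language code plus its origin."""
    return [
        f"{t.language_code} ({'generated' if t.is_generated else 'manual'})"
        for t in transcripts
    ]


def list_available(video_id):
    """Fetch and describe the transcripts of one video (needs network)."""
    # Deferred import: assumes youtube-transcript-api 0.6.x is installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    return describe(YouTubeTranscriptApi.list_transcripts(video_id))


# Usage (not run here): print(list_available("x1u8Ppvq13E"))
```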