langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Langchain YouTube audio loader duplicating transcripts #22671

Open Kubululo opened 1 month ago

Kubululo commented 1 month ago


Example Code

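import os

from langchain_community.document_loaders.blob_loaders.youtube_audio import (
    YoutubeAudioLoader,
)
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser

# Assumption: `key` was undefined in the original snippet; read it from the environment.
key = os.environ['OPENAI_API_KEY']

youtube_url = 'https://youtu.be/RXQ5AtjUMAw'

# Download the audio into ./videos, then transcribe it with Whisper.
loader = GenericLoader(
    YoutubeAudioLoader(
        [youtube_url],
        './videos'
    ),
    OpenAIWhisperParser(
        api_key=key,
        language='en'
    )
)

loader.load()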

Error Message and Stack Trace (if applicable)

Transcribing part 1!
Transcribing part 2!
Transcribing part 1!
Transcribing part 1!
Transcribing part 3!
Transcribing part 3!
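
To make the repetition in the log above easier to see, the part numbers can be tallied. This is only a minimal sketch, assuming the progress lines have exactly the form shown above; `count_parts` is a hypothetical helper for illustration, not part of LangChain.

```python
import re
from collections import Counter

# The progress messages copied from the report above.
LOG = """\
Transcribing part 1!
Transcribing part 2!
Transcribing part 1!
Transcribing part 1!
Transcribing part 3!
Transcribing part 3!
"""


def count_parts(log: str) -> Counter:
    """Count how many times each 'Transcribing part N!' line appears."""
    return Counter(int(n) for n in re.findall(r"Transcribing part (\d+)!", log))


counts = count_parts(LOG)
# Keep only the parts that were announced more than once.
repeated = {part: n for part, n in counts.items() if n > 1}
print(repeated)  # {1: 3, 3: 2}
```

Here part 1 is announced three times and part 3 twice, while part 2 appears once.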

Description

image

System Info

Python 3.11.9 inside a PyCharm venv

langchain==0.2.1
langchain-community==0.2.1
langchain-core==0.2.1
langchain-openai==0.1.7
langchain-text-splitters==0.2.0
langgraph==0.0.55
langsmith==0.1.63

pow3rpi commented 1 month ago

Hi there,

I ran your code in PyCharm with the same Python version and the same library versions, but I couldn't reproduce the issue.

Here is my full code:

from langchain_community.document_loaders.blob_loaders.youtube_audio import (
    YoutubeAudioLoader,
)
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser

OPENAI_API_KEY = '...'

youtube_url = 'https://youtu.be/RXQ5AtjUMAw'

loader = GenericLoader(
    YoutubeAudioLoader(
        [youtube_url],
        './videos'
    ),
    OpenAIWhisperParser(
        api_key=OPENAI_API_KEY,
        language='fr'
    )
)

docs = loader.load()
print(docs)

Here is my full list of dependencies. Note that pydub and yt_dlp need to be installed in addition to the LangChain packages:

aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
attrs==23.2.0
Brotli==1.1.0
certifi==2024.6.2
charset-normalizer==3.3.2
dataclasses-json==0.6.7
distro==1.9.0
frozenlist==1.4.1
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.5
httpx==0.27.0
idna==3.7
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.2.1
langchain-community==0.2.1
langchain-core==0.2.1
langchain-openai==0.1.7
langchain-text-splitters==0.2.0
langgraph==0.0.55
langsmith==0.1.63
marshmallow==3.21.3
multidict==6.0.5
mutagen==1.47.0
mypy-extensions==1.0.0
numpy==1.26.4
openai==1.34.0
orjson==3.10.5
packaging==23.2
pycryptodomex==3.20.0
pydantic==2.7.4
pydantic_core==2.18.4
pydub==0.25.1
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.3
sniffio==1.3.1
SQLAlchemy==2.0.30
tenacity==8.3.0
tiktoken==0.7.0
tqdm==4.66.4
typing-inspect==0.9.0
typing_extensions==4.12.2
urllib3==2.2.1
uuid6==2024.1.12
websockets==12.0
yarl==1.9.4
yt-dlp==2024.5.27

Also, I ran the code several times with different values for the language parameter, and I got the same output every time.

Kubululo commented 4 weeks ago

I've observed that this only happens with videos longer than 60 minutes, so maybe that can help with diagnosis.
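
One way to check whether the duplication also shows up in the returned documents, and not only in the progress messages, is to group the documents by their text. The following is a rough sketch, assuming each returned item exposes its text as `page_content` (as LangChain `Document` objects do); `FakeDoc` and `find_duplicates` are hypothetical names used here so the example runs without downloading or transcribing anything.

```python
from collections import defaultdict
from dataclasses import dataclass
from hashlib import sha256


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a loaded document; only the text matters here."""
    page_content: str


def find_duplicates(docs):
    """Return groups of indices whose documents have identical text."""
    groups = defaultdict(list)
    for index, doc in enumerate(docs):
        digest = sha256(doc.page_content.encode("utf-8")).hexdigest()
        groups[digest].append(index)
    # Only groups with more than one member are duplicates.
    return [indices for indices in groups.values() if len(indices) > 1]


sample = [FakeDoc("intro"), FakeDoc("middle"), FakeDoc("intro"), FakeDoc("end")]
print(find_duplicates(sample))  # [[0, 2]]
```

With real data, the same function could be called on the list returned by `loader.load()`. An exact-match check is strict: if the same audio is transcribed twice, the two transcripts may differ slightly and would not be grouped together.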