langchain.document_loaders.generic GenericLoader not working on Azure OpenAI - InvalidRequestError: Resource Not Found, cannot detect declared resource

marielaquino commented 1 year ago

System Info

langchain=0.0.225, python=3.9.17, openai=0.27.8. openai.api_type = "azure", openai.api_version = "2023-05-15". The api_base, api_key, and deployment_name environment variables are all configured.

Who can help?

No response

Information

[X] The official example notebooks/scripts
[X] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[X] Document Loaders
[ ] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

Steps to reproduce the behavior:

Note: This code is taken directly from the document loaders chapter of the LangChain "Chat with Your Data" course with Harrison Chase and Andrew Ng. It downloads the audio of a public YouTube video and generates a transcript.

In a Jupyter notebook, configure your Azure OpenAI environment variables and add this code:

from langchain.document_loaders.generic import GenericLoader 
from langchain.document_loaders.parsers import OpenAIWhisperParser 
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

Create and run a new cell with this inside:

url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url],save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

At the transcribing step, it fails with an InvalidRequestError.

The following steps execute successfully before the failure:

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.76MiB
[ExtractAudio] Not converting audio docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a
Transcribing part 1!

InvalidRequestError                       Traceback (most recent call last)
Cell In[14], line 8
      3 save_dir="docs/youtube/"
      4 loader = GenericLoader(
      5     YoutubeAudioLoader([url],save_dir),
      6     OpenAIWhisperParser()
      7 )
----> 8 docs = loader.load()

File /usr/local/lib/python3.9/site-packages/langchain/document_loaders/generic.py:90, in GenericLoader.load(self)
     88 def load(self) -> List[Document]:
     89     """Load all documents."""
---> 90     return list(self.lazy_load())

File /usr/local/lib/python3.9/site-packages/langchain/document_loaders/generic.py:86, in GenericLoader.lazy_load(self)
     84 """Load documents lazily. Use this when working at a large scale."""
     85 for blob in self.blob_loader.yield_blobs():
---> 86     yield from self.blob_parser.lazy_parse(blob)

File /usr/local/lib/python3.9/site-packages/langchain/document_loaders/parsers/audio.py:51, in OpenAIWhisperParser.lazy_parse(self, blob)
     49 # Transcribe
     50 print(f"Transcribing part {split_number+1}!")
---> 51 transcript = openai.Audio.transcribe("whisper-1", file_obj)
     53 yield Document(
     54     page_content=transcript.text,
     55     metadata={"source": blob.source, "chunk": split_number},
     56 )

File /usr/local/lib/python3.9/site-packages/openai/api_resources/audio.py:65, in Audio.transcribe(cls, model, file, api_key, api_base, api_type, api_version, organization, **params)
     55 requestor, files, data = cls._prepare_request(
     56     file=file,
     57     filename=file.name,
   (...)
     62     **params,
     63 )
     64 url = cls._get_url("transcriptions")
---> 65 response, _, api_key = requestor.request("post", url, files=files, params=data)
     66 return util.convert_to_openai_object(
     67     response, api_key, api_version, organization
     68 )

File /usr/local/lib/python3.9/site-packages/openai/api_requestor.py:298, in APIRequestor.request(self, method, url, params, headers, files, stream, request_id, request_timeout)
    277 def request(
    278     self,
    279     method,
   (...)
    286     request_timeout: Optional[Union[float, Tuple[float, float]]] = None,
    287 ) -> Tuple[Union[OpenAIResponse, Iterator[OpenAIResponse]], bool, str]:
    288     result = self.request_raw(
    289         method.lower(),
    290         url,
   (...)
    296         request_timeout=request_timeout,
    297     )
--> 298     resp, got_stream = self._interpret_response(result, stream)
    299     return resp, got_stream, self.api_key

File /usr/local/lib/python3.9/site-packages/openai/api_requestor.py:700, in APIRequestor._interpret_response(self, result, stream)
    692     return (
    693         self._interpret_response_line(
    694             line, result.status_code, result.headers, stream=True
    695         )
    696         for line in parse_stream(result.iter_lines())
    697     ), True
    698 else:
    699     return (
--> 700         self._interpret_response_line(
    701             result.content.decode("utf-8"),
    702             result.status_code,
    703             result.headers,
    704             stream=False,
    705         ),
    706         False,
    707     )

File /usr/local/lib/python3.9/site-packages/openai/api_requestor.py:763, in APIRequestor._interpret_response_line(self, rbody, rcode, rheaders, stream)
    761 stream_error = stream and "error" in resp.data
    762 if stream_error or not 200 <= rcode < 300:
--> 763     raise self.handle_error_response(
    764         rbody, rcode, resp.data, rheaders, stream_error=stream_error
    765     )
    766 return resp

InvalidRequestError: Resource not found

Usually, a "Resource not found" error message tells you to supply api_key or deployment_name. I'm not sure how to act on that here, as none of the Loader methods accept these as parameters.
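One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: the traceback shows the parser calling openai.Audio.transcribe("whisper-1", file_obj) without any deployment argument, while Azure OpenAI's REST endpoints are scoped to a named deployment. The sketch below only builds and prints the two URL shapes so they can be compared. It makes no network calls. The helper names, the resource host, the deployment name, and the API version value are all hypothetical placeholders, not taken from this report.

```python
from urllib.parse import urlencode


def openai_style_url(api_base: str) -> str:
    """URL shape with no deployment segment: <base>/audio/transcriptions."""
    return f"{api_base.rstrip('/')}/audio/transcriptions"


def azure_style_url(api_base: str, deployment_name: str, api_version: str) -> str:
    """Deployment-scoped URL shape in the style Azure OpenAI documents."""
    query = urlencode({"api-version": api_version})
    return (
        f"{api_base.rstrip('/')}/openai/deployments/{deployment_name}"
        f"/audio/transcriptions?{query}"
    )


# Hypothetical values, for illustration only.
base = "https://example-resource.openai.azure.com"
print(openai_style_url(base))
print(azure_style_url(base, "my-whisper-deployment", "2023-09-01-preview"))
```

If the request really is sent to the first shape against an Azure resource, a 404 "Resource not found" would be a plausible outcome, since no deployment is named in the path. Whether openai=0.27.8 builds the URL this way for Azure has not been verified here.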

Expected behavior

The expected behavior is for all four parts of the transcription to finish and be loaded as documents in the docs variable.

Daily-Prophet commented 1 year ago

Same issue here. Is this because Azure OpenAI Studio has no Whisper model deployment?

dosubot[bot] commented 1 year ago

Hi, @marielaquino! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported is related to the GenericLoader not working on Azure OpenAI. It seems to be throwing an InvalidRequestError: Resource Not Found when trying to transcribe an audio file. You mentioned that you are unsure why this error is occurring, as none of the Loader methods take in the api_key or deployment_name as parameters. Another user, @Daily-Prophet, has also experienced the same issue and wonders if it is because Azure OpenAI Studio has no whisper model deployment.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation. We look forward to hearing from you soon.

sergiofgonzalez commented 1 year ago

I confirm that this issue still occurs on langchain v0.0.320: you still get the error "Attempt 1 failed. Exception: Resource not found" when you try to use the GenericLoader with Azure OpenAI credentials, as in the snippet below:

loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)

dosubot[bot] commented 1 year ago

@baskaryan Could you please help @marielaquino with this issue? They have confirmed that the issue is still occurring in the latest version of LangChain (v0.0.320). They are getting the error Attempt 1 failed. Exception: Resource not found when using the GenericLoader with Azure OpenAI credentials. Thank you!

dosubot[bot] commented 9 months ago

Hi, @marielaquino,

I'm helping the LangChain team manage their backlog and am marking this issue as stale.

It seems like the issue you reported regarding the GenericLoader not working on Azure OpenAI, resulting in an "InvalidRequestError: Resource Not Found" when attempting to transcribe an audio file from a public YouTube video, is still unresolved. Another user, @Daily-Prophet, also experienced the same issue and questioned if it's due to Azure OpenAI Studio lacking a whisper model deployment. The issue has been confirmed to persist in the latest version of the LangChain repository (v0.0.320) by @sergiofgonzalez. The LangChain team has been asked to assist with resolving the issue.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation. If you have any further questions or updates, feel free to reach out.
