run-llama / llama_index


[Bug]: AsyncWebPageReader states that pdf-urls or urls redirecting to pdfs are not valid #15625

Open lucas-trueffles opened 2 weeks ago

lucas-trueffles commented 2 weeks ago

Bug Description

When using the aload_data method of the AsyncWebPageReader, if one of the URLs is a direct link to a PDF or redirects to a PDF, the method throws an exception stating that "One of the inputs is not a valid url".

The same URLs seem to work with the load_data method of the SimpleWebPageReader.

I am using Python 3.12 and llama-index-readers-web version 0.2.1.

Version

0.11.1

Steps to Reproduce

Execute this code:

from llama_index.readers.web import AsyncWebPageReader, SimpleWebPageReader
import asyncio

async def test_async():
    async_reader = AsyncWebPageReader()
    docs = await async_reader.aload_data(urls=[
        "https://docs.llamaindex.ai/en/stable/api_reference/readers/web/#llama_index.readers.web.AsyncWebPageReader",
        "https://www.llamaindex.ai/files/privacy-notice.pdf"
    ])

def test():
    simple_reader = SimpleWebPageReader()
    docs = simple_reader.load_data(urls=[
        "https://docs.llamaindex.ai/en/stable/api_reference/readers/web/#llama_index.readers.web.AsyncWebPageReader",
        "https://www.llamaindex.ai/files/privacy-notice.pdf"
    ])

# Works
test()

# Does not work. Error: "ValueError: One of the inputs is not a valid url: https://www.llamaindex.ai/files/privacy-notice.pdf"
asyncio.run(test_async())

Relevant Logs/Tracebacks

Traceback (most recent call last):
  File "llamatest/main.py", line 21, in <module>
    asyncio.run(test_async())
  File "/usr/lib64/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "llamatest/main.py", line 7, in test_async
    docs = await async_reader.aload_data(urls=[
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "llamatest/venv/lib64/python3.12/site-packages/llama_index/readers/web/async_web/base.py", line 88, in aload_data
    raise ValueError(f"One of the inputs is not a valid url: {urls[i]}")
ValueError: One of the inputs is not a valid url: https://www.llamaindex.ai/files/privacy-notice.pdf

dosubot[bot] commented 2 weeks ago

Hi @lucas-trueffles! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing.

The aload_data method of AsyncWebPageReader raises "One of the inputs is not a valid url" for a direct URL to a .pdf file, or a URL redirecting to one, because it expects each response to be a tuple containing the response object and the raw page content as text. When the URL points to a .pdf file, the response content is not text, so that entry fails the type check and the exception is raised.

Here is the relevant part of the code:

for i, response_tuple in enumerate(responses):
    if not isinstance(response_tuple, tuple):
        raise ValueError(f"One of the inputs is not a valid url: {urls[i]}")
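
One way such a check can trip on a perfectly valid URL is sketched below. It assumes, without confirmation from the library source, that the fetches are collected with asyncio.gather(..., return_exceptions=True) and that decoding a binary body as text raises. fetch_as_text and gather_all are hypothetical stand-ins, not llama_index functions, and no network access is involved.

```python
import asyncio


async def fetch_as_text(body: bytes) -> tuple[str, str]:
    # Hypothetical stand-in for an HTTP fetch that decodes the body as text.
    return ("response", body.decode("utf-8"))


async def gather_all(bodies: list[bytes]) -> list:
    # Assumption: results are gathered with return_exceptions=True, so a
    # failed fetch shows up as an exception object instead of a tuple.
    tasks = (fetch_as_text(body) for body in bodies)
    return await asyncio.gather(*tasks, return_exceptions=True)


html_body = b"<html><body>hello</body></html>"
# PDF files start with "%PDF-" and usually contain bytes that are not valid UTF-8.
pdf_body = b"%PDF-1.7\n\xe2\xe3\xcf\xd3\n"

results = asyncio.run(gather_all([html_body, pdf_body]))
for result in results:
    # The HTML entry is a tuple; the PDF entry is a decoding exception.
    print(type(result).__name__, isinstance(result, tuple))
```

If the reader behaves this way, the error message is misleading: the URL is valid, and the real failure is the exception object that the isinstance check discards.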

In contrast, the load_data method of SimpleWebPageReader might not apply such a strict type check, or might handle non-text responses differently, which would explain why it works with .pdf URLs [1].
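
Handling non-text responses differently first requires recognising them. Since a redirecting URL may not end in .pdf, a check on the response itself is more reliable than a check on the URL. The following is a minimal sketch of such a check; looks_like_pdf is a hypothetical helper, not part of llama_index, and it relies only on the "%PDF-" file signature and the application/pdf media type.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(body: bytes, content_type: str = "") -> bool:
    # Hypothetical helper: decide from the response, not from the URL.
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf" or body.startswith(PDF_MAGIC)


print(looks_like_pdf(b"%PDF-1.7\n..."))
print(looks_like_pdf(b"<html></html>", "text/html; charset=utf-8"))
```

A caller could use this to route PDF responses to a PDF parser and only decode the remaining responses as text.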
