Open lucas-trueffles opened 2 weeks ago
Hi @lucas-trueffles! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing.
The aload method of AsyncWebPageReader throws a "One of the inputs is not a valid url" exception when given a direct URL to a .pdf file or a URL redirecting to a .pdf file because the method expects the response to be a tuple containing the response object and the raw page content as text. If the URL points to a .pdf file, the response content is not text, which causes the method to fail the type check and raise the exception.
Here is the relevant part of the code:
    for i, response_tuple in enumerate(responses):
        if not isinstance(response_tuple, tuple):
            raise ValueError(f"One of the inputs is not a valid url: {urls[i]}")
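To illustrate how a non-tuple item could end up in the responses list, here is a minimal standalone sketch. It does not use the library's actual code: fetch_text, gather_responses, find_non_tuples, and the example.com URLs are hypothetical, and it assumes that fetch failures are collected with asyncio.gather(..., return_exceptions=True) and that the body is decoded as strict UTF-8. Under those assumptions, the binary PDF bytes fail to decode and the exception object takes the place of the expected tuple.

```python
import asyncio

# Hypothetical stand-ins for HTTP response bodies; no network access is needed.
FAKE_BODIES = {
    "https://example.com/page.html": b"<html><body>hello</body></html>",
    "https://example.com/file.pdf": b"%PDF-1.4\n\xe2\xe3\xcf\xd3\n",
}


async def fetch_text(url: str) -> tuple[str, str]:
    """Hypothetical fetcher: returns (url, body decoded as strict UTF-8)."""
    body = FAKE_BODIES[url]
    # Strict decoding raises UnicodeDecodeError on the binary PDF bytes.
    return url, body.decode("utf-8")


async def gather_responses(urls: list[str]) -> list:
    # Assumption: failures are collected rather than raised, so a failed
    # fetch appears in the result list as an exception object.
    return await asyncio.gather(
        *(fetch_text(u) for u in urls), return_exceptions=True
    )


def find_non_tuples(urls: list[str], responses: list) -> list:
    """Return (url, result) pairs that would fail an isinstance(..., tuple) check."""
    return [(u, r) for u, r in zip(urls, responses) if not isinstance(r, tuple)]


urls = list(FAKE_BODIES)
responses = asyncio.run(gather_responses(urls))
failures = find_non_tuples(urls, responses)
for url, err in failures:
    # Printing the exception type shows the underlying cause.
    print(f"{url}: {type(err).__name__}")
```

If the real reader behaves similarly, printing the type of the non-tuple item before raising would show the underlying error instead of the generic "not a valid url" message.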
In contrast, the load method of SimpleWebPageReader might not have such strict type checks or might handle non-text responses differently, allowing it to work fine with .pdf URLs [1].
Bug Description
When using the aload method of the AsyncWebPageReader and one of the urls is:
then the aload method will throw an exception stating that "One of the inputs is not a valid url".
It seems to work when using the load method of the SimpleWebPageReader.
I am using Python 3.12 and llama-index-readers-web version 0.2.1.
Version
0.11.1
Steps to Reproduce
Execute this code:
Relevant Logs/Tracebacks