Closed rothnic closed 6 months ago
🤖
Hello @rothnic! I'm Dosu, a friendly bot here to assist you with your LangChain issues. As we wait for a human maintainer, I'm on hand to help you debug, answer any questions you may have, or even guide you on how to contribute to the repository. Let's get started!
The "Header value is too long" error you're encountering is likely due to the website (https://www.walmart.com/shop/deals) rejecting the request because of the headers being sent. The headers are set in the AsyncHtmlLoader class's __init__ method, which uses the default_header_template or a provided header_template.
The default_header_template includes a User-Agent string, which is set to a random value from the fake_useragent library if it's installed. If the fake_useragent library isn't installed, the User-Agent string is left as an empty string. This could potentially cause issues if the website rejects requests with an empty User-Agent string.
To avoid this issue, you could try setting a specific User-Agent string that you know the website accepts. If you're not sure what User-Agent string to use, you could try using the User-Agent string of a popular web browser.
Here's how you could modify the AsyncHtmlLoader class's __init__ method to set a specific User-Agent string:
def __init__(
    self,
    web_path: Union[str, List[str]],
    header_template: Optional[dict] = None,
    verify_ssl: Optional[bool] = True,
    proxies: Optional[dict] = None,
    autoset_encoding: bool = True,
    encoding: Optional[str] = None,
    default_parser: str = "html.parser",
    requests_per_second: int = 2,
    requests_kwargs: Optional[Dict[str, Any]] = None,
    raise_for_status: bool = False,
    ignore_load_errors: bool = False,
):
    # ...
    headers = header_template or default_header_template
    # Only fill in a User-Agent when none was supplied.
    if not headers.get("User-Agent"):
        headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    # ...
When no User-Agent is already present, this sets it to the User-Agent string of Google Chrome on Windows 10. You may need to adjust this to a User-Agent string that the website accepts.
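If you'd rather not edit the library, a similar effect may be possible by passing your own header_template when constructing the loader. The sketch below only builds the header dictionary; build_header_template and BROWSER_USER_AGENT are hypothetical names introduced for illustration, and the commented usage assumes AsyncHtmlLoader accepts header_template as in the signature above.

```python
from typing import Dict, Optional

# Assumed browser-like User-Agent; swap in whichever string the target site accepts.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
)


def build_header_template(
    user_agent: str = BROWSER_USER_AGENT,
    extra: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return a fresh header dict with an explicit, non-empty User-Agent."""
    if not user_agent.strip():
        raise ValueError("user_agent must be a non-empty string")
    headers = dict(extra or {})  # copy so the caller's dict is not mutated
    headers["User-Agent"] = user_agent
    return headers


headers = build_header_template(extra={"Accept-Language": "en-US,en;q=0.5"})

# Hypothetical usage, assuming langchain is installed and AsyncHtmlLoader
# takes header_template as shown in the signature above:
#   loader = AsyncHtmlLoader("https://www.walmart.com/shop/deals", header_template=headers)
#   docs = loader.load()
```

Building a fresh dictionary also avoids mutating a shared default template, which the in-place assignment in the snippet above would do.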
As for the maximum header size that the AsyncHtmlLoader can handle, I wasn't able to find this information in the LangChain repository. It's possible that this limit is set by the aiohttp library that AsyncHtmlLoader uses, or by the server that you're sending requests to.
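One way to narrow down which side imposes the limit is to capture the headers involved with another client (for example, a browser's developer tools) and check whether any single value is unusually large. The sketch below is a plain-Python helper for that check. The name oversized_headers is hypothetical, and the 8190-byte threshold is an assumption based on my recollection of aiohttp's default per-field limit, so verify it against the version you have installed.

```python
from typing import Dict, List, Tuple

# Assumption: 8190 bytes as the per-header-value limit; confirm for your client.
ASSUMED_MAX_FIELD_SIZE = 8190


def oversized_headers(
    headers: Dict[str, str], limit: int = ASSUMED_MAX_FIELD_SIZE
) -> List[Tuple[str, int]]:
    """Return (name, size_in_bytes) for each header whose value exceeds limit."""
    report = []
    for name, value in headers.items():
        size = len(value.encode("utf-8"))
        if size > limit:
            report.append((name, size))
    return report


# Made-up sample data: one small header and one 9000-byte cookie.
sample = {"Content-Type": "text/html", "Set-Cookie": "x" * 9000}
print(oversized_headers(sample))  # [('Set-Cookie', 9000)]
```

If a response header turns out to be the oversized one, the limit is more likely on the client side. Recent aiohttp versions appear to expose a max_field_size setting on ClientSession, though I haven't verified whether AsyncHtmlLoader lets you pass it through.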
I hope this helps! If you have any other questions, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Comments on the bot's response above:
🤖
It seems like this issue requires a deeper look, @baskaryan could you please lend your expertise?
Checked other resources
Example Code
A simplified example, pulled almost straight from here, that fails on the walmart.com page for some reason.
Description
Error:
System Info
I was using this after installing the latest version of crewai, so my langchain version might not be the absolute latest.
Related Components