Header value is too long error when using AsyncHtmlLoader

rothnic commented 9 months ago

Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.

Example Code

This simplified example, pulled almost straight from here, fails on the walmart.com page for some reason.

from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

urls = ['https://www.walmart.com/shop/deals']

loader = AsyncHtmlLoader(urls)
docs = loader.load()

html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)

print(docs_transformed[0].page_content)

Description

AsyncHtmlLoader fails to load https://www.walmart.com/shop/deals, but works for the other URLs I tested.
I searched for the error, but couldn't find documentation on how to avoid the issue with AsyncHtmlLoader.
I would expect AsyncHtmlLoader never to fail to load a webpage because of a technical error; a failure would make sense only if the request were blocked in some way.

Error:

(crewai) Nicks-Macbook-Pro-4:crewai nroth$ /opt/miniconda3/envs/crewai/bin/python /Users/nroth/workspace/crewai/html2text_example.py
Fetching pages:   0%|                                                                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/langchain_community/document_loaders/async_html.py", line 206, in load
    asyncio.get_running_loop()
RuntimeError: no running event loop

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 965, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/aiohttp/streams.py", line 622, in read
    await self._waiter
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/aiohttp/client_proto.py", line 224, in data_received
    messages, upgraded, tail = self._parser.feed_data(data)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "aiohttp/_http_parser.pyx", line 557, in aiohttp._http_parser.HttpParser.feed_data
  File "aiohttp/_http_parser.pyx", line 732, in aiohttp._http_parser.cb_on_header_value
aiohttp.http_exceptions.LineTooLong: 400, message:
  Got more than 8190 bytes (9515) when reading Header value is too long.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/nroth/workspace/crewai/html2text_example.py", line 9, in <module>
    docs = loader.load()
           ^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/langchain_community/document_loaders/async_html.py", line 213, in load
    results = asyncio.run(self.fetch_all(self.web_paths))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/asyncio/base_events.py", line 684, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/langchain_community/document_loaders/async_html.py", line 189, in fetch_all
    return await tqdm_asyncio.gather(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/tqdm/asyncio.py", line 79, in gather
    res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
           ^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/asyncio/tasks.py", line 631, in _wait_for_one
    return f.result()  # May raise f.exception().
           ^^^^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
              ^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/langchain_community/document_loaders/async_html.py", line 177, in _fetch_with_rate_limit
    return await self._fetch(url)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/langchain_community/document_loaders/async_html.py", line 148, in _fetch
    async with session.get(
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/aiohttp/client.py", line 1187, in __aenter__
    self._resp = await self._coro
                 ^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/aiohttp/client.py", line 601, in _request
    await resp.start(conn)
  File "/opt/miniconda3/envs/crewai/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 967, in start
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 400, message='Got more than 8190 bytes (9515) when reading Header value is too long.', url=URL('https://www.walmart.com/shop/deals')

System Info

I was using this after installing the latest version of crewai, so my langchain version might not be the absolute latest.

python --version

Python 3.12.1

pip freeze | grep langchain

langchain==0.1.1
langchain-community==0.0.13
langchain-core==0.1.13
langchain-openai==0.0.2.post1

system_profiler SPSoftwareDataType SPHardwareDataType

Software:

    System Software Overview:

      System Version: macOS 12.5.1 (21G83)
      Kernel Version: Darwin 21.6.0
      Secure Virtual Memory: Enabled
      System Integrity Protection: Enabled

Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: MacBookPro16,1
      Processor Name: 8-Core Intel Core i9
      Processor Speed: 2.3 GHz
      Number of Processors: 1
      Total Number of Cores: 8
      L2 Cache (per Core): 256 KB
      L3 Cache: 16 MB
      Hyper-Threading Technology: Enabled
      Memory: 32 GB

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[X] Document Loaders
[ ] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[X] Async

dosubot[bot] commented 9 months ago

🤖

Hello @rothnic! I'm Dosu, a friendly bot here to assist you with your LangChain issues. As we wait for a human maintainer, I'm on hand to help you debug, answer any questions you may have, or even guide you on how to contribute to the repository. Let's get started!

The "Header value is too long" error you're encountering is likely due to the website (https://www.walmart.com/shop/deals) rejecting the request because of the headers being sent. The headers are set in the AsyncHtmlLoader class's __init__ method, where it uses the default_header_template or a provided header_template.

The default_header_template includes a User-Agent string, which is set to a random value from the fake_useragent library if it's installed. If the fake_useragent library isn't installed, the User-Agent string is left as an empty string. This could potentially cause issues if the website rejects requests with an empty User-Agent string.

To avoid this issue, you could try setting a specific User-Agent string that you know the website accepts. If you're not sure what User-Agent string to use, you could try using the User-Agent string of a popular web browser.

Here's how you could modify the AsyncHtmlLoader class's __init__ method to set a specific User-Agent string:

def __init__(
    self,
    web_path: Union[str, List[str]],
    header_template: Optional[dict] = None,
    verify_ssl: Optional[bool] = True,
    proxies: Optional[dict] = None,
    autoset_encoding: bool = True,
    encoding: Optional[str] = None,
    default_parser: str = "html.parser",
    requests_per_second: int = 2,
    requests_kwargs: Optional[Dict[str, Any]] = None,
    raise_for_status: bool = False,
    ignore_load_errors: bool = False,
):
    # ...
    # Copy first so the shared default_header_template is not mutated.
    headers = dict(header_template or default_header_template)
    if not headers.get("User-Agent"):
        headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    # ...

This falls back to the User-Agent string of Google Chrome on Windows 10 when no User-Agent is set. You may need to adjust it to one that the website accepts.
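
Rather than editing the library source, the same effect can probably be had by passing header_template to the constructor, which the signature above accepts. The following is a minimal sketch under that assumption; build_header_template and make_loader are hypothetical helper names, not part of LangChain, and the langchain_community import is deferred so the header helper can be exercised on its own.

```python
from typing import Dict, List, Optional

# Same Chrome-on-Windows string as in the snippet above; substitute any
# User-Agent you have confirmed the target site accepts.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
)


def build_header_template(
    user_agent: str = BROWSER_USER_AGENT,
    extra: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: return a fresh headers dict on every call."""
    headers = {"User-Agent": user_agent}
    headers.update(extra or {})
    return headers


def make_loader(urls: List[str]):
    """Hypothetical helper: build an AsyncHtmlLoader with explicit headers."""
    # Deferred import: only needed when a loader is actually created.
    from langchain_community.document_loaders import AsyncHtmlLoader

    return AsyncHtmlLoader(urls, header_template=build_header_template())
```

Calling make_loader(["https://www.walmart.com/shop/deals"]).load() would then send the chosen User-Agent. Note that this only changes the request headers; it does not change how the response headers are parsed.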

As for the maximum header size that the AsyncHtmlLoader can handle, I wasn't able to find this information in the LangChain repository. It's possible that this limit is set by the aiohttp library that AsyncHtmlLoader uses, or by the server that you're sending requests to.
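
The traceback in the original report shows LineTooLong being raised inside aiohttp's own response parser (cb_on_header_value), which suggests the 8190-byte limit is applied by the client while reading a response header. Recent aiohttp releases accept max_line_size and max_field_size arguments on ClientSession; check that your installed version supports them. Whether AsyncHtmlLoader offers a way to pass these through is not established in this thread, so the sketch below fetches the page with aiohttp directly. session_limits and fetch_html are hypothetical names, and the 2x headroom is an arbitrary assumption.

```python
import asyncio

# Limit reported in the error message: "Got more than 8190 bytes (9515)".
DEFAULT_MAX_FIELD_SIZE = 8190


def session_limits(observed_bytes: int, headroom: float = 2.0) -> dict:
    """Hypothetical helper: ClientSession kwargs sized for the observed header."""
    size = max(DEFAULT_MAX_FIELD_SIZE, int(observed_bytes * headroom))
    return {"max_line_size": size, "max_field_size": size}


async def fetch_html(url: str, observed_bytes: int = 9515) -> str:
    """Fetch a page with raised header limits (bypasses AsyncHtmlLoader)."""
    # Deferred import so session_limits can be used without aiohttp installed.
    import aiohttp

    async with aiohttp.ClientSession(**session_limits(observed_bytes)) as session:
        async with session.get(url) as response:
            return await response.text()


if __name__ == "__main__":
    html = asyncio.run(fetch_html("https://www.walmart.com/shop/deals"))
    print(len(html))
```

The returned HTML could then be wrapped in a Document and passed to Html2TextTransformer as in the original example. The site may still block or alter the response for other reasons.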

I hope this helps! If you have any other questions, feel free to ask.

Source Code

Sources

libs/community/langchain_community/document_loaders/async_html.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

rothnic commented 9 months ago

Comments on the bot's response above:

The issue still occurs, even with fake_useragent installed or with a known-working user agent.
If the error is due to the page rejecting the request, shouldn't the error message reflect that? I wasn't able to find this as a potential cause when searching Google for the error message that was returned.
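
On the question of whether the site rejected the request: the traceback reports ClientResponseError as raised directly from aiohttp.http_exceptions.LineTooLong, an exception that originates in aiohttp's parser. That suggests, although nothing in this thread confirms it, that the limit was hit while the client was reading the response rather than the server refusing the request. One way to check this in code is to walk the exception chain. The sketch below uses stand-in exception classes (not aiohttp's real ones) so it runs without aiohttp; iter_causes and is_client_header_limit are hypothetical helper names.

```python
from typing import Iterator, Optional


def iter_causes(exc: BaseException) -> Iterator[BaseException]:
    """Yield exc, then each exception it was raised from or while handling."""
    seen = set()
    current: Optional[BaseException] = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        yield current
        current = current.__cause__ or current.__context__


def is_client_header_limit(exc: BaseException) -> bool:
    """True if any exception in the chain is named LineTooLong."""
    return any(type(e).__name__ == "LineTooLong" for e in iter_causes(exc))


# Stand-in classes for illustration only; real code would catch
# aiohttp.ClientResponseError and find aiohttp's LineTooLong in the chain.
class LineTooLong(Exception):
    pass


class ClientResponseError(Exception):
    pass


def simulate_failure() -> None:
    """Reproduce the 'raise ... from ...' shape seen in the traceback."""
    try:
        raise LineTooLong("Got more than 8190 bytes (9515) when reading Header value is too long.")
    except LineTooLong as parse_error:
        raise ClientResponseError("400") from parse_error


if __name__ == "__main__":
    try:
        simulate_failure()
    except ClientResponseError as err:
        print(is_client_header_limit(err))
```

Matching on the class name keeps the helper independent of aiohttp's import path, at the cost of being less strict than an isinstance check.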

dosubot[bot] commented 9 months ago

🤖

It seems like this issue requires a deeper look, @baskaryan could you please lend your expertise?

langchain-ai / langchain