WebscraperTool can not bypass Cloudflare

Weaxs commented 1 month ago

Self Checks

[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

WebscraperTool cannot bypass Cloudflare's basic WAF protection.

The problem can be reproduced whether the webscraper tool is used in a workflow or in an agent.

Cloudflare JavaScript detections

2. Additional context or comments

I found a module that may fix this. It may be useful for reference:

3. Can you help us with this feature?

[X] I am interested in contributing to this feature.

dosubot[bot] commented 1 month ago

To modify the WebscraperTool to bypass Cloudflare's basic WAF protection, you can integrate the cloudscraper module into the _invoke method. Here’s how you can do it:

Install the cloudscraper module:
```
pip install cloudscraper
```
Modify the WebscraperTool class to use cloudscraper for fetching the webpage:

```
from typing import Any, Union
import cloudscraper

from core.tools.entities.tool_entities import ToolInvokeMessage
from core.tools.errors import ToolInvokeError
from core.tools.tool.builtin_tool import BuiltinTool

class WebscraperTool(BuiltinTool):
    def _invoke(self,
                user_id: str,
                tool_parameters: dict[str, Any],
                ) -> Union[ToolInvokeMessage, list[ToolInvokeMessage]]:
        """
            invoke tools
        """
        try:
            url = tool_parameters.get('url', '')
            user_agent = tool_parameters.get('user_agent', '')
            if not url:
                return self.create_text_message('Please input url')

            # Use cloudscraper to get webpage
            scraper = cloudscraper.create_scraper()
            headers = {'User-Agent': user_agent} if user_agent else {}
            result = scraper.get(url, headers=headers).text

            if tool_parameters.get('generate_summary'):
                # summarize and return
                return self.create_text_message(self.summary(user_id=user_id, content=result))
            else:
                # return full webpage
                return self.create_text_message(result)
        except Exception as e:
            raise ToolInvokeError(str(e))
```

This modification uses cloudscraper to handle the HTTP request, which can bypass Cloudflare's basic WAF protection. The cloudscraper.create_scraper() function creates a session that can handle Cloudflare's anti-bot page. The headers dictionary is used to set the User-Agent if provided [1].
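
For trying the same fetch logic locally, outside the tool class, a minimal sketch might look like the following. `fetch_page` and its `session` parameter are hypothetical names introduced here, not part of Dify, and the 10-second timeout is an arbitrary choice. Since cloudscraper may not solve every Cloudflare challenge type, the sketch raises on a non-2xx status instead of returning the challenge page as text.

```python
from typing import Any, Optional


def fetch_page(url: str, user_agent: str = "", session: Optional[Any] = None,
               timeout: float = 10.0) -> str:
    """Fetch a page and return its text (hypothetical helper, not a Dify API)."""
    if not url:
        raise ValueError("Please input url")

    if session is None:
        # Imported lazily so the helper can be exercised with a stand-in session.
        import cloudscraper
        session = cloudscraper.create_scraper()

    # Only override the User-Agent when the caller supplies one.
    headers = {"User-Agent": user_agent} if user_agent else {}
    response = session.get(url, headers=headers, timeout=timeout)
    # Surface 403/503 challenge responses as errors instead of page content.
    response.raise_for_status()
    return response.text
```

Any object with a `requests.Session`-style `get` method can be passed as `session`, which keeps the helper testable without network access.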


Weaxs commented 1 month ago

I configured Cloudflare WAF rules to skip webscraper requests, but it doesn't work.
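
One way to check whether a skip rule takes effect is to inspect the headers of the response the scraper actually receives. The sketch below assumes Cloudflare marks challenge pages with a `cf-mitigated: challenge` response header, as described in Cloudflare's challenge documentation. `is_cloudflare_challenge` is a hypothetical helper name, not part of Dify or cloudscraper.

```python
from typing import Mapping


def is_cloudflare_challenge(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers look like a Cloudflare challenge."""
    # Header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("cf-mitigated", "").strip().lower() == "challenge"
```

Passing the `headers` attribute of a `requests` or cloudscraper response to this function indicates whether the request was still challenged after the rule change.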

langgenius / dify