langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

WebscraperTool cannot bypass Cloudflare #6232

Closed Weaxs closed 1 month ago

Weaxs commented 1 month ago

Self Checks

1. Is this request related to a challenge you're experiencing? Tell me about your story.

WebscraperTool cannot get past Cloudflare's basic WAF protection.

The problem can be reproduced whether the webscraper tool is used in a workflow or in an agent.

Cloudflare JavaScript detections
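
To confirm that a failed scrape was caused by a Cloudflare challenge rather than an ordinary error, the response itself can be inspected. The sketch below is a heuristic, not Dify code: the function name `looks_like_cloudflare_challenge` is hypothetical, and the markers it checks (a 403 or 503 status, a `Server: cloudflare` header, a `cf-mitigated: challenge` header, and typical challenge-page strings) are assumptions about what a challenge response usually looks like. They may change over time.

```python
from typing import Mapping

# Assumed markers of a Cloudflare challenge response; these are heuristics only.
CHALLENGE_STATUS_CODES = {403, 503}
CHALLENGE_BODY_MARKERS = ("just a moment...", "challenge-platform")


def looks_like_cloudflare_challenge(status_code: int,
                                    headers: Mapping[str, str],
                                    body: str) -> bool:
    """Return True if the response resembles a Cloudflare challenge page."""
    # Header names are case-insensitive, so normalize names and values first.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}

    # An explicit challenge header is treated as the strongest signal.
    if lowered.get("cf-mitigated") == "challenge":
        return True

    # Otherwise require a Cloudflare-served error status plus a body marker.
    served_by_cloudflare = "cloudflare" in lowered.get("server", "")
    if not served_by_cloudflare or status_code not in CHALLENGE_STATUS_CODES:
        return False

    body_lower = body.lower()
    return any(marker in body_lower for marker in CHALLENGE_BODY_MARKERS)
```

Passing in the status code, headers, and body of the response that the webscraper tool received would show whether the request was challenged or simply failed for another reason.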


2. Additional context or comments

I found some modules that might fix this. They may be useful for reference:

3. Can you help us with this feature?

dosubot[bot] commented 1 month ago

To modify the WebscraperTool to bypass Cloudflare's basic WAF protection, you can integrate the cloudscraper module into the _invoke method. Here’s how you can do it:

  1. Install the cloudscraper module:

    pip install cloudscraper

  2. Modify the WebscraperTool class to use cloudscraper for fetching the webpage:

from typing import Any, Union
import cloudscraper

from core.tools.entities.tool_entities import ToolInvokeMessage
from core.tools.errors import ToolInvokeError
from core.tools.tool.builtin_tool import BuiltinTool

class WebscraperTool(BuiltinTool):
    def _invoke(self,
                user_id: str,
                tool_parameters: dict[str, Any],
                ) -> Union[ToolInvokeMessage, list[ToolInvokeMessage]]:
        """
        Fetch the page at the given URL and return it, optionally summarized.
        """
        try:
            url = tool_parameters.get('url', '')
            user_agent = tool_parameters.get('user_agent', '')
            if not url:
                return self.create_text_message('Please input url')

            # Use cloudscraper to get webpage
            scraper = cloudscraper.create_scraper()
            headers = {'User-Agent': user_agent} if user_agent else {}
            result = scraper.get(url, headers=headers).text

            if tool_parameters.get('generate_summary'):
                # summarize and return
                return self.create_text_message(self.summary(user_id=user_id, content=result))
            else:
                # return full webpage
                return self.create_text_message(result)
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise ToolInvokeError(str(e)) from e

This modification uses cloudscraper to handle the HTTP request, which can bypass Cloudflare's basic WAF protection. The cloudscraper.create_scraper() function creates a session that can handle Cloudflare's anti-bot page. The headers dictionary is used to set the User-Agent if provided [1].
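
Because the object returned by `cloudscraper.create_scraper()` behaves like a `requests` session, the fetch step can be isolated into a small helper that takes the session as a parameter. The sketch below is one possible shape, not the Dify implementation: `fetch_html`, `FetchError`, `SessionLike`, and the 10-second default timeout are hypothetical names and values chosen for illustration. It adds a timeout and a status check, so a page that is still blocked surfaces as an error instead of being returned as content.

```python
from typing import Any, Protocol


class FetchError(Exception):
    """Hypothetical error raised when the page could not be fetched."""


class SessionLike(Protocol):
    """Anything with a requests-style get(), such as a cloudscraper session."""

    def get(self, url: str, **kwargs: Any) -> Any: ...


def fetch_html(session: SessionLike,
               url: str,
               user_agent: str = "",
               timeout: float = 10.0) -> str:
    """Fetch `url` through `session` and return the response body as text."""
    # Only override the User-Agent when the caller supplied one.
    headers = {"User-Agent": user_agent} if user_agent else {}

    # A timeout keeps a stalled connection from blocking the tool indefinitely.
    response = session.get(url, headers=headers, timeout=timeout)

    # Treat any non-2xx status (for example a 403 block page) as a failure.
    if not 200 <= response.status_code < 300:
        raise FetchError(f"GET {url} returned HTTP {response.status_code}")
    return response.text
```

With cloudscraper installed, `fetch_html(cloudscraper.create_scraper(), url)` would route the request through it, while passing `requests.Session()` gives a plain request for comparison.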


Weaxs commented 1 month ago

I configured Cloudflare WAF rules to skip webscraper requests, but it still doesn't work.
