langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

[just question] web scraper not working for http based urls? #2471

Closed · Greatz08 closed this issue 8 months ago

Greatz08 commented 8 months ago


Provide a description of the help you need

Any reason the scraper isn't working on http websites? :-)

dosubot[bot] commented 8 months ago

🤖

Hey there @HakaishinShwet! Fancy seeing you back here 😎

Based on the information provided, it seems that the web scraper in the dify repository might not be working on http websites because of how the get_url function handles them. This function uses the requests library to fetch the URL and return its contents as a string. Note that requests itself can fetch plain-http URLs, so the limitation is not in the library; fetching over http is, however, generally discouraged for security reasons.

Here is the get_url function:

import requests

# file_extractor, FileExtractor, extract_using_readabilipy,
# get_url_from_newspaper3k and FULL_TEMPLATE are module-level names
# defined alongside this function in web_reader_tool.py.

def get_url(url: str, user_agent: str = None) -> str:
    """Fetch URL and return the contents as a string."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    if user_agent:
        headers["User-Agent"] = user_agent

    supported_content_types = file_extractor.SUPPORT_URL_CONTENT_TYPES + ["text/html"]

    # Probe the URL first; the tuple is (connect, read) timeouts in seconds.
    head_response = requests.head(url, headers=headers, allow_redirects=True, timeout=(5, 10))

    if head_response.status_code != 200:
        return "URL returned status code {}.".format(head_response.status_code)

    # Check the content type, guarding against a missing Content-Type header.
    main_content_type = head_response.headers.get('Content-Type', '').split(';')[0].strip()
    if main_content_type not in supported_content_types:
        return "Unsupported content-type [{}] of URL.".format(main_content_type)

    # Non-HTML document types are delegated to the file extractor.
    if main_content_type in file_extractor.SUPPORT_URL_CONTENT_TYPES:
        return FileExtractor.load_from_url(url, return_text=True)

    response = requests.get(url, headers=headers, allow_redirects=True, timeout=(5, 30))
    article = extract_using_readabilipy(response.text)

    # Fall back to newspaper3k if readability produced no usable text.
    if not article['plain_text'] or not article['plain_text'].strip():
        return get_url_from_newspaper3k(url)

    return FULL_TEMPLATE.format(
        title=article['title'],
        authors=article['byline'],
        publish_date=article['date'],
        top_image="",
        text=article['plain_text'] or "",
    )

If you need http URLs to work, you could normalize or validate them before they reach the get_url function. However, please note that using http instead of https can expose your application to security risks such as man-in-the-middle attacks, so it's recommended to use https whenever possible.
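
A minimal sketch of that idea, assuming URLs are pre-processed before they reach get_url (normalize_scheme is a hypothetical helper, not part of Dify): it tries upgrading http to https and keeps http only when the https probe fails.

from urllib.parse import urlparse, urlunparse

import requests


def normalize_scheme(url: str) -> str:
    # Hypothetical helper: try upgrading a plain-http URL to https,
    # keeping the original URL if the https endpoint does not respond.
    parts = urlparse(url)
    if parts.scheme != "http":
        return url
    https_url = urlunparse(parts._replace(scheme="https"))
    try:
        requests.head(https_url, allow_redirects=True, timeout=(5, 10))
        return https_url
    except requests.RequestException:
        # Covers certificate failures and timeouts; fall back to http.
        return url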

You can find this function in the web_reader_tool.py file.

Also, the dify repository does have a module dedicated to handling web scraping tasks: a YAML file named 'webscraper.yaml' in the path 'api/core/tools/provider/builtin/webscraper/'. The description in that file confirms that this module is a "Web Scrapper tool kit used to scrape web". You can find this file at the webscraper.yaml path.

I hope this helps! If you have any other questions, feel free to ask.

Yeuoly commented 8 months ago

Which website are you trying to crawl? Could you please provide a reference link so that I can run a test?

Greatz08 commented 8 months ago

@Yeuoly http://www.jamessawyer.co.uk/pub/gpt_jb.html is the website I was testing to see if the scraper could fetch info, and it gave a certificate error.

Yeuoly commented 8 months ago

Which model are you using? It does work in my local environment; the model you are using may be transforming the scheme from http into https.

[screenshot: the scraper fetching the page successfully in Yeuoly's local environment]
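
For reference, a minimal check outside Dify (a sketch assuming only the requests library) that contrasts the two schemes for the URL reported above:

import requests

URL = "http://www.jamessawyer.co.uk/pub/gpt_jb.html"

# Plain http: this is what the scraper should be requesting.
print("http:", requests.get(URL, timeout=(5, 30)).status_code)

# Forcing https: if the host lacks a valid certificate, this raises
# SSLError, which would match the certificate error reported above.
try:
    https_url = URL.replace("http://", "https://", 1)
    print("https:", requests.get(https_url, timeout=(5, 30)).status_code)
except requests.exceptions.SSLError as exc:
    print("https certificate error:", exc)

If the first request succeeds while the second raises SSLError, the scraper itself is fine and the https rewrite is happening upstream of it.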
Greatz08 commented 8 months ago

@Yeuoly I was using Google Gemini, so I guess this is a model issue. Thanks for testing and identifying the cause. If you have any more information from testing to add about this issue, feel free to comment here; otherwise you can close it :-)

Yeuoly commented 8 months ago

Thanks for your report.