Closed Greatz08 closed 8 months ago
🤖
Hey there @HakaishinShwet! Fancy seeing you back here 😎
Based on the information provided, it seems that the web scraper in the dify repository might not be working on http websites due to the lack of support for http in the `get_url` function. This function uses the `requests` library to fetch the URL and return its contents as a string. However, the `requests` library does not support http by default due to security reasons.

Here is the `get_url` function:
```python
def get_url(url: str, user_agent: str = None) -> str:
    """Fetch URL and return the contents as a string."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    if user_agent:
        headers["User-Agent"] = user_agent

    supported_content_types = file_extractor.SUPPORT_URL_CONTENT_TYPES + ["text/html"]

    head_response = requests.head(url, headers=headers, allow_redirects=True, timeout=(5, 10))

    if head_response.status_code != 200:
        return "URL returned status code {}.".format(head_response.status_code)

    # check content-type
    main_content_type = head_response.headers.get('Content-Type').split(';')[0].strip()
    if main_content_type not in supported_content_types:
        return "Unsupported content-type [{}] of URL.".format(main_content_type)

    if main_content_type in file_extractor.SUPPORT_URL_CONTENT_TYPES:
        return FileExtractor.load_from_url(url, return_text=True)

    response = requests.get(url, headers=headers, allow_redirects=True, timeout=(5, 30))
    a = extract_using_readabilipy(response.text)
    if not a['plain_text'] or not a['plain_text'].strip():
        return get_url_from_newspaper3k(url)

    res = FULL_TEMPLATE.format(
        title=a['title'],
        authors=a['byline'],
        publish_date=a['date'],
        top_image="",
        text=a['plain_text'] if a['plain_text'] else "",
    )
    return res
```
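The content-type gate in the middle of `get_url` can be isolated into a small standalone helper for inspection; this is only a sketch of that parsing step (the helper name `main_content_type` and the simplified logic are ours, not part of dify):

```python
def main_content_type(content_type_header, supported):
    # Mirror get_url's parsing: keep only the media type, dropping
    # parameters such as "; charset=utf-8", then check it is supported.
    if not content_type_header:
        return None
    main = content_type_header.split(';')[0].strip()
    return main if main in supported else None

print(main_content_type("text/html; charset=utf-8", ["text/html"]))  # text/html
print(main_content_type("image/png", ["text/html"]))                 # None
```

Note that a missing `Content-Type` header returns `None` here, whereas the real `get_url` would raise an `AttributeError` on `None.split(';')`, which is a separate edge case from the http issue discussed in this thread.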
To fix this issue, you might need to add support for http in the `get_url` function. However, please note that using http instead of https can expose your application to security risks such as man-in-the-middle attacks. It's recommended to use https whenever possible.

You can find this function in the `web_reader_tool.py` file.
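If a caller really wants to upgrade plain-http URLs before fetching, one option is to rewrite the scheme with the standard library; this is purely illustrative and not part of dify's code:

```python
from urllib.parse import urlsplit, urlunsplit

def upgrade_to_https(url):
    # Rewrite the scheme of a plain-http URL to https, leaving the
    # host, path, query, and fragment untouched.
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

print(upgrade_to_https("http://www.jamessawyer.co.uk/pub/gpt_jb.html"))
# https://www.jamessawyer.co.uk/pub/gpt_jb.html
```

Whether this helps depends on the target server: a site that only serves http (or has a broken certificate, as in this issue) will still fail after the rewrite.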
Also, the dify repository does have a specific module dedicated to handling web scraping tasks. This is evident from the provided context, where a YAML file named `webscraper.yaml` is present in the path `api/core/tools/provider/builtin/webscraper/`. The description within the file also confirms that this module is a "Web Scrapper tool kit used to scrape web". You can find the file at that path.
I hope this helps! If you have any other questions, feel free to ask.
Which website are you trying to crawl? Could you please provide a reference link to it so that I can run a test?
http://www.jamessawyer.co.uk/pub/gpt_jb.html is the website I was testing to see whether the scraper could fetch info, and it gave a certificate error. @Yeuoly
Which model are you using? It works in my local environment; the model you are using may be transforming the scheme from http into https.
@Yeuoly I was using Google Gemini, so I guess this is a model issue. Thanks for testing and identifying the cause. If you have any more findings from testing this issue, you can add them here in a comment; otherwise you can close this issue :-)
Thanks for your report.
Self Checks
Provide a description of the help you need

Any reasons for the scraper not working on http websites? :-)