webscraper error - GitHub Issues

xsxs89757 commented 9 months ago

Self Checks

[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to file this report (I have read and agree to the Language Policy).

Dify version

0.5.0

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

Open the webscraper tool and have it fetch a URL. It reports an error.

✔️ Expected Behavior

search url

❌ Actual Behavior

tool invoke error: Command '['node', 'ExtractArticle.js', '-i', '/tmp/tmpsukdz5ht', '-o', '/tmp/tmpsukdz5ht.json']' died with <Signals.SIGABRT: 6>.

dosubot[bot] commented 9 months ago

🤖

Hello again, @xsxs89757! It's good to see you back here contributing to the dify project. Let's see what we can do about this issue you've encountered.

Based on the error message you provided, it seems like the ExtractArticle.js script is failing with a SIGABRT signal. This script is part of the webscraper tool in dify and is used to parse HTML content from webpages.

There could be several reasons for this issue:

Node.js Environment: The script is run using Node.js. If Node.js is not installed or not properly configured in the environment where the Python script is running, it could cause the script to fail.
Script File Location: The script is expected to be located in the javascript directory under the readabilipy module's path. If the script is not found in this location, it could cause the script to fail.
Input/Output File Issues: The script expects an input HTML file and writes the output to a JSON file. If there are any issues with file permissions, disk space, or other file I/O related issues, it could cause the script to fail.
HTML Parsing Issues: The script parses the HTML content of a webpage. If there are any issues with the HTML content (e.g., malformed HTML), it could cause the script to fail.

You can find the relevant code in the extract_using_readabilipy function in the web_reader_tool.py file.

To resolve this issue, please ensure that Node.js is installed and properly configured, the ExtractArticle.js script is in the correct location, and that there are no issues with file permissions or disk space. Also, verify that the HTML content being parsed is well-formed.

If the issue persists after checking these potential causes, please provide more details about your environment and the specific webpage you are trying to scrape. This will help in further diagnosing the problem.
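As a rough starting point, the checks listed above could be scripted along the following lines. This is only a sketch under stated assumptions: `check_webscraper_env` is a hypothetical helper that is not part of dify or readabilipy, and it assumes the script lives in a `javascript` directory under the readabilipy module path, as described above.

```python
import importlib.util
import os
import shutil
import tempfile


def check_webscraper_env(module_name="readabilipy", script="ExtractArticle.js"):
    """Collect basic facts about the environment the scraper runs in.

    Hypothetical helper for illustration; not part of dify or readabilipy.
    """
    report = {}

    # 1. Is a node binary visible on PATH for this process?
    report["node_path"] = shutil.which("node")

    # 2. Is the script where the tool expects it (<module>/javascript/<script>)?
    spec = importlib.util.find_spec(module_name)
    if spec is not None and spec.submodule_search_locations:
        module_dir = list(spec.submodule_search_locations)[0]
        script_path = os.path.join(module_dir, "javascript", script)
        report["script_path"] = script_path
        report["script_exists"] = os.path.exists(script_path)
    else:
        report["script_path"] = None
        report["script_exists"] = False

    # 3. Is the temp directory writable, and how much space is free there?
    tmp_dir = tempfile.gettempdir()
    report["tmp_dir"] = tmp_dir
    report["tmp_writable"] = os.access(tmp_dir, os.W_OK)
    report["tmp_free_bytes"] = shutil.disk_usage(tmp_dir).free

    return report
```

Calling `check_webscraper_env()` from inside the same process that runs the tool (rather than from a separate shell) should give the most relevant picture, since PATH and permissions can differ between the two.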


crazywoola commented 9 months ago

Can you provide the url you are using? @xsxs89757

xsxs89757 commented 9 months ago

Any url will report an error

xsxs89757 commented 9 months ago

In the same environment, I wrote a test script to ensure that there were no problems with the environment.

takatost commented 9 months ago

Does this issue also come up if you deploy with Docker Compose? I'm guessing it might be related to the local Node.js environment, since the webscraper tool needs to run Node.js code.

takatost commented 9 months ago

You can use node -v to check your NodeJS version, btw I'm using v18.19.0 here. If your version is lower than this, you might want to consider upgrading it.
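If it helps, the version comparison can also be done from Python. The sketch below is an illustration only: `parse_node_version` and `installed_node_version` are hypothetical names, and v18.19.0 is used as a reference point simply because it is the version mentioned here, not because it is a documented minimum.

```python
import re
import subprocess


def parse_node_version(text):
    """Turn a string such as 'v18.19.0' into a tuple like (18, 19, 0)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised node version: {text!r}")
    return tuple(int(part) for part in match.groups())


def installed_node_version():
    """Run `node -v` and parse its output.

    Raises FileNotFoundError if no node binary is on PATH.
    """
    output = subprocess.check_output(["node", "-v"], text=True)
    return parse_node_version(output)


# The version mentioned in this thread, used only as a reference point.
REFERENCE = parse_node_version("v18.19.0")
```

Tuples compare element by element, so `installed_node_version() < REFERENCE` is true for any release older than the reference.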

xsxs89757 commented 9 months ago

My Node.js version is 18.17.1. In the same environment, I wrote a Python test script, and running the same node command from it works without any problem.

takatost commented 9 months ago

This issue depends on your local environment, and it's quite a complex problem. When deploying from source code, the environment and dependencies should match our provided Dockerfile as closely as possible. Normally, we suggest that users deploy our service with Docker to keep the environment consistent.

xsxs89757 commented 8 months ago

Modified tools/utils/web_reader_tool.py

import json
import os
import re
import subprocess
import tempfile

# find_module_path, chdir, plain_content and extract_text_blocks_as_plain_text
# are not shown here; they are assumed to be available elsewhere in the file.

def extract_using_readabilipy(html):
    # Write the HTML to a temporary file; the with block closes it on exit
    with tempfile.NamedTemporaryFile(delete=False, mode='w+') as f_html:
        f_html.write(html)
    html_path = f_html.name

    # Call Mozilla's Readability.js Readability.parse() function via node, writing output to a temporary file
    article_json_path = html_path + ".json"
    jsdir = os.path.join(find_module_path('readabilipy'), 'javascript')
    with chdir(jsdir):
        try:
            p = subprocess.check_output(["node", "ExtractArticle.js", "-i", html_path, "-o", article_json_path], stderr=subprocess.STDOUT, preexec_fn=os.setsid)
            print("OUT:", p)
        except subprocess.CalledProcessError as e:
            # Only CalledProcessError carries .output; log it and carry on
            print("ERR:", e)
            print(str(e.output))
        # subprocess.check_call(["node", "ExtractArticle.js", "-i", html_path, "-o", article_json_path])

    # Read output of call to Readability.parse() from JSON file and return as Python dictionary
    with open(article_json_path, "r", encoding="utf-8") as json_file:
        input_json = json.loads(json_file.read())

    # Deleting files after processing
    os.unlink(article_json_path)
    os.unlink(html_path)

    article_json = {
        "title": None,
        "byline": None,
        "date": None,
        "content": None,
        "plain_content": None,
        "plain_text": None
    }
    # Populate article fields from readability fields where present
    if input_json:
        if "title" in input_json and input_json["title"]:
            article_json["title"] = input_json["title"]
        if "byline" in input_json and input_json["byline"]:
            article_json["byline"] = input_json["byline"]
        if "date" in input_json and input_json["date"]:
            article_json["date"] = input_json["date"]
        if "content" in input_json and input_json["content"]:
            article_json["content"] = input_json["content"]
            article_json["plain_content"] = plain_content(article_json["content"], False, False)
            article_json["plain_text"] = extract_text_blocks_as_plain_text(article_json["plain_content"])
        if "textContent" in input_json and input_json["textContent"]:
            article_json["plain_text"] = input_json["textContent"]
            article_json["plain_text"] = re.sub(r'\n\s*\n', '\n', article_json["plain_text"])

    return article_json

log

 [on_tool_start] ToolCall:webscraper
0|flask-app  | {'url': 'https://www.baidu.com', 'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.1000.0 Safari/537.36'}
0|flask-app  | ERR: Command '['node', 'ExtractArticle.js', '-i', '/tmp/tmp_wc_0orp', '-o', '/tmp/tmp_wc_0orp.json']' died with <Signals.SIGABRT: 6>.
0|flask-app  | b''

After adding try except, the program no longer reports an error and can be used. This error report is meaningless.

takatost commented 8 months ago

After adding try except, the program no longer reports an error and can be used. This error report is meaningless.

Yeah, try-except can handle error issues, but it seems like the return of this tool will be replaced with an error message. I still think it's necessary to figure out why the error occurs. Besides NodeJS, could you please provide us with the Python version, system version, and running permissions for reference?
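To keep the failure visible instead of swallowing it, one option is to record which signal ended the child process. The sketch below relies on documented `subprocess` behaviour on POSIX, where a child killed by a signal is reported with a negative return code. `describe_failure`, `run_and_describe`, and `ABORTING_CHILD` are hypothetical names for illustration and are not part of dify.

```python
import signal
import subprocess
import sys


def describe_failure(error):
    """Describe a CalledProcessError, naming the signal if one killed the child."""
    if error.returncode < 0:
        # On POSIX, death by signal N is reported as return code -N.
        name = signal.Signals(-error.returncode).name
        return f"killed by {name}"
    return f"exited with status {error.returncode}"


def run_and_describe(cmd):
    """Run cmd and return (ok, description, captured_output)."""
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return True, "ok", out
    except subprocess.CalledProcessError as e:
        return False, describe_failure(e), e.output


# A child process that aborts itself, as a stand-in for the failing node call.
ABORTING_CHILD = [sys.executable, "-c", "import os; os.abort()"]
```

Running `run_and_describe(ABORTING_CHILD)` reproduces the same kind of failure without node, which can be handy for checking how the surrounding code reacts to a SIGABRT.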

xsxs89757 commented 8 months ago

System: TencentOS Server release 3.1 (Final). Python: 3.10.12. Running as root.

import os
import subprocess
import tempfile
import site

def find_module_path(module_name):
    for package_path in site.getsitepackages():
        potential_path = os.path.join(package_path, module_name)
        if os.path.exists(potential_path):
            return potential_path
    return None

def test_extract_article_js():
    module_path = find_module_path('readabilipy')
    if module_path is None:
        print("Not Found Readabilipy")
        return

    jsdir = os.path.join(module_path, 'javascript')
    js_file_path = os.path.join(jsdir, 'ExtractArticle.js')

    if not os.path.exists(js_file_path):
        print(f"Not Found JavaScript: {js_file_path}")
        return
    else:
        print(f"Found JavaScript: {js_file_path}")

    with tempfile.NamedTemporaryFile(mode='w+', delete=False) as f_html:
        f_html.write("<html><body>This is a test.</body></html>")
        html_path = f_html.name

    json_path = html_path + ".json"

    try:
        subprocess.check_call(["node", js_file_path, "-i", html_path, "-o", json_path])
        print(f"Success Node.js: {json_path}")
    except subprocess.CalledProcessError as e:
        print(f"Node.js Error: {e}")
    except FileNotFoundError:
        print("Not found Node.js")

    os.unlink(html_path)
    if os.path.exists(json_path):
        os.unlink(json_path)

test_extract_article_js()

Above is my test script. When I run it in the same environment, it does not report an error.
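Since the standalone script and the running service may not see exactly the same process context, it could be worth capturing a small snapshot in both places and comparing the two. This is a sketch only, not a diagnosis: `process_context` is a hypothetical helper, the choice of fields is an assumption about what might differ, and the `resource` module is available only on Unix-like systems.

```python
import os
import resource
import sys

# Resource limits that are plausibly relevant to a child process aborting.
# Which ones matter here, if any, is an open question.
LIMIT_NAMES = ("RLIMIT_AS", "RLIMIT_NOFILE", "RLIMIT_NPROC", "RLIMIT_STACK")


def process_context():
    """Return a snapshot of the current process context for comparison."""
    limits = {
        name: resource.getrlimit(getattr(resource, name))
        for name in LIMIT_NAMES
    }
    return {
        "cwd": os.getcwd(),
        "path": os.environ.get("PATH", ""),
        "uid": os.getuid(),
        "python": sys.version.split()[0],
        "limits": limits,  # each value is a (soft, hard) tuple
    }
```

Printing `process_context()` once from the test script and once from inside the service, then diffing the two outputs, would show whether "the same environment" really holds at the process level.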

takatost commented 8 months ago

If possible, deploy with Docker Compose first. We'll set up a server running TencentOS Server release 3.1 (Final) later to verify this issue.

xsxs89757 commented 8 months ago

I can now run normally after using try-except, and I can run it in this way for the time being

takatost commented 8 months ago

I can now run normally after using try-except, and I can run it in this way for the time being

So you mean, even if an exception is thrown, the output can still read the webpage content, right? Maybe it's an issue with not enough memory? 🤔

xsxs89757 commented 8 months ago

It's not an out-of-memory problem, and the same code in my test script doesn't throw an exception. Yes, once the exception is caught, the content of the web page can be read normally.
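A slightly stricter variant of this workaround would accept the result only when the output file exists and parses as JSON, and re-raise the original error otherwise. The sketch below is an assumption about how that could look, not the code used in dify; `run_and_load_json` is a hypothetical name.

```python
import json
import os
import subprocess


def run_and_load_json(cmd, json_path):
    """Run cmd; if it fails, still return the JSON output when it is usable.

    Returns (data, error), where error is None after a clean exit.
    """
    error = None
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as e:
        error = e  # remember the failure instead of discarding it

    if not os.path.exists(json_path):
        # Nothing was written, so the failure is real.
        if error is not None:
            raise error
        raise FileNotFoundError(json_path)

    try:
        with open(json_path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError:
        # A file that does not parse was probably cut off by the crash.
        if error is not None:
            raise error
        raise
    return data, error
```

The caller can then log `error` when it is not `None` and still use `data`, so the abort is recorded rather than silently ignored.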

takatost commented 8 months ago

I've launched an SA5.LARGE8 instance running TencentOS 3.1 and ran the Dify source code on it. Sorry to say, but I couldn't reproduce the problem you mentioned.

xsxs89757 commented 8 months ago

I tried redeploying on the same system and found no problem. Only the instance running online has this issue. For now I'll use try-except as a temporary workaround.

langgenius / dify

webscraper error #2175
