Closed xsxs89757 closed 8 months ago
🤖
Hello again, @xsxs89757! It's good to see you back here contributing to the dify project. Let's see what we can do about this issue you've encountered.
Based on the error message you provided, it seems like the ExtractArticle.js script is failing with a SIGABRT signal. This script is part of the webscraper tool in dify and is used to parse HTML content from webpages.
There could be several reasons for this issue:
Node.js Environment: The script is run using Node.js. If Node.js is not installed or not properly configured in the environment where the Python script is running, it could cause the script to fail.
Script File Location: The script is expected to be located in the javascript directory under the readabilipy module's path. If the script is not found in this location, it could cause the script to fail.
Input/Output File Issues: The script expects an input HTML file and writes the output to a JSON file. If there are any issues with file permissions, disk space, or other file I/O related issues, it could cause the script to fail.
HTML Parsing Issues: The script parses the HTML content of a webpage. If there are any issues with the HTML content (e.g., malformed HTML), it could cause the script to fail.
You can find the relevant code in the extract_using_readabilipy function in the web_reader_tool.py file.
To resolve this issue, please ensure that Node.js is installed and properly configured, the ExtractArticle.js script is in the correct location, and that there are no issues with file permissions or disk space. Also, verify that the HTML content being parsed is well-formed.
If the issue persists after checking these potential causes, please provide more details about your environment and the specific webpage you are trying to scrape. This will help in further diagnosing the problem.
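The checks listed above can be gathered into one small preflight report. This is only a sketch and not dify's own code: the function name `preflight` and the report keys are made up for illustration, and it assumes the script lives at `javascript/ExtractArticle.js` inside the installed readabilipy package, as described above.

```python
import importlib.util
import os
import shutil
import tempfile


def preflight(module_name="readabilipy", script="ExtractArticle.js"):
    """Collect basic environment facts (hypothetical helper, not part of dify)."""
    report = {}

    # Is a `node` executable visible on PATH for this Python process?
    report["node_path"] = shutil.which("node")

    # Locate the installed package without importing it.
    spec = importlib.util.find_spec(module_name)
    pkg_dir = os.path.dirname(spec.origin) if spec and spec.origin else None
    script_path = os.path.join(pkg_dir, "javascript", script) if pkg_dir else None
    report["script_path"] = script_path
    report["script_exists"] = bool(script_path and os.path.isfile(script_path))

    # The tool writes its HTML input and JSON output to the temp directory.
    tmp_dir = tempfile.gettempdir()
    report["tmp_dir"] = tmp_dir
    report["tmp_writable"] = os.access(tmp_dir, os.W_OK)
    report["tmp_free_bytes"] = shutil.disk_usage(tmp_dir).free
    return report


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

Running this under the same user and process manager as the failing service matters, because PATH and permissions can differ from an interactive shell.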
Can you provide the url you are using? @xsxs89757
Any URL triggers the error.
In the same environment, I wrote a test script to ensure that there were no problems with the environment.
Does this issue also come up if you deploy using Docker Compose? My guess is that it is related to the local Node.js environment, since the webscraper tool needs to run Node.js code.
You can use node -v to check your Node.js version. By the way, I'm using v18.19.0 here. If your version is lower than this, you might want to consider upgrading it.
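Since the tool launches node from Python, it can help to run the version check from Python as well, so that it sees the same PATH as the service. A minimal sketch follows; the helper names `parse_node_version` and `installed_node_version` are hypothetical, and the v18.19.0 baseline is simply the version mentioned above, not a documented requirement.

```python
import shutil
import subprocess


def parse_node_version(text):
    """Turn a string such as 'v18.17.1' into the tuple (18, 17, 1)."""
    core = text.strip().lstrip("v").split("-")[0]  # drop any pre-release suffix
    return tuple(int(part) for part in core.split("."))


def installed_node_version():
    """Return node's version tuple, or None when node is not on PATH."""
    node = shutil.which("node")
    if node is None:
        return None
    result = subprocess.run([node, "-v"], capture_output=True, text=True, check=True)
    return parse_node_version(result.stdout)


if __name__ == "__main__":
    baseline = parse_node_version("v18.19.0")  # version quoted in this thread
    found = installed_node_version()
    if found is None:
        print("node not found on PATH")
    elif found < baseline:
        print(f"node {found} is older than {baseline}")
    else:
        print(f"node {found} is at least {baseline}")
```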
My Node.js version is 18.17.1. In the same environment, I wrote a Python test script, and running the same node command from it works without any problem.
This issue depends on your local environment, and it's quite a complex problem. When deploying from source code, the environment and dependencies should match our provided Dockerfile as closely as possible. Normally, we suggest that users deploy our service with Docker to keep the environment consistent.
Modified tools/utils/web_reader_tool.py
def extract_using_readabilipy(html):
    with tempfile.NamedTemporaryFile(delete=False, mode='w+') as f_html:
        f_html.write(html)
        f_html.close()
    html_path = f_html.name

    # Call Mozilla's Readability.js Readability.parse() function via node, writing output to a temporary file
    article_json_path = html_path + ".json"
    jsdir = os.path.join(find_module_path('readabilipy'), 'javascript')
    with chdir(jsdir):
        try:
            p = subprocess.check_output(["node", "ExtractArticle.js", "-i", html_path, "-o", article_json_path], stderr=subprocess.STDOUT, preexec_fn=os.setsid)
            print("OUT:", p)
        except Exception as e:
            print("ERR:", e)
            print(str(e.output))
        # subprocess.check_call(["node", "ExtractArticle.js", "-i", html_path, "-o", article_json_path])

    # Read output of call to Readability.parse() from JSON file and return as Python dictionary
    with open(article_json_path, "r", encoding="utf-8") as json_file:
        input_json = json.loads(json_file.read())

    # Deleting files after processing
    os.unlink(article_json_path)
    os.unlink(html_path)

    article_json = {
        "title": None,
        "byline": None,
        "date": None,
        "content": None,
        "plain_content": None,
        "plain_text": None
    }
    # Populate article fields from readability fields where present
    if input_json:
        if "title" in input_json and input_json["title"]:
            article_json["title"] = input_json["title"]
        if "byline" in input_json and input_json["byline"]:
            article_json["byline"] = input_json["byline"]
        if "date" in input_json and input_json["date"]:
            article_json["date"] = input_json["date"]
        if "content" in input_json and input_json["content"]:
            article_json["content"] = input_json["content"]
            article_json["plain_content"] = plain_content(article_json["content"], False, False)
            article_json["plain_text"] = extract_text_blocks_as_plain_text(article_json["plain_content"])
        if "textContent" in input_json and input_json["textContent"]:
            article_json["plain_text"] = input_json["textContent"]
            article_json["plain_text"] = re.sub(r'\n\s*\n', '\n', article_json["plain_text"])
    return article_json
log
[on_tool_start] ToolCall:webscraper
0|flask-app | {'url': 'https://www.baidu.com', 'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.1000.0 Safari/537.36'}
0|flask-app | ERR: Command '['node', 'ExtractArticle.js', '-i', '/tmp/tmp_wc_0orp', '-o', '/tmp/tmp_wc_0orp.json']' died with <Signals.SIGABRT: 6>.
0|flask-app | b''
After adding try except, the program no longer reports an error and can be used. This error report is meaningless.
Yeah, try-except can handle error issues, but it seems like the return of this tool will be replaced with an error message. I still think it's necessary to figure out why the error occurs. Besides NodeJS, could you please provide us with the Python version, system version, and running permissions for reference?
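One way to keep the workaround while still recording why the child process died is to catch `subprocess.CalledProcessError` specifically and decode its return code. On POSIX, a negative return code means the child was killed by that signal number, which is where the SIGABRT in the log comes from. The sketch below is an illustration only: `run_and_describe` is a hypothetical helper, not part of dify or readabilipy, and it does not explain why node aborts.

```python
import signal
import subprocess
import sys


def run_and_describe(cmd):
    """Run cmd and return (ok, description) instead of silently swallowing failures."""
    try:
        done = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, check=True)
        return True, done.stdout.decode(errors="replace")
    except subprocess.CalledProcessError as e:
        if e.returncode < 0:
            # POSIX: a negative return code is the number of the signal that killed the child.
            try:
                name = signal.Signals(-e.returncode).name
            except ValueError:
                name = f"signal {-e.returncode}"
            reason = f"killed by {name}"
        else:
            reason = f"exit status {e.returncode}"
        output = (e.output or b"").decode(errors="replace")
        return False, f"{reason}; output: {output!r}"
    except FileNotFoundError:
        return False, f"executable not found: {cmd[0]}"


if __name__ == "__main__":
    # Demonstration with a child that aborts itself, standing in for the node call.
    print(run_and_describe([sys.executable, "-c", "import os; os.abort()"]))
```

Because the thread reports that the JSON output is still readable after the abort, a caller could combine this with an `os.path.exists` check on the output file before deciding whether the failure is fatal.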
System: TencentOS Server release 3.1 (Final). Python: 3.10.12. Running as root.
import os
import subprocess
import tempfile
import site


def find_module_path(module_name):
    for package_path in site.getsitepackages():
        potential_path = os.path.join(package_path, module_name)
        if os.path.exists(potential_path):
            return potential_path
    return None


def test_extract_article_js():
    module_path = find_module_path('readabilipy')
    if module_path is None:
        print("Not Found Readabilipy")
        return
    jsdir = os.path.join(module_path, 'javascript')
    js_file_path = os.path.join(jsdir, 'ExtractArticle.js')
    if not os.path.exists(js_file_path):
        print(f"Not Found JavaScript: {js_file_path}")
        return
    else:
        print(f"Found JavaScript: {js_file_path}")
    with tempfile.NamedTemporaryFile(mode='w+', delete=False) as f_html:
        f_html.write("<html><body>This is a test.</body></html>")
        html_path = f_html.name
    json_path = html_path + ".json"
    try:
        subprocess.check_call(["node", js_file_path, "-i", html_path, "-o", json_path])
        print(f"Success Node.js: {json_path}")
    except subprocess.CalledProcessError as e:
        print(f"Node.js Error: {e}")
    except FileNotFoundError:
        print("Not found Node.js")
    os.unlink(html_path)
    if os.path.exists(json_path):
        os.unlink(json_path)


test_extract_article_js()
Above is my test script. Running it in the same environment does not report an error.
If possible, deploy with Docker Compose first. We'll set up a server running TencentOS Server release 3.1 (Final) later to verify this issue.
I can now run normally after using try-except, and I can run it in this way for the time being
So you mean, even if an exception is thrown, the output can still read the webpage content, right? Maybe it's an issue with not enough memory? 🤔
It's not an out-of-memory problem, and the same code in my test script doesn't throw an exception. Yes, the content of the web page can now be read normally once the exception is caught.
I've launched an SA5.LARGE8 instance running TencentOS 3.1 and ran the Dify source code on it. Sorry to say, but I couldn't reproduce the problem you mentioned.
I tried redeploying on the same system and found no problem; only the instance running online has this problem. For now I'm using try-except as a workaround.
Self Checks
Dify version
0.5.0
Cloud or Self Hosted
Self Hosted (Source)
Steps to reproduce
Open the webscraper tool and let it fetch a URL; it reports an error.
✔️ Expected Behavior
search url
❌ Actual Behavior
tool invoke error: Command '['node', 'ExtractArticle.js', '-i', '/tmp/tmpsukdz5ht', '-o', '/tmp/tmpsukdz5ht.json']' died with <Signals.SIGABRT: 6>.