Document extractor parses garbled text

TChengZ commented 2 weeks ago

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.11.0

Cloud or Self Hosted

Cloud

Steps to reproduce

1. Add a document extractor to the workflow.
2. Pass the URL https://ir.mi.com/static-files/eb0d2a5b-f600-477d-9411-34c17adb819f to the document extractor. The output is garbled text.
3. The same thing happens with https://ir.group-lotus.com/static-files/c4a4e19d-4f3f-4038-8e16-c632b4fc2833

✔️ Expected Behavior

The document extractor returns readable text.

❌ Actual Behavior

The document can't be extracted; the output is garbled text.

dosubot[bot] commented 2 weeks ago

A similar issue regarding garbled text was previously reported and resolved. The solution involved ensuring that the content fetched by the web crawler or document extractor is correctly interpreted as UTF-8. You can set the encoding explicitly in your code to handle this. For example, if you are using Python's requests library, you can set the encoding like this:

import requests

url = 'https://your-url-here'  # Replace with your target URL
response = requests.get(url)

# Ensure the response is in UTF-8 encoding
response.encoding = 'utf-8'

# Now you can process the response text
content = response.text
print(content)

This approach ensures that the content is correctly interpreted as UTF-8, which should help in resolving the garbled text issue ^[1].
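
Setting the encoding only helps when the URL actually returns text. The next comment in this thread refers to the file as a PDF, and a PDF is binary data that should not be decoded as UTF-8 at all. A minimal sketch of a guard for that case, assuming the payload has already been downloaded as bytes; `looks_like_pdf` and `decode_if_text` are hypothetical helper names, not part of Dify or requests:

```python
from typing import Optional


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker."""
    # PDF files begin with "%PDF-"; tolerate leading whitespace before it.
    return data[:1024].lstrip().startswith(b"%PDF-")


def decode_if_text(data: bytes, encoding: str = "utf-8") -> Optional[str]:
    """Decode the payload as text, or return None when it is a PDF."""
    if looks_like_pdf(data):
        # Binary PDF content needs a PDF parser, not a text decoder.
        return None
    # Undecodable bytes become U+FFFD instead of raising an exception.
    return data.decode(encoding, errors="replace")
```

With requests, the raw bytes would come from `response.content` rather than `response.text`.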


hjlarry commented 2 weeks ago

It seems the PDF uses some kind of encryption, because when you open the PDF in your browser and copy the text, the copied text is also garbled.
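
If the garbling comes from the PDF itself, a workflow can at least flag suspicious output instead of passing it on. A rough sketch of such a check, under assumptions the thread does not confirm: it supposes the garbled text contains many private-use, unassigned, control, or replacement characters, and the `0.3` threshold and the names `garbled_ratio` and `looks_garbled` are arbitrary and hypothetical. Text that is garbled into ordinary but wrong letters would not be caught by this heuristic.

```python
import unicodedata

# Unicode general categories that rarely appear in readable prose:
# Co = private use, Cn = unassigned, Cs = surrogate, Cc = control.
SUSPECT_CATEGORIES = {"Co", "Cn", "Cs", "Cc"}


def garbled_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that look suspicious."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspect = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPECT_CATEGORIES
    )
    return suspect / len(chars)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Flag text whose suspicious-character share reaches the threshold."""
    return garbled_ratio(text) >= threshold
```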

langgenius / dify