Open TChengZ opened 2 weeks ago
A similar issue regarding garbled text was previously reported and resolved. The solution involved ensuring that the content fetched by the web crawler or document extractor is correctly interpreted as UTF-8. You can set the encoding explicitly in your code to handle this. For example, if you are using Python's requests library, you can set the encoding like this:
import requests
url = 'https://your-url-here' # Replace with your target URL
response = requests.get(url)
# Ensure the response is in UTF-8 encoding
response.encoding = 'utf-8'
# Now you can process the response text
content = response.text
print(content)
This approach ensures that the content is correctly interpreted as UTF-8, which should help in resolving the garbled text issue [1].
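Note that `response.encoding` only changes how `requests` decodes a text body into `response.text`; it has no effect on binary payloads. Since the URLs in this issue are described as PDFs, it may be worth checking the payload type before treating it as text. The following is a minimal sketch of such a check, not part of Dify; the helper name `looks_like_pdf` and the 1024-byte search window are assumptions made for illustration.

```python
PDF_MAGIC = b"%PDF-"  # every PDF file carries this header marker


def looks_like_pdf(data: bytes, content_type: str = "") -> bool:
    """Return True if the payload appears to be a PDF rather than text.

    Hypothetical helper: `data` would be response.content and
    `content_type` the Content-Type header value, if one was sent.
    """
    if "application/pdf" in content_type.lower():
        return True
    # Some producers emit a few stray bytes before the header,
    # so search a short prefix instead of only position 0 (assumed window).
    return PDF_MAGIC in data[:1024]
```

With `requests`, this could be called as `looks_like_pdf(response.content, response.headers.get("Content-Type", ""))`. If it returns True, the bytes need a PDF text extractor, and setting a text encoding will not help.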
It seems the PDF uses some kind of encryption, because when you open the PDF in your browser and copy the text, the copied text is also garbled.
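If copying from a browser also yields garbled text, the problem likely lies in the PDF's text layer itself rather than in the extractor. One common cause of this symptom, not confirmed for these files, is embedded fonts that lack a usable Unicode mapping. A rough way to flag such output is to measure how many characters fall into private-use, unassigned, or control categories. The sketch below is a heuristic under that assumption; the function name `suspicious_ratio` is hypothetical.

```python
import unicodedata

# Unicode general categories that rarely appear in ordinary prose:
# Co = private use, Cn = unassigned, Cc = control characters.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc"}


def suspicious_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that look like debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        # U+FFFD is the replacement character emitted on decode failures.
        if c == "\ufffd" or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(chars)
```

A high ratio suggests the text layer cannot be trusted and that OCR on the rendered pages may be the only option. The heuristic will miss garbling that maps glyphs to valid but wrong letters, so a low ratio does not prove the text is correct.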
Self Checks
Dify version
0.11.0
Cloud or Self Hosted
Cloud
Steps to reproduce
1. Add a document extractor to the workflow.
2. Pass the URL https://ir.mi.com/static-files/eb0d2a5b-f600-477d-9411-34c17adb819f to the document extractor. The output is garbled text.
3. The same happens with https://ir.group-lotus.com/static-files/c4a4e19d-4f3f-4038-8e16-c632b4fc2833
✔️ Expected Behavior
The document extractor returns the document's text in readable form.
❌ Actual Behavior
The document cannot be extracted; the output is garbled text.