langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Document extractor parses garbled text #10523

Open TChengZ opened 2 weeks ago

TChengZ commented 2 weeks ago


Dify version

0.11.0

Cloud or Self Hosted

Cloud

Steps to reproduce

1. Add a document extractor to the workflow.
2. Pass the URL https://ir.mi.com/static-files/eb0d2a5b-f600-477d-9411-34c17adb819f to the document extractor. The output is garbled text.
3. The same thing happens with https://ir.group-lotus.com/static-files/c4a4e19d-4f3f-4038-8e16-c632b4fc2833.

✔️ Expected Behavior

The document extractor returns readable text.

❌ Actual Behavior

The document cannot be extracted; the output is garbled text.

dosubot[bot] commented 2 weeks ago

A similar issue regarding garbled text was previously reported and resolved. The solution involved ensuring that the content fetched by the web crawler or document extractor is correctly interpreted as UTF-8. You can set the encoding explicitly in your code to handle this. For example, if you are using Python's requests library, you can set the encoding like this:

import requests

url = 'https://your-url-here'  # Replace with your target URL
response = requests.get(url)

# Ensure the response is in UTF-8 encoding
response.encoding = 'utf-8'

# Now you can process the response text
content = response.text
print(content)

This approach ensures that the content is correctly interpreted as UTF-8, which should help in resolving the garbled text issue [1].
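Forcing UTF-8 only helps when the response body is actually text. The URLs in this issue appear to serve PDF files, which are binary, so a text decode would not fix them. The following is a minimal sketch of a guard that checks for the PDF magic bytes before decoding. The helper names `looks_like_pdf` and `decode_text_payload` are hypothetical and are not part of Dify or `requests`.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload appears to be a PDF file."""
    # PDF files start with the marker "%PDF-". Some files carry a few
    # junk bytes first, so search the first 1024 bytes instead of byte 0.
    return b"%PDF-" in data[:1024]


def decode_text_payload(data: bytes, encoding: str = "utf-8") -> str:
    """Decode a text payload, refusing binary PDF content.

    Hypothetical helper for illustration only.
    """
    if looks_like_pdf(data):
        raise ValueError("payload is a PDF; use a PDF parser, not a text decode")
    # Undecodable bytes become U+FFFD instead of raising an exception.
    return data.decode(encoding, errors="replace")
```

With `requests`, the raw bytes are available as `response.content`, so the guard could be called as `decode_text_payload(response.content)`.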


hjlarry commented 2 weeks ago

It seems the PDF uses some kind of encryption, because when you open the PDF in your browser and copy the text, the copied text is also garbled.
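One rough way to test this hypothesis is sketched below, using only the standard library. It makes two assumptions that the thread does not confirm: that an encrypted PDF can be spotted by a plain `/Encrypt` key in its bytes, and that garbled output shows up as replacement, control, private-use, or unassigned characters. Both are heuristics, and the function names `has_encrypt_marker` and `garbled_ratio` are hypothetical.

```python
import unicodedata


def has_encrypt_marker(pdf_bytes: bytes) -> bool:
    """Heuristic: encrypted PDFs reference an /Encrypt dictionary.

    A raw byte search can give false positives, so treat the result
    as a hint, not as proof.
    """
    return b"/Encrypt" in pdf_bytes


def garbled_ratio(text: str) -> float:
    """Return the fraction of characters that look like extraction garbage."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace is not suspicious
        # U+FFFD is the replacement character; Cc = control,
        # Co = private use, Cn = unassigned code point.
        if ch == "\ufffd" or unicodedata.category(ch) in {"Cc", "Co", "Cn"}:
            suspicious += 1
    return suspicious / len(text)
```

A high `garbled_ratio` on the extractor output, whatever the cause turns out to be, would suggest the problem lies in the file itself, which matches the copy-and-paste observation above. This check only catches garbage made of such special characters; output made of ordinary but wrong characters would score 0.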