langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

When I use a Japanese PDF as knowledge, the text is garbled. #6012

Closed mihit closed 1 week ago

mihit commented 2 months ago

Self Checks

Dify version

0.6.11

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Prepare a Japanese PDF that contains "の". Use it as knowledge.

✔️ Expected Behavior

The text is never garbled.

❌ Actual Behavior

Bah, "の" is garbled as "䛾". Other characters appear to be affected as well.

dosubot[bot] commented 2 months ago

To prevent text from becoming garbled when using a Japanese PDF as a knowledge base in Dify version 0.6.11, ensure that the text extraction process supports Japanese characters. Here are some steps to help:

  1. PDF Encoding: Ensure the PDF is properly encoded with Japanese text.
  2. Text Extraction Tool: Verify that the text extraction tool or library used by Dify supports Japanese characters.

Dify supports text extraction from various document formats, including PDFs. If issues persist, you might need to check the encoding settings or use a different text extraction tool that better supports Japanese text [1].
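One way to check whether an extraction tool handles Japanese text correctly is to scan its output for characters that are unlikely to appear in ordinary Japanese. The sketch below is a heuristic only, not part of Dify: it assumes that characters from the CJK Unified Ideographs Extension A block (U+3400 to U+4DBF), such as the "䛾" reported above, are rare enough in everyday Japanese that their presence hints at a bad character mapping. The function name `find_suspicious_chars` is hypothetical, and the input is assumed to be text already extracted from the PDF by whatever tool is in use.

```python
# Heuristic check for garbled Japanese text (illustrative sketch, not Dify code).
# Assumption: CJK Extension A characters are rare in ordinary Japanese prose,
# so finding them in extracted text suggests a faulty character mapping.

EXT_A_START = 0x3400  # first code point of CJK Unified Ideographs Extension A
EXT_A_END = 0x4DBF    # last code point of the same block


def find_suspicious_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, code point label) for each Extension A character."""
    hits = []
    for index, char in enumerate(text):
        if EXT_A_START <= ord(char) <= EXT_A_END:
            hits.append((index, char, f"U+{ord(char):04X}"))
    return hits


if __name__ == "__main__":
    # Replace this sample with the text returned by the extraction tool.
    extracted = "メロス\u46fe友"
    for index, char, label in find_suspicious_chars(extracted):
        print(f"position {index}: {char!r} ({label})")
```

An empty result does not prove the extraction is correct, and a non-empty one is only a hint, since a document may legitimately contain rare kanji. Comparing the flagged positions against the original PDF shows whether the characters were really remapped.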

Additionally, you can adjust configuration settings within Dify to improve text extraction accuracy for Japanese PDFs. Specifically, you can use custom segmentation and preprocessing rules:

For enhanced preprocessing capabilities, consider using optional ETL solutions like Dify ETL or Unstructured ETL [2].


JohnJyong commented 1 month ago

Hi @mihit, after testing, we found that some Japanese characters may be recognized incorrectly. If it is convenient, could you send us your document for testing? Thanks in advance~

mihit commented 1 month ago

太宰治 走れメロス.pdf

@JohnJyong Thank you for the reply. My file is private, so I can't share it. Instead, I am sending another Japanese PDF, a literary work in the public domain, for testing. Hope this helps!