langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Problems indexing knowledge and accentuation documents in PT-BR #9298

Open dersonbsb2022 opened 3 weeks ago

dersonbsb2022 commented 3 weeks ago

Dify version

0.9.1-fix1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

When I insert a document in Brazilian Portuguese, the entire indexing process runs smoothly. The result in question-and-answer format is almost entirely correct; the only problem is the generation of the keywords.

Words with accents in Portuguese come out broken, for example:

  - portões → port es
  - horários → hor rios

This creates wrong keywords that hinder the search. After I correct these words, the results improve considerably.

To convert the document, I use:

  - Model: openai-4o-mini
  - Index mode: segmenting in Question & Answer format
  - Segment using: Portuguese
  - Embedding model: text-embedding-3-small
  - Retrieval: hybrid search, top k 6
  - Rerank model: cohere (free), multilingual v3.0

My question about the bug is:

Is it a bug in Dify? Is it a bug in OpenAI? Is it a bug in Cohere?

I don't know what steps I can take to try to solve the problem. No further information appears in the logs even when put in DEBUG mode.

✔️ Expected Behavior

that the keywords were generated correctly

Examples: <Portões> <Horários>

or that they were

and not broken

❌ Actual Behavior

The generated keywords are "port es" and "ho rios": important letters are removed and a space is added in place of the accented letter, leaving the keywords meaningless.
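
The symptom described here (the accented letter disappears and a space takes its place) is what an ASCII-only character filter would produce. The following is a minimal sketch of that failure mode, not code taken from Dify, OpenAI, or Cohere; the function names `ascii_only_clean` and `unicode_aware_clean` are hypothetical, and whether any component in the pipeline actually does this is an assumption that would need to be confirmed.

```python
import re
import unicodedata


def ascii_only_clean(text: str) -> str:
    """Hypothetical cleaner: replaces every run of non-ASCII-alphanumeric
    characters with a single space. This reproduces the reported symptom."""
    # Compose characters first so "õ" is one code point, not "o" + combining tilde.
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()


def unicode_aware_clean(text: str) -> str:
    """Hypothetical alternative: \\w is Unicode-aware for str patterns in
    Python 3, so composed accented letters are kept."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"[^\w]+", " ", text).strip()


if __name__ == "__main__":
    for word in ("portões", "horários"):
        print(word, "->", repr(ascii_only_clean(word)), "vs", repr(unicode_aware_clean(word)))
```

With the ASCII-only filter, "portões" becomes "port es" and "horários" becomes "hor rios", matching the keywords shown above; the Unicode-aware version leaves both words intact.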
dosubot[bot] commented 3 weeks ago

Hello @dersonbsb2022! I'm a bot that can help you solve bugs, answer questions, and contribute. I'm here to help while you wait for a human maintainer.

I found a similar discussion regarding issues with accented characters in keyword generation, specifically in French documents [1]. This might indicate a broader issue with handling accented characters in Dify.

For your specific case with Portuguese documents, the problem could be related to the text-embedding-3-small model or the rerank-multilingual-v3.0 model, as these are involved in the indexing and retrieval process. Unfortunately, there isn't specific information available about how these models handle accented characters.

To address this issue, you might consider:

  1. Checking Model Documentation: Look into the documentation for the models you're using to see if there are known limitations with accented characters.
  2. Preprocessing Text: As a temporary workaround, preprocess your text to normalize accented characters before indexing.
  3. Alternative Models: Experiment with different models that might handle accented characters better.

If the issue persists, it might be worth reporting it as a potential bug in the Dify repository, specifying the models and configurations you're using.
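
The preprocessing workaround in step 2 could be sketched as follows. This is only an illustration that uses Python's standard `unicodedata` module to strip accents before upload; the helper name `strip_accents` is hypothetical and is not part of Dify.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (e.g. õ -> o, á -> a)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base letters.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into the canonical composed form.
    return unicodedata.normalize("NFC", without_marks)


if __name__ == "__main__":
    print(strip_accents("portões horários"))
```

This trades correct Portuguese spelling for text that survives ASCII-only processing, so it only helps if later steps do not reintroduce the accents.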

To continue talking to Dosu, mention @dosu.

crazywoola commented 3 weeks ago

https://github.com/langgenius/dify/blob/9275760599abf6df762b6f8babf7eba54554119d/docker/.env.example#L505-L508

What did you set?

If you set ETL_TYPE to dify, then I guess this will happen.

dersonbsb2022 commented 3 weeks ago

First of all, thank you for the wonderful project and for your response. @crazywoola Sorry for the delay in responding, but during this time I changed all possible settings. I even forked the EasyPanel project to configure Dify with Unstructured, but without success. Although the service started and I uploaded several files (docx, pdf), everything continued to be processed by Dify. I don't know what I did wrong.

I will follow your suggestion and pre-process the documents before sending them.

Thank you very much.

dersonbsb2022 commented 2 weeks ago

Just to document: even though I pre-process the document before uploading it to the knowledge base, the task that runs (I believe it is OpenAI) restores all the accents, and the problem continues.