Problems indexing knowledge and accentuation documents in PT-BR

dersonbsb2022 commented 3 weeks ago

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.9.1-fix1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

When I insert a document in Brazilian Portuguese, the entire indexing process goes smoothly. The result in question-and-answer format is almost entirely correct; the only problem is with how the keywords are formed.

Portuguese words with accents are broken, for example:

portões - port es
horários - hor rios

This creates wrong keywords that hinder the search. After I correct these words, the results improve considerably.

To convert the document, I use:

Model: openai-4o-mini
Index mode: segmenting in Question & Answer format, segment using Portuguese
Embedding model: text-embedding-3-small
Retrieval: hybrid search, top k 6
Rerank model: cohere free, multilingual, v3.0

The question about the bug is:

Is it a bug in Dify? Is it a bug in OpenAI? Is it a bug in Cohere?

I don't know what steps I can take to try to solve the problem. No further information appears in the logs even when put in DEBUG mode.

✔️ Expected Behavior

that the keywords were generated correctly

Examples: <Portões> <Horários>

or that they were

and not broken

❌ Actual Behavior

The generated keywords are port es and ho rios: important letters are removed and a space is added in place of the accented letter, leaving the keywords meaningless.
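The symptom reported here (each accented letter turned into a single space) is what you get when some step replaces every non-ASCII character with a space. The sketch below only reproduces that pattern so it is easier to recognize; it is an illustration under that assumption, not Dify's, OpenAI's, or Cohere's actual code, and the function name is hypothetical.

```python
import re
import unicodedata


def replace_non_ascii_with_space(text: str) -> str:
    """Hypothetical helper: mimic a cleaning step that blanks non-ASCII characters."""
    # Compose first so that an accented letter is a single code point.
    composed = unicodedata.normalize("NFC", text)
    # Every character outside the ASCII range becomes one space.
    return re.sub(r"[^\x00-\x7f]", " ", composed)


print(replace_non_ascii_with_space("portões"))   # port es
print(replace_non_ascii_with_space("horários"))  # hor rios
```

If the broken keywords match this output exactly, that suggests the accents are lost in a text-cleaning or tokenizing step rather than in the embedding or rerank model, although this alone does not prove where.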

dosubot[bot] commented 3 weeks ago

Hello @dersonbsb2022! I'm a bot that can help you solve bugs, answer questions, and contribute. I'm here to help while you wait for a human maintainer.

I found a similar discussion regarding issues with accented characters in keyword generation, specifically in French documents ^[1]. This might indicate a broader issue with handling accented characters in Dify.

For your specific case with Portuguese documents, the problem could be related to the text-embedding-3-small model or the rerank-multilingual-v3.0 model, as these are involved in the indexing and retrieval process. Unfortunately, there isn't specific information available about how these models handle accented characters.

To address this issue, you might consider:

Checking Model Documentation: Look into the documentation for the models you're using to see if there are known limitations with accented characters.
Preprocessing Text: As a temporary workaround, preprocess your text to normalize accented characters before indexing.
Alternative Models: Experiment with different models that might handle accented characters better.
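The preprocessing workaround suggested above could look roughly like the following sketch, which uses only Python's standard `unicodedata` module. The function name `strip_accents` is hypothetical, and removing accents changes the text itself (for example "portões" becomes "portoes"), so this is a trade-off rather than a fix.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (e.g. 'õ' -> 'o')."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep everything else.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into the usual composed form.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("portões"))   # portoes
print(strip_accents("horários"))  # horarios
```

The function would be applied to the document text before upload; it does not control what a later model-driven step writes back into the segments.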

If the issue persists, it might be worth reporting it as a potential bug in the Dify repository, specifying the models and configurations you're using.

To continue talking to Dosu, mention @dosu.

crazywoola commented 3 weeks ago

https://github.com/langgenius/dify/blob/9275760599abf6df762b6f8babf7eba54554119d/docker/.env.example#L505-L508

What did you set?

If you set ETL_TYPE to dify, then I guess this will happen.
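To answer the question about which value is set, one option is to print the variable from inside the environment where the service runs. This is a minimal sketch: `ETL_TYPE` is the variable named in the comment above, while the helper `read_etl_type` is hypothetical and running it inside the relevant container is an assumption about where the setting is read.

```python
import os
from typing import Mapping, Optional


def read_etl_type(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Hypothetical helper: return the ETL_TYPE value, or None if it is unset or empty."""
    value = env.get("ETL_TYPE")
    return value.strip() if value and value.strip() else None


print("ETL_TYPE =", read_etl_type())
```

If this prints `None` or a different value than the one written in the `.env` file, the setting is probably not reaching the process.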

dersonbsb2022 commented 3 weeks ago

First of all, thank you for the wonderful project and for your response. @crazywoola Sorry for the delay in responding, but during this time I changed all possible settings. I even forked the EasyPanel project to configure Dify with Unstructured, but without success. Although the service started and I uploaded several files (docx, pdf), everything continued to be processed by Dify. I don't know what I did wrong.

I will follow your suggestion and pre-process the documents before sending them.

Thank you very much.

dersonbsb2022 commented 2 weeks ago

Just to document: even though I pre-process the document before uploading it to the knowledge base, the task that runs afterwards (I believe it is OpenAI) puts all the accents back, and the problem continues.
