Open dersonbsb2022 opened 3 weeks ago
Hello @dersonbsb2022! I'm a bot that can help you solve bugs, answer questions, and contribute. I'm here to help while you wait for a human maintainer.
I found a similar discussion regarding issues with accented characters in keyword generation, specifically in French documents [1]. This might indicate a broader issue with handling accented characters in Dify.
For your specific case with Portuguese documents, the problem could be related to the text-embedding-3-small model or the rerank-multilingual-v3.0 model, as these are involved in the indexing and retrieval process. Unfortunately, there isn't specific information available about how these models handle accented characters.
To address this issue, you might consider:
If the issue persists, it might be worth reporting it as a potential bug in the Dify repository, specifying the models and configurations you're using.
What did you set?
If you set ETL_TYPE to dify, then I guess this will happen.
First of all, thank you for the wonderful project and for your response. @crazywoola Sorry for the delay in responding, but during this time I changed every setting I could. I even forked the EasyPanel project to configure Dify with Unstructured, but without success. The service started and I uploaded several files (docx, pdf), yet everything continued to be processed by Dify. I don't know what I did wrong.
I will follow your suggestion and pre-process the documents before sending them.
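For anyone trying the same pre-processing, here is a minimal sketch of one possible step, assuming the source files may contain decomposed accents (a base letter followed by a combining mark). The function name normalize_text is hypothetical and this is not part of Dify; it only uses Python's standard unicodedata module to rewrite the text in NFC form before upload.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in NFC form so each accented letter is a single code point."""
    return unicodedata.normalize("NFC", text)


# "o" followed by a combining tilde (U+0303) is the decomposed spelling of "õ".
decomposed = "porto\u0303es"
composed = normalize_text(decomposed)

print(len(decomposed), len(composed))  # 8 7
print(composed == "port\u00f5es")      # True
```

This only rules out one cause on the document side. It does not change how keywords are extracted afterwards.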
Thank you very much.
Just to document: even when I pre-process the document before uploading it to the knowledge base, the task that runs (I believe it is OpenAI) returns all the accents, and the problem continues.
Self Checks
Dify version
0.9.1-fix1
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
When inserting a document in Brazilian Portuguese, the entire indexing process goes smoothly. The result in question and answer format is almost all correct; the only problem is with how the keywords are formed.
Words with accents in Portuguese are split at the accented character. Examples:
portões -> port es
horários -> hor rios
This creates wrong keywords that hinder the search. After I correct these words, the results improve considerably.
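The symptom can be reproduced outside Dify with a short sketch. It assumes, purely for illustration, that some cleaning step replaces every non-ASCII character with a space before keywords are extracted. Whether Dify, OpenAI, or Cohere actually does this is not confirmed here, and the names clean_ascii and clean_unicode are hypothetical.

```python
import re

# Hypothetical ASCII-only cleaner: any run of characters outside
# A-Z, a-z, 0-9 is replaced by a single space.
ASCII_ONLY = re.compile(r"[^A-Za-z0-9]+")

# Unicode-aware alternative: for str patterns in Python 3, \w also
# matches accented letters such as "õ" and "á".
UNICODE_AWARE = re.compile(r"[^\w]+")


def clean_ascii(text: str) -> str:
    return ASCII_ONLY.sub(" ", text).strip()


def clean_unicode(text: str) -> str:
    return UNICODE_AWARE.sub(" ", text).strip()


# The words are written with escapes so the composed (NFC) form is explicit.
print(clean_ascii("port\u00f5es"))     # port es
print(clean_ascii("hor\u00e1rios"))    # hor rios
print(clean_unicode("port\u00f5es"))   # portões
```

If the broken keywords match the first two outputs exactly, an ASCII-only cleaning or tokenizing step somewhere in the pipeline is a plausible place to look.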
To convert the document, I use openai-4o-mini with these settings:
Index mode: Segmenting in Question & Answer format
Segment using: Portuguese
Embedding model: text-embedding-3-small
Retrieval: Hybrid search, top k 6
Rerank model: cohere free, multilingual v3.0
The question about the bug is:
Is it a bug in Dify? Is it a bug in OpenAI? Is it a bug in Cohere?
I don't know what steps I can take to try to solve the problem. No further information appears in the logs even when put in DEBUG mode.
✔️ Expected Behavior
that the keywords were generated correctly
Examples: <Portões> <Horários>
or that they were