Keywords are not extracted in Farsi (Persian) documents

MitraSafarinejad commented 6 months ago

Self Checks

[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] Please do not modify this template :) and fill in all the required fields.

Provide a description of the help you need

Hello, I work with Farsi documents. After I upload my documents, I check each one, and at the bottom of each chunk I cannot see any keywords (see the picture below) unless the document contains some English words or numbers. Does Dify support any language other than English or Chinese for extracting keywords?

[Screenshot: Untitled]

dosubot[bot] commented 6 months ago

🤖

Hello again, @MitraSafarinejad! Glad to see you here again. I hope you are doing well.

Based on the information provided in the repository, it appears that Dify currently does not support languages other than English or Chinese for extracting keywords from documents. This is indicated in the api/constants/languages.py file, where Farsi (or Persian) is not listed among the supported languages.

The keyword extraction process in Dify seems to be handled by the Jieba library, as suggested by the presence of a file named stopwords.py within a directory path that includes keyword/jieba. This file is likely responsible for defining stopwords, which are commonly filtered out during the keyword extraction process. However, the Jieba library is primarily designed for processing Chinese text. You can find more details in the api/core/rag/datasource/keyword/jieba/stopwords.py file.

Unfortunately, without support for Farsi in the current implementation, keywords from Farsi documents cannot be extracted. If you have any further questions or need additional clarification, please feel free to ask.
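The symptom reported above (keywords appear only for English words or numbers) is what one would expect from a tokenizer whose word pattern covers only CJK ideographs, Latin letters and digits. The following is a minimal sketch of that effect, not Dify's or Jieba's actual code: `WORD_PATTERN` and `extract_keywords` are hypothetical names chosen for illustration, and the minimum token length of 2 is an assumption.

```python
import re
from collections import Counter
from typing import List

# Hypothetical word pattern (an assumption for illustration only):
# it groups CJK ideographs, Latin letters and ASCII digits into tokens.
# Persian letters (Arabic script) fall outside every range listed here.
WORD_PATTERN = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]+")


def extract_keywords(text: str, top_k: int = 5) -> List[str]:
    """Return the top_k most frequent tokens that are at least 2 characters long."""
    tokens = [t.lower() for t in WORD_PATTERN.findall(text) if len(t) >= 2]
    return [word for word, _ in Counter(tokens).most_common(top_k)]


farsi_only = "این یک سند فارسی است"
mixed = "این سند درباره Dify نسخه 2024 است"

print(extract_keywords(farsi_only))  # → []
print(extract_keywords(mixed))       # → ['dify', '2024']
```

With a pattern like this, a Farsi-only chunk yields no tokens at all, while a chunk that mixes in Latin words or digits yields only those.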


JohnJyong commented 6 months ago

Unfortunately, Jieba currently does not support Persian word segmentation. I think you can try the high-quality mode in the Knowledge module. It supports vector similarity matching, full-text retrieval and Rerank, which will make your retrieval more accurate. Of course, we will also introduce ES and support Persian word segmentation in the future, so stay tuned. @MitraSafarinejad
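Until Persian segmentation is available, a rough stopgap outside of Dify could be a frequency count over tokens matched by a Persian-aware pattern. This is only a sketch under stated assumptions: `PERSIAN_WORD`, `STOPWORDS` and `extract_persian_keywords` are hypothetical names, the character class is an approximation of the Persian alphabet, the stopword list is a tiny illustrative sample, and simple splitting is not real morphological segmentation.

```python
import re
from collections import Counter
from typing import List

# Approximate Persian letter set (an assumption, not exhaustive):
# basic Arabic letters, the Persian-specific letters پ چ ژ ک گ ی,
# and the zero-width non-joiner (U+200C) that occurs inside Persian words.
PERSIAN_WORD = re.compile(
    r"[\u0621-\u063A\u0641-\u064A\u067E\u0686\u0698\u06A9\u06AF\u06CC\u200C]+"
)

# Tiny illustrative stopword sample; a real list would be much longer.
STOPWORDS = {"این", "یک", "است", "و", "در", "به", "از", "که"}


def extract_persian_keywords(text: str, top_k: int = 5) -> List[str]:
    """Return the top_k most frequent Persian tokens, skipping stopwords."""
    # Strip stray zero-width non-joiners from token edges.
    tokens = [t.strip("\u200c") for t in PERSIAN_WORD.findall(text)]
    tokens = [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(top_k)]


sample = "کتاب خوب است. این کتاب درباره تاریخ ایران است"
keywords = extract_persian_keywords(sample, top_k=3)
print(keywords)
```

Ties are returned in order of first appearance, so the most repeated word in `sample` comes first, followed by the earliest remaining words.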

dosubot[bot] commented 5 months ago

Hi, @MitraSafarinejad,

I'm helping the team manage their backlog and am marking this issue as stale. From what I understand, you reported that Dify does not extract keywords from Farsi documents unless there are English words or numbers present. It was mentioned that Dify currently does not support Farsi for keyword extraction, and it was suggested to use the high-quality mode in the Knowledge module as an alternative.

Could you please confirm if this issue is still relevant to the latest version of the repository? If it is, please let the team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

MitraSafarinejad commented 5 months ago

Hello, this issue is still there. Dify does not support Farsi for keyword extraction. I used the high-quality mode in the Knowledge module as an alternative, but it did not solve my problem. I'd appreciate it if you could add support for the Farsi language.

crazywoola commented 4 months ago

Hi there, we will try to fix this.

langgenius / dify