langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Swedish characters are replaced by spaces in knowledge keywords. #7239

Open hugoake opened 1 month ago

hugoake commented 1 month ago


Dify version

0.6.16

Cloud or Self Hosted

Cloud

Steps to reproduce

  1. Add a new Knowledge in Swedish, for example swedish_knowledge.txt.
  2. Chunk settings: Automatic; Index mode: Economical (same issue with High Quality); Retrieval setting: any.
  3. After saving, press "Go to document", then open "Retrieval testing" in the left sidebar.
  4. Enter, for example, "Ångloksförare" as the source text and press "Testing".
  5. Click the retrieved paragraph.
  6. See the keywords at the bottom of the popup.

✔️ Expected Behavior

I expected, for example, "Ångloksförare" to be a keyword.

❌ Actual Behavior

"Ångloksförare" is not present, but "ngloksf" and "rare" are.

In general, it seems that Swedish letters are treated as spaces when generating keywords.

dosubot[bot] commented 1 month ago

Hey @hugoake! I'm here to help you with any bugs, questions, or contributions you have for the repository. Let me know how I can assist you!

The issue you're experiencing with Swedish characters being replaced by spaces during keyword extraction is likely related to how the JiebaKeywordTableHandler class processes special characters and non-ASCII characters. The _expand_tokens_with_subtokens method uses a regular expression (\w+) to find sub-tokens within each keyword, which may not correctly handle Swedish characters.

Here's the relevant part of the code:

import re

# STOPWORDS is the stopword set shipped alongside this handler in the
# Dify codebase.
from core.rag.datasource.keyword.jieba.stopwords import STOPWORDS


def _expand_tokens_with_subtokens(self, tokens: set[str]) -> set[str]:
    """Get subtokens from a list of tokens, filtering for stopwords."""
    results = set()
    for token in tokens:
        results.add(token)
        # \w+ splits each keyword into alphanumeric runs; anything the
        # pattern rejects acts as a separator.
        sub_tokens = re.findall(r"\w+", token)
        if len(sub_tokens) > 1:
            results.update({w for w in sub_tokens if w not in STOPWORDS})

    return results

The regular expression \w+ matches any alphanumeric character and underscores, but it may not include Swedish characters like "Å", "ä", "ö". This could be causing the issue where these characters are treated as spaces.

To address this, you might need to modify the regular expression to include Swedish characters explicitly or use a different approach to handle non-ASCII characters more effectively.
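
A minimal sketch of the kind of change Dosu suggests, using an explicit character class instead of \w; the pattern name and test strings below are illustrative only, not code from the Dify repository:

import re

# Hypothetical replacement pattern: besides ASCII word characters,
# explicitly accept the Latin-1 letter ranges that cover å, ä, ö, etc.
SUBTOKEN_PATTERN = re.compile(r"[0-9A-Za-z_À-ÖØ-öø-ÿ]+")

print(SUBTOKEN_PATTERN.findall("Ångloksförare tågstation"))
# ['Ångloksförare', 'tågstation']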

For now, you can consider this a known limitation of the current version (v0.6.16) of Dify [1].

To continue talking to Dosu, mention @dosu.

hugoake commented 3 weeks ago

The problem does not seem to be what Dosu suggested, but it is in the same file. Jieba seems to only work with English and Chinese by default :/
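
A minimal standalone sketch of the behavior described above, assuming keyword extraction goes through jieba.analyse.extract_tags (as the extract_keywords method in the same file does); the test string and topK value are illustrative:

import jieba.analyse

# jieba's default tokenizer pattern (re_han_default) only accepts Han
# characters, ASCII letters, digits and a few symbols, so letters such
# as "Å" and "ö" behave like delimiters before TF-IDF ever runs.
print(jieba.analyse.extract_tags("Ångloksförare", topK=5))
# prints ['ngloksf', 'rare'] -- the fragments reported in this issue

Because the split happens inside jieba's tokenizer, patching the \w+ regex in _expand_tokens_with_subtokens cannot recover the original word; the Swedish letters are already gone by the time that method runs.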