langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Swedish characters are replaced by spaces in knowledge keywords. #7239

Open hugoake opened 1 month ago

hugoake commented 1 month ago


Dify version

0.6.16

Cloud or Self Hosted

Cloud

Steps to reproduce

  1. Add a new Knowledge in Swedish, for example swedish_knowledge.txt.
  2. Chunk settings: Automatic; Index mode: Economical (same issue with High Quality); Retrieval setting: any.
  3. After saving, press "Go to document", then open "Retrieval testing" in the left sidebar.
  4. Enter, for example, "Ångloksförare" as the source text and press "Testing".
  5. Click the retrieved paragraph.
  6. See the keywords at the bottom of the popup.

✔️ Expected Behavior

I expected, for example, "Ångloksförare" to be a keyword.

❌ Actual Behavior

"Ångloksförare" is not present, but "ngloksf" and "rare" are.

In general, it seems that Swedish letters are treated as spaces when generating keywords.

dosubot[bot] commented 1 month ago

Hey @hugoake! I'm here to help you with any bugs, questions, or contributions you have for the repository. Let me know how I can assist you!

The issue you're experiencing with Swedish characters being replaced by spaces during keyword extraction is likely related to how the JiebaKeywordTableHandler class processes special characters and non-ASCII characters. The _expand_tokens_with_subtokens method uses a regular expression (\w+) to find sub-tokens within each keyword, which may not correctly handle Swedish characters.

Here's the relevant part of the code:

import re

# STOPWORDS is the stopword set shipped alongside this handler in the
# Dify codebase.
from core.rag.datasource.keyword.jieba.stopwords import STOPWORDS


def _expand_tokens_with_subtokens(self, tokens: set[str]) -> set[str]:
    """Get subtokens from a list of tokens, filtering for stopwords."""
    results = set()
    for token in tokens:
        results.add(token)
        # \w+ splits each keyword into alphanumeric runs; anything the
        # pattern rejects acts as a separator.
        sub_tokens = re.findall(r"\w+", token)
        if len(sub_tokens) > 1:
            results.update({w for w in sub_tokens if w not in STOPWORDS})

    return results

The regular expression \w+ matches any alphanumeric character and underscores, but it may not include Swedish characters like "Å", "ä", "ö". This could be causing the issue where these characters are treated as spaces.

To address this, you might need to modify the regular expression to include Swedish characters explicitly or use a different approach to handle non-ASCII characters more effectively.
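
A minimal sketch of the kind of change Dosu suggests, using an explicit character class instead of \w; the pattern name and test strings below are illustrative only, not code from the Dify repository:

import re

# Hypothetical replacement pattern: besides ASCII word characters,
# explicitly accept the Latin-1 letter ranges that cover å, ä, ö, etc.
SUBTOKEN_PATTERN = re.compile(r"[0-9A-Za-z_À-ÖØ-öø-ÿ]+")

print(SUBTOKEN_PATTERN.findall("Ångloksförare tågstation"))
# ['Ångloksförare', 'tågstation']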

For now, you can consider this a known limitation of the current version (v0.6.16) of Dify [1].

To continue talking to Dosu, mention @dosu.

hugoake commented 3 weeks ago

The problem does not seem to be what Dosu suggested, but it is in the same file. Jieba seems to only work with English and Chinese by default :/
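
A minimal standalone sketch of the behavior described above, assuming keyword extraction goes through jieba.analyse.extract_tags (as the extract_keywords method in the same file does); the test string and topK value are illustrative:

import jieba.analyse

# jieba's default tokenizer pattern (re_han_default) only accepts Han
# characters, ASCII letters, digits and a few symbols, so letters such
# as "Å" and "ö" behave like delimiters before TF-IDF ever runs.
print(jieba.analyse.extract_tags("Ångloksförare", topK=5))
# prints ['ngloksf', 'rare'] -- the fragments reported in this issue

Because the split happens inside jieba's tokenizer, patching the \w+ regex in _expand_tokens_with_subtokens cannot recover the original word; the Swedish letters are already gone by the time that method runs.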