Open hugoake opened 1 month ago
Hey @hugoake! I'm here to help you with any bugs, questions, or contributions you have for the repository. Let me know how I can assist you!
The issue you're experiencing, where Swedish characters are replaced by spaces during keyword extraction, is likely related to how the JiebaKeywordTableHandler class processes non-ASCII characters. Its _expand_tokens_with_subtokens method uses a regular expression (\w+) to find sub-tokens within each keyword, which may not correctly handle Swedish characters.
Here's the relevant part of the code:
    def _expand_tokens_with_subtokens(self, tokens: set[str]) -> set[str]:
        """Get subtokens from a list of tokens., filtering for stopwords."""
        results = set()
        for token in tokens:
            results.add(token)
            sub_tokens = re.findall(r"\w+", token)
            if len(sub_tokens) > 1:
                results.update({w for w in sub_tokens if w not in list(STOPWORDS)})
        return results
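The regular expression \w+ matches runs of word characters (letters, digits, and underscores). Note that in Python 3, \w in a str pattern is Unicode-aware by default and does match "Å", "ä", and "ö"; only ASCII-only matching (for example, with the re.ASCII flag or a character class such as [a-zA-Z0-9_]) excludes them. If this regex is not the cause, the Swedish letters are probably being dropped by an ASCII-restricted rule elsewhere in the extraction path, which would explain why they behave like separators.
To address this, whichever pattern splits on these letters would need to accept non-ASCII word characters, either by listing the Swedish letters explicitly or, more generally, by using Unicode-aware matching.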
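As a quick way to check the regex behavior, the following minimal sketch compares Unicode-aware and ASCII-only matching of \w+ on the word from this report. It uses only Python's standard re and unicodedata modules; it is not Dify's code, and the variable names are illustrative.

```python
import re
import unicodedata

# Normalize to NFC so "Å" and "ö" are single precomposed code points.
word = unicodedata.normalize("NFC", "Ångloksförare")

# Default behavior for str patterns in Python 3: \w is Unicode-aware,
# so the whole word is matched as one token.
unicode_tokens = re.findall(r"\w+", word)

# ASCII-only matching: \w is restricted to [a-zA-Z0-9_], so "Å" and "ö"
# act as separators and the word is split around them.
ascii_tokens = re.findall(r"\w+", word, flags=re.ASCII)

print(unicode_tokens)
print(ascii_tokens)  # ['ngloksf', 'rare']
```

The ASCII-only result reproduces the "ngloksf" and "rare" fragments reported in this issue, which suggests looking for an ASCII-restricted pattern somewhere in the keyword extraction path.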
For now, you can consider this as a known limitation in the current version (v0.6.16) of Dify [1].
Self Checks
Dify version
0.6.16
Cloud or Self Hosted
Cloud
Steps to reproduce
✔️ Expected Behavior
I expected, for example, "Ångloksförare" to be a keyword.
❌ Actual Behavior
"Ångloksförare" is not present, but "ngloksf" and "rare" are.
In general, it seems that Swedish letters are treated as spaces when generating keywords.