vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.58k stars 598 forks

FlashText Fails to Recognize Unicode Combined Letters in Keyword Matching #143

Closed: iamnotagentleman closed this issue 6 months ago

iamnotagentleman commented 10 months ago

Content:

I've encountered an issue with FlashText where it does not recognize or match Unicode combining character sequences. The specific test case is the sequence \u0069\u0307 (the letter i followed by a combining dot above), which renders as 'i̇' (a dotted i). Despite adding this sequence as a keyword, FlashText finds no matches in the text.

Steps to Reproduce:

  1. Import the FlashText library and instantiate the KeywordProcessor:

    from flashtext import KeywordProcessor
    keyword_processor = KeywordProcessor()
  2. Add the combined Unicode character as a keyword:

    keyword_processor.add_keyword("\u0069\u0307", "i")
    # Alternative attempt: keyword_processor.add_keyword("i̇", "i")
  3. Apply the keyword processor to a sample string containing the character:

    keywords_found = keyword_processor.extract_keywords('Deniz Çeli̇k')
  4. Observe the output:

    []
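
When debugging a report like this, it can help to confirm first which code points the keyword and the sample text actually contain, since a string pasted from another source may not be encoded the way it looks. The following is a minimal sketch using only the standard library `unicodedata` module. The helper name `describe_codepoints` is hypothetical, and the escaped spelling of the sample string is an assumption about how the text in step 3 is encoded.

```python
import unicodedata


def describe_codepoints(text):
    """Return one 'U+XXXX NAME' string per code point in text (hypothetical helper)."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


keyword = "\u0069\u0307"
# Assumed encoding of the sample from step 3, written with explicit escapes.
sample = "Deniz \u00c7el\u0069\u0307k"

# List the code points that make up the keyword.
for line in describe_codepoints(keyword):
    print(line)

# Plain substring check: True only if the sample contains exactly
# the same code point sequence as the keyword.
print(keyword in sample)
```

If the substring check succeeds but the keyword processor still returns nothing, the mismatch is in how the library scans the text rather than in the raw strings; this sketch does not determine which.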

Expected Behavior:

The keyword processor should recognize the combined Unicode character \u0069\u0307 (i̇) in the string and match it accordingly.

Actual Behavior:

No matches are found, and the output is an empty list [].

Additional Information:

Dobatymo commented 10 months ago

Maybe they are normalized differently?

iamnotagentleman commented 6 months ago

As @Dobatymo mentioned, it's normalized differently, so I will close this issue.
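
For readers landing on this thread, the normalization point can be illustrated with the standard library alone. This is a sketch, not a confirmed fix for FlashText: it only shows that two visually identical strings can differ at the code point level until both are put into the same Unicode normalization form. The helper name `normalize_text` is hypothetical, and the choice of NFC is an assumption; any form works as long as keywords and text use the same one.

```python
import unicodedata


def normalize_text(text, form="NFC"):
    """Normalize text to one Unicode form (hypothetical helper; NFC is an assumed default)."""
    return unicodedata.normalize(form, text)


precomposed = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE, one code point
decomposed = "I\u0307"   # capital I followed by COMBINING DOT ABOVE, two code points

# The raw strings look the same when rendered but do not compare equal.
print(precomposed == decomposed)

# After normalizing both sides to the same form, they compare equal.
print(normalize_text(precomposed) == normalize_text(decomposed))

# Lowercasing the precomposed capital yields the two-code-point sequence
# from this issue, so the string length changes under .lower().
print(len(precomposed), len(precomposed.lower()))
```

In practice this would mean passing both the keywords and the input text through the same normalization step before handing them to any matcher.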