vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.6k stars 601 forks

Exact Match #109

Open AlpUygur opened 4 years ago

AlpUygur commented 4 years ago

Hi,

I am using flashtext to search some documents for 694 bad words, so that I can tag whether each document contains bad language. But I need exact matches, because some harmless words contain a bad word inside them. How can I restrict the search to exact matches?

thakur-nandan commented 4 years ago

Hi @AlpUygur,

Just add the bad words to the KeywordProcessor using the add_keyword method, and make sure case_sensitive=True is set. I hope this solves your issue.

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Bad word 1')
>>> keyword_processor.add_keyword('Bad word 2')
>>> keywords_found = keyword_processor.extract_keywords('I have Bad word 1 and Bad word 2.')
>>> keywords_found
>>> # ['Bad word 1', 'Bad word 2']

Kind Regards, Nandan Thakur

AlpUygur commented 4 years ago

Hello, thanks for your answer, but it didn't work in my case.

When I try to add the words in a for loop, it says

"keyword_processor.add_keyword(content[i])
TypeError: list indices must be integers or slices, not str"

and I do not want to add all 694 of them by hand.
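The loop that raised this error is not shown in the thread, so the following is only a guess at its shape: a `for i in content` loop yields the strings themselves, and using such a string as an index produces exactly this TypeError. The list `content` and its entries are hypothetical, and the flashtext call appears only in a comment so the sketch runs without the library.

```python
# Hypothetical stand-in for the user's list of 694 bad words.
content = ["bad word 1", "bad word 2"]

# Assumed shape of the failing loop: `i` is already a list item (a str),
# so content[i] tries to index the list with a string.
try:
    for i in content:
        content[i]
except TypeError as exc:
    error_message = str(exc)

# Iterating over the items directly avoids the index lookup entirely.
added = []
for word in content:
    added.append(word)  # with flashtext: keyword_processor.add_keyword(word)

print(error_message)
print(added)
```

If the words live in a file with one keyword per line, the `add_keyword_from_file` call used later in this thread avoids the loop altogether.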

vi3k6i5 commented 4 years ago

can you share some sample which fails ?

AlpUygur commented 4 years ago

For example, I am looking for "am" in the text. It finds "am" when there is "cam" in the text.

vi3k6i5 commented 4 years ago

This should never happen, can you pick that line and make a working example and share that.

iwpnd commented 4 years ago

@AlpUygur this only happens when the "c" in "cam" is, for whatever reason, not part of the non_word_boundaries. Depending on the character script of your input text, this can happen.

import string
non_word_boundaries = set(string.digits + string.ascii_letters + '_')

Then check if that "c" is in non_word_boundaries.

If it is not, you have to add the missing characters manually to the non_word_boundaries of your KeywordProcessor instance via add_non_word_boundary().
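The check described above can be sketched with the standard library alone. The function `whole_word_hits` below is a deliberately simplified model of whole-word matching, not flashtext's actual trie-based implementation, and the sample word "akşamüzeri" is used purely for illustration. It shows why a keyword is reported inside a longer word when the neighbouring characters are missing from the word-character set.

```python
import string
import unicodedata

# Same default set as in the snippet above.
ascii_word_chars = set(string.digits + string.ascii_letters + "_")


def whole_word_hits(keyword, text, word_chars):
    """Return start offsets where keyword occurs with no word character on either side."""
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        before_ok = start == 0 or text[start - 1] not in word_chars
        after_ok = end == len(text) or text[end] not in word_chars
        if before_ok and after_ok:
            hits.append(start)
        start = text.find(keyword, start + 1)
    return hits


# Normalise to NFC so accented letters are single code points.
text = unicodedata.normalize("NFC", "akşamüzeri")

# Characters of the text that the default set treats as word boundaries.
outside = sorted(ch for ch in set(text) if ch not in ascii_word_chars)

hits_default = whole_word_hits("am", text, ascii_word_chars)
# Extend the set with the two missing letters: "\u015f" is ş, "\u00fc" is ü.
hits_extended = whole_word_hits("am", text, ascii_word_chars | {"\u015f", "\u00fc"})
# "c" is an ASCII letter, so "am" inside "cam" is not a whole-word hit.
hits_cam = whole_word_hits("am", "cam", ascii_word_chars)

print(outside, hits_default, hits_extended, hits_cam)
```

In this model, "am" is only reported while the letters on either side of it are absent from the set; once they are added, the match disappears.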

AlpUygur commented 4 years ago

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")

def isBad(text, keyword_processor):
    keywords_found = keyword_processor.extract_keywords(text)
    if len(keywords_found) > 0:
        print(keywords_found)
        return True
    return False

with open(r"texts.txt",'r') as f:
    text = f.read()

print(isBad(text,keyword_processor))

Output: ['am'] True

texts.txt badwords.txt

The text file and the bad words file are attached. I looked through the text file and the standalone word "am" does not occur in it, but flashtext still finds it. "am" only appears inside other words.

@iwpnd

iwpnd commented 4 years ago

Can’t be bothered. Re-read my last comment and read up on how flashtext treats word boundaries.

AlpUygur commented 4 years ago

Adding a non-word boundary did not change anything.

@iwpnd

iwpnd commented 4 years ago

@AlpUygur Just my luck then I guess. :P

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")
text = "akşamüzeri"
keyword_processor.extract_keywords(text)
>> ["am"]

Changing the non-word boundaries:

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")
keyword_processor.non_word_boundaries.update(["ş", "ü"])

text = "akşamüzeri"
keyword_processor.extract_keywords(text)
>> []

thakur-nandan commented 4 years ago

@AlpUygur maybe try uninstalling and reinstalling flashtext?
I can't find a reason why changing the non-word boundaries would make no difference. Follow the steps mentioned by @iwpnd.

Thanks @iwpnd for the clear implementation :)

Kind Regards, Nandan Thakur

AlpUygur commented 4 years ago

Thanks for the implementation. It is very clear. In the end, though, I solved my problem by changing the encoding used to read the text file to utf-8. Because the bad words are stored as UTF-8, the text file must be read as UTF-8 as well. I did not notice this difference for a long time, sorry :)
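A minimal sketch of the kind of mismatch described here, using only the standard library. The thread does not say which default encoding was in effect, so cp1254 (a Turkish Windows code page) is an assumption chosen for illustration, and the temporary file merely stands in for texts.txt.

```python
import os
import string
import tempfile

ascii_word_chars = set(string.digits + string.ascii_letters + "_")

sample = "ak\u015fam"  # "akşam", written with an escape for the non-ASCII letter

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "texts.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading UTF-8 bytes with a different codec silently yields other characters.
    with open(path, "r", encoding="cp1254") as f:
        wrong = f.read()

    # Passing the encoding explicitly recovers the original text.
    with open(path, "r", encoding="utf-8") as f:
        right = f.read()

# The character directly before "am" in the mis-decoded text.
char_before_am_wrong = wrong[wrong.index("am") - 1]

print(repr(wrong), repr(right))
print(char_before_am_wrong in ascii_word_chars)
```

The mis-decoded string still contains "am", but the characters around it are no longer the ones in the original file, so any boundary check runs against different text than the one the keyword list was written for. As noted earlier in the thread, non-ASCII letters may still need to be added to non_word_boundaries even when the file is decoded correctly.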