Open AlpUygur opened 4 years ago
Hi @AlpUygur,
Just add the bad words to the KeywordProcessor using the add_keyword method, and make sure case_sensitive=True is set. I hope this solves your issue.
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Bad word 1')
>>> keyword_processor.add_keyword('Bad word 2')
>>> keywords_found = keyword_processor.extract_keywords('I have Bad word 1 and Bad word 2.')
>>> keywords_found
>>> # ['Bad word 1', 'Bad word 2']
Kind Regards, Nandan Thakur
Hello, thanks for your answer, but it didn't work in my case.
When I try to add the words in a for loop, it says
"keyword_processor.add_keyword(content[i])
TypeError: list indices must be integers or slices, not str"
and I did not want to add all 694 of them by hand.
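The loop itself is not shown in the thread, so the following is only a sketch of one likely cause, assuming it was written as `for i in content:` and then indexed with `content[i]`. In that case `i` is already a string from the list, not a position. The list `content` below is a hypothetical stand-in for the 694-word list.

```python
# Hypothetical stand-in for the real list of 694 bad words.
content = ["bad word 1", "bad word 2"]

# Assumed failing pattern: `i` is an element (a str), so content[i] raises TypeError.
try:
    for i in content:
        content[i]
except TypeError as exc:
    error_message = str(exc)

# Iterating over the elements directly avoids indexing altogether.
collected = []
for word in content:
    # With flashtext this line would be: keyword_processor.add_keyword(word)
    collected.append(word)

print(error_message)
print(collected)
```

flashtext also provides add_keywords_from_list(), which takes the whole list in one call and makes the loop unnecessary.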
Can you share a sample that fails?
For example, I am looking for "am" in a text. It finds "am" when there is "cam" in the text.
This should never happen. Can you pick that line, make a working example, and share it?
--
Vikash
@AlpUygur this only happens when the "c" in "cam", for whatever reason, is not part of non_word_boundaries. Depending on the script of your input text, this can happen.
import string
non_word_boundaries = set(string.digits + string.ascii_letters + '_')
Then check if that "c" is in non_word_boundaries.
If it is not, you have to manually add it to the non_word_boundaries of your KeywordProcessor instance via add_non_word_boundary().
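The check described above can be sketched with the standard library alone. This uses the default set quoted above and, as sample input, the word "akşamüzeri" that comes up later in this thread; it is an illustration of the check, not flashtext's own code.

```python
import string

# Default word characters, as quoted in the comment above.
non_word_boundaries = set(string.digits + string.ascii_letters + "_")

text = "akşamüzeri"

# Characters that would be treated as word boundaries: they split the
# text, so a keyword sitting between them can match on its own.
missing = [ch for ch in text if ch not in non_word_boundaries]
print(missing)
```

Any character reported here is a candidate for add_non_word_boundary().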
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")

def isBad(text, keyword_processor):
    # Return True (and print the hits) if any bad word is found in the text.
    keywords_found = keyword_processor.extract_keywords(text)
    if len(keywords_found) > 0:
        print(keywords_found)
        return True
    return False

with open(r"texts.txt", 'r') as f:
    text = f.read()
print(isBad(text, keyword_processor))
Output: ['am'] True
The text file and the bad words are here. I looked at the file and there is no word "am" in it, but it still finds it. There are only occurrences of "am" inside other words.
@iwpnd
Can’t be bothered. Re-read my last comment and read up on how flashtext treats word boundaries.
It did not change anything when I added a non-word boundary.
@iwpnd
@AlpUygur Just my luck then I guess. :P
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")
text = "akşamüzeri"
keyword_processor.extract_keywords(text)
>> ["am"]
Changing the non-word boundaries:
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword_from_file(r"badwords.txt", encoding="utf-8")
keyword_processor.non_word_boundaries.update(["ş", "ü"])
text = "akşamüzeri"
keyword_processor.extract_keywords(text)
>> []
@AlpUygur maybe try uninstalling flashtext and reinstalling it?
I can't find a reason why changing the non-word boundaries would have no effect.
Follow the steps mentioned by @iwpnd.
Thanks @iwpnd for the clear implementation :)
Kind Regards, Nandan Thakur
Thanks for the implementation. It is very clear. On the other hand, I solved my problem by changing the encoding of the file I read to utf-8. Because the bad words are in UTF-8, the text file must be read as UTF-8 too. I did not see this difference for a long time, sorry :)
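The effect of the encoding can be shown without flashtext. This is a minimal sketch, assuming the text file is stored as UTF-8 and was being read with a different default encoding; latin-1 is used here purely as a stand-in, since the thread does not say which encoding the original machine used. The temporary file is a hypothetical substitute for texts.txt.

```python
import os
import tempfile

text = "akşamüzeri"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "texts.txt")

    # Store the text as UTF-8, like the bad-words file in the thread.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the wrong encoding (latin-1 as a stand-in) garbles the
    # non-ASCII letters, so "ş" and "ü" never reach the keyword matcher.
    with open(path, "r", encoding="latin-1") as f:
        wrong = f.read()

    # Reading with an explicit encoding="utf-8" round-trips the text.
    with open(path, "r", encoding="utf-8") as f:
        right = f.read()

print(repr(wrong))
print(repr(right))
```

If the file is decoded incorrectly, the characters on either side of "am" are no longer "ş" and "ü", which would explain why adding those two letters as non-word boundaries appeared to change nothing.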
Hi,
I am using flashtext to search for 694 bad words in some documents, in order to tag whether they contain bad language. But I need exact matches, because some words contain bad words inside them without being bad words themselves. How can I restrict the search to exact matches?