vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.57k stars, 599 forks

Replacing text without word boundary markers #50

Open Ekkalak-T opened 6 years ago

Ekkalak-T commented 6 years ago

Is it possible to find and replace keywords in a sentence that has no word boundary markers?

This problem is very common in East Asian languages such as Thai, Chinese and Japanese, where words are typically written together without word boundary markers. For simplicity, let me give an example in English.

Example in English

test_dict = ["This", "is", "an", "example"]
text = "Thisisanexample"
expected output: <mark>This</mark><mark>is</mark><mark>an</mark><mark>example</mark>

Currently I am using a regex, and it is very slow on the entire corpus because my dictionary has more than 600K words. I am looking for an algorithm that runs faster than a regex.

1. Regex

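import re
# Escape each word so that regex metacharacters in the dictionary are matched literally.
pattern = '|'.join(re.escape(word) for word in test_dict)
namesRegex = re.compile(r'(' + pattern + ')', re.I)
replaced = namesRegex.sub(r'<mark>\1</mark>', text)
print(replaced)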
Output:
`<mark>This</mark><mark>is</mark><mark>an</mark><mark>example</mark>`

2. Flashtext

from flashtext import KeywordProcessor
processor = KeywordProcessor()
for word in test_dict:
    # Map each keyword to its <mark>-wrapped replacement.
    processor.add_keyword(word, "<mark>" + word + "</mark>")

found = processor.replace_keywords(text)
print(found)
Output:
`Thisisanexample`
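
The unchanged Flashtext output above is consistent with a matcher that only accepts keywords delimited by word boundaries. As a rough illustration of what a boundary-free alternative could look like, here is a minimal sketch of a greedy longest-match scan over a character trie in plain Python. It is not part of Flashtext. The names `build_trie` and `mark_keywords` are hypothetical and introduced only for this example. The sketch assumes case-sensitive matching (unlike the `re.I` regex above) and that taking the longest dictionary word at each position is acceptable.

```python
def build_trie(words):
    """Build a character trie; the key None marks the end of a dictionary word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def mark_keywords(text, trie, open_tag="<mark>", close_tag="</mark>"):
    """Wrap the longest dictionary word found at each position, ignoring word boundaries."""
    out = []
    i, n = 0, len(text)
    while i < n:
        node = trie
        end = -1  # end index of the longest match starting at i, or -1 if none
        j = i
        while j < n:
            node = node.get(text[j])
            if node is None:
                break
            j += 1
            if None in node:
                end = j
        if end == -1:
            # No dictionary word starts here: copy one character and move on.
            out.append(text[i])
            i += 1
        else:
            out.append(open_tag + text[i:end] + close_tag)
            i = end
    return "".join(out)


test_dict = ["This", "is", "an", "example"]
trie = build_trie(test_dict)
print(mark_keywords("Thisisanexample", trie))
# <mark>This</mark><mark>is</mark><mark>an</mark><mark>example</mark>
```

Each position is scanned at most as far as the longest dictionary word, so the cost does not grow with the number of words the way a large regex alternation can. Greedy longest match is not full word segmentation, though: on ambiguous input it may choose a long word that prevents a later match. A multi-pattern automaton such as Aho-Corasick is another option when all overlapping matches are needed.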

SeekPoint commented 6 years ago

I would also like a version that works without word boundaries.