Is it possible to find and replace a sentence without word boundary markers?
This kind of problem is very common in many East Asian languages such as Thai, Chinese and Japanese. In these languages, words are typically written together without word boundary markers. For simplicity, let me give an example in English.
Example in English
test_dict = ["This","is","an","example"]
text = "Thisisanexample"
Expected output: <mark>This</mark><mark>is</mark><mark>an</mark><mark>example</mark>
Currently, I am using a regex, but it is very slow on the entire corpus because my dictionary has more than 600K words. I am looking for an algorithm that runs faster than a regex.
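For reference, here is a minimal sketch of what the regex approach might look like on the English example. The exact pattern used on the real corpus is not shown in this post, so the helper name `mark_with_regex` and the longest-first ordering of the alternatives are assumptions made for illustration.

```python
import re

test_dict = ["This", "is", "an", "example"]
text = "Thisisanexample"

# Assumption: try longer words first so they win over shorter prefixes
alternatives = sorted(test_dict, key=len, reverse=True)

# One big alternation over the (escaped) dictionary words
pattern = re.compile("|".join(re.escape(w) for w in alternatives))


def mark_with_regex(s):
    """Wrap every dictionary match in <mark>...</mark> (hypothetical helper)."""
    return pattern.sub(lambda m: "<mark>" + m.group(0) + "</mark>", s)


print(mark_with_regex(text))
```

With 600K words, the alternation becomes very large, which is a plausible reason for the slowdown described above.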
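# test_dict and text are the values from the English example above
from flashtext import KeywordProcessor

processor = KeywordProcessor()
for word in test_dict:
    # Map each dictionary word to its highlighted form
    processor.add_keyword(word, "<mark>" + word + "</mark>")

# FlashText matches on word boundaries, so unsegmented text may come back unchanged
found = processor.replace_keywords(text)
print(found)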
What I have tried so far:
1. Regex
2. Flashtext
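One direction that might avoid a huge alternation is to load the dictionary into a trie and scan the text with greedy longest match. The following is only a sketch under that assumption: `build_trie`, `mark_longest_match` and the `END` sentinel are hypothetical names, and it has not been benchmarked against the 600K-word dictionary.

```python
END = None  # sentinel key marking "a dictionary word ends here"


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def mark_longest_match(s, trie):
    """Wrap the longest dictionary match at each position in <mark> tags."""
    out = []
    i, n = 0, len(s)
    while i < n:
        node = trie
        longest_end = -1
        j = i
        # Walk the trie as far as the text allows, remembering the last word end
        while j < n and s[j] in node:
            node = node[s[j]]
            j += 1
            if END in node:
                longest_end = j
        if longest_end > i:
            out.append("<mark>" + s[i:longest_end] + "</mark>")
            i = longest_end
        else:
            # No dictionary word starts here; copy the character unchanged
            out.append(s[i])
            i += 1
    return "".join(out)


trie = build_trie(["This", "is", "an", "example"])
print(mark_longest_match("Thisisanexample", trie))
```

Each position costs at most the length of the longest dictionary word, independent of the dictionary size. Note that greedy longest match can pick the wrong segmentation when words overlap ambiguously, so it may not be correct for every input.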