vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.57k stars 598 forks

Fix to extract Asian words not separated by space #102

Closed jwnz closed 4 years ago

jwnz commented 4 years ago

Korean is occasionally written without spaces separating the words. This PR aims to handle keywords that are in the dictionary but are not all extracted because they are "stuck together" in the input text.

import flashtext

# Register three keywords that appear back to back in the input below.
kp = flashtext.KeywordProcessor()
kp.add_keyword('한국')
kp.add_keyword('전력')
kp.add_keyword('공사')

kp.extract_keywords('한국전력공사')

Expected output: ['한국', '전력', '공사']

Actual output: ['한국', '공사']
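
To make the expected behavior concrete, here is a minimal standalone sketch of boundary-free matching. It is not the code in this PR and does not use flashtext internals; the function name `extract_unspaced` is hypothetical, and the greedy longest-match strategy is an assumption chosen only for illustration.

```python
def extract_unspaced(text, keywords):
    """Greedy longest-match scan that ignores word boundaries.

    Hypothetical helper for illustration; not part of flashtext.
    """
    # Try longer keywords first so a longer entry wins over its own prefix.
    by_length = sorted(set(keywords), key=len, reverse=True)
    found = []
    i = 0
    while i < len(text):
        for kw in by_length:
            if kw and text.startswith(kw, i):
                found.append(kw)
                i += len(kw)  # resume right after the match
                break
        else:
            i += 1  # no keyword starts here; move one character forward
    return found


print(extract_unspaced('한국전력공사', ['한국', '전력', '공사']))
# ['한국', '전력', '공사']
```

This linear scan over the keyword list is far slower than a trie, so it only shows which matches are wanted, not how they should be found efficiently.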

coveralls commented 4 years ago

Coverage Status

Coverage decreased (-0.3%) to 98.966% when pulling 6bb27adcddb3f8269b2c115b60dc50c10ef6708c on jwnz:master into 50c45f1f4a394572381249681046f57e2bf5a591 on vi3k6i5:master.
