internationalize word boundary checks

vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.

MIT License


internationalize word boundary checks #49

Open aseifert opened 6 years ago

aseifert commented 6 years ago

Hi there,

I think the only safe way to deal with issue #48 is to test against the \W class [1]. Judging from the benchmarks linked at https://github.com/vi3k6i5/flashtext#why-not-regex, though, this seems to run slower by a factor of 1-2.

Best, Alex

[1] Quoting the Python docs:

\b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
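To make the quoted definition concrete, here is a small sketch using only the standard `re` module. It reruns the examples from the docs and then tries a Cyrillic keyword; in Python 3, `str` patterns are Unicode-aware by default, so `\w` and `\b` treat Cyrillic letters as word characters. The variable names are illustrative only.

```python
import re

# The examples from the quoted documentation.
foo = re.compile(r"\bfoo\b")
samples = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
doc_results = [bool(foo.search(s)) for s in samples]

# A non-ASCII keyword: only the standalone word should match,
# not the occurrences inside "порок" or "роковой".
cyrillic_matches = re.findall(r"\bрок\b", "рок порок роковой")

print(doc_results)       # [True, True, True, True, False, False]
print(cyrillic_matches)  # ['рок']
```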

coveralls commented 6 years ago

Coverage increased (+0.7%) to 100.0% when pulling 9b6b187b2b67ad279092d3f36f3dd4d64b8994a9 on aseifert:master into 5591859aabe3da37499a20d0d0d6dd77e480ed8d on vi3k6i5:master.

ioistired commented 5 years ago

Another way, based on https://stackoverflow.com/a/2998550:

import unicodedata

def is_word_char(c, _categories=frozenset({'Ll', 'Lu', 'Lt', 'Lo', 'Lm', 'Nd', 'Pc'})):
    return unicodedata.category(c) in _categories
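A rough sketch of how a category-based check like this could drive a whole-word test. `find_whole_words` is a hypothetical helper written for illustration; it uses plain `str.find` and is not how flashtext matches keywords internally.

```python
import unicodedata

# Letter, decimal-digit and connector-punctuation categories, as in the snippet above.
WORD_CATEGORIES = frozenset({'Ll', 'Lu', 'Lt', 'Lo', 'Lm', 'Nd', 'Pc'})

def is_word_char(c):
    return unicodedata.category(c) in WORD_CATEGORIES

def find_whole_words(text, keyword):
    """Hypothetical helper: start offsets where keyword is not glued to a word character."""
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        # A boundary is either the edge of the string or a non-word character.
        left_ok = start == 0 or not is_word_char(text[start - 1])
        right_ok = end == len(text) or not is_word_char(text[end])
        if left_ok and right_ok:
            hits.append(start)
        start = text.find(keyword, start + 1)
    return hits

print(find_whole_words("рок порок роковой", "рок"))  # [0]
```

Only the standalone "рок" at offset 0 is reported; the occurrences inside "порок" and "роковой" are rejected because a letter sits next to them.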

senpos commented 4 years ago

Another way to do it:

from functools import lru_cache

from flashtext import KeywordProcessor

class NonWordBoundaries:
    """Set-like object: a character is "in" it if any predicate accepts it."""

    def __init__(self, *predicates):
        self.predicates = predicates

    # Cached per (self, ch); flashtext only needs the `in` operator to work.
    @lru_cache(maxsize=128)
    def __contains__(self, ch):
        for predicate in self.predicates:
            if predicate(ch):
                return True
        return False

def main():
    words_to_search = ["рок"]

    keyword_processor = KeywordProcessor()
    # Treat any Unicode letter or digit as part of a word.
    keyword_processor.set_non_word_boundaries(NonWordBoundaries(str.isalpha, str.isdigit))
    keyword_processor.add_keywords_from_list(words_to_search)
    keywords_found = keyword_processor.extract_keywords('рок порок роковой')
    print(keywords_found)

if __name__ == "__main__":
    main()

Not sure about performance though. But at least it is easy to modify the behaviour.

alexpeaceca commented 4 years ago

The benchmarks vs. regex are for the English-only character set. Does extending the word boundaries like this affect flashtext's performance in any significant way?
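One way to get a first impression is to time the per-character check in isolation. The sketch below compares membership in an ASCII-only set (assumed here to resemble the library's default boundary set) against the `unicodedata.category` lookup suggested earlier in this thread. It does not measure flashtext itself, and all names are illustrative; actual numbers depend on the machine and the text.

```python
import string
import timeit
import unicodedata

# Assumption: an ASCII-only set of word characters, standing in for the default.
ASCII_WORD_CHARS = set(string.digits + string.ascii_letters + "_")
WORD_CATEGORIES = frozenset({'Ll', 'Lu', 'Lt', 'Lo', 'Lm', 'Nd', 'Pc'})

def ascii_check(ch):
    return ch in ASCII_WORD_CHARS

def unicode_check(ch):
    return unicodedata.category(ch) in WORD_CATEGORIES

def time_check(check, sample, number=20):
    # Best of three runs; each run applies the check to every character `number` times.
    return min(timeit.repeat(lambda: [check(ch) for ch in sample], number=number, repeat=3))

sample = "The quick brown fox, рок порок роковой 123_ " * 100
ascii_seconds = time_check(ascii_check, sample)
unicode_seconds = time_check(unicode_check, sample)

print(f"ASCII set lookup:       {ascii_seconds:.4f} s")
print(f"Unicode category check: {unicode_seconds:.4f} s")
```

Timing a full `extract_keywords` call with each boundary definition on a realistic corpus would be needed to answer the question properly.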