vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.6k stars, 599 forks

[bug] set of word boundary characters too restrictive #48

Open aseifert opened 6 years ago

aseifert commented 6 years ago

Hello there,

First of all, thanks for the amazing algorithm, it's really useful!

It turns out that only a very restrictive set of characters (ASCII letters, digits and the underscore) is used as non_word_boundaries. This is a problem for many languages, e.g. German:

from flashtext import KeywordProcessor
kwp = KeywordProcessor()
kwp.add_keyword("lt.")
kwp.extract_keywords("Damit galt es als so gut wie fix, dass Vueling den Zuschlag erhält.")
# I would expect this to be empty, but the "lt." at the end of "erhält." is matched
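To see why this happens without running flashtext itself, here is a minimal standard-library sketch. It assumes the default non_word_boundaries set consists of ASCII letters, digits and the underscore; the constant name below is made up for illustration and is not part of the library.

```python
import string

# Assumed default (hypothetical name): ASCII letters, digits and underscore.
DEFAULT_NON_WORD_BOUNDARIES = set(string.digits + string.ascii_letters + "_")

sentence = "Damit galt es als so gut wie fix, dass Vueling den Zuschlag erhält."

# Letters of the sentence that are NOT in the default set and would
# therefore be treated like a space or punctuation mark.
letters_seen_as_boundaries = sorted(
    {ch for ch in sentence if ch.isalpha() and ch not in DEFAULT_NON_WORD_BOUNDARIES}
)
print(letters_seen_as_boundaries)  # ['ä']
```

Because "ä" counts as a boundary under this assumption, "erhält." is effectively read as "erh", a boundary, and then "lt.", which is why the keyword matches.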

The problem can be fixed (for German) by adjusting the property non_word_boundaries:

kwp.non_word_boundaries = kwp.non_word_boundaries.union(list("ÖÄÜöäüß"))
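Listing the extra characters by hand only covers one language at a time. One possible generalization, sketched here with the standard library only, is to collect every character in a code point range that Python's str.isalnum() accepts. The function name and the default upper bound (the end of the Latin Extended-B block) are assumptions chosen for illustration; this is not something flashtext provides.

```python
import string


def unicode_non_word_boundaries(max_codepoint=0x024F):
    """Return all characters up to max_codepoint that str.isalnum() accepts, plus '_'.

    Hypothetical helper: the range bound is an arbitrary illustrative choice.
    """
    chars = {chr(cp) for cp in range(max_codepoint + 1) if chr(cp).isalnum()}
    chars.add("_")
    return chars


boundaries = unicode_non_word_boundaries()

# The German characters from the fix above are covered ...
print(set("ÖÄÜöäüß") <= boundaries)
# ... and so is the assumed ASCII default.
print(set(string.ascii_letters + string.digits + "_") <= boundaries)

# With flashtext installed, the set could be assigned like the union above:
# kwp.non_word_boundaries = boundaries
```

Note that str.isalnum() also accepts characters such as superscript digits, so the resulting set is broader than "letters of European alphabets"; whether that is desirable depends on the keywords being matched.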

Would you consider internationalizing the word boundaries or is this restrictive behavior on purpose?

Thanks, Alex

vi3k6i5 commented 6 years ago

Hi Alex,

I only know English, so I couldn't make it work for other languages: I wouldn't be able to understand or test how it behaves.

> Would you consider internationalizing the word boundaries or is this restrictive behavior on purpose?

I would consider it, but I don't know how. You are free to make whatever changes make sense to you.

Please send a pull request with test cases if possible. I would really appreciate that :)

Thanks, Vikash

(Sent by email in reply to Alexander Seifert's original message of Mon, Mar 19, 2018, 9:11 PM.)