vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License
5.59k stars 599 forks source link

Please stop using random crap algorithms and use correct Aho–Corasick #35

Closed mayorovp closed 6 years ago

mayorovp commented 6 years ago
kp = KeywordProcessor()
kp.add_keyword("a a")
kp.add_keyword("a b")
print(kp.extract_keywords("a a b", span_info = True)) # where is "a b"?
print(kp.extract_keywords("a a b a a", span_info = True)) # where is "a b" and second "a a"?

kp2 = KeywordProcessor()
kp2.add_keyword("a b")
kp2.add_keyword("b c d e")
print(kp2.extract_keywords("a b c d e", span_info = True)) # where is _longest_ "b c d e"?

You can use regular Aho–Corasick for you program if you include word boundary symbol in keywords.

Or you can use whole words as symbols - this will be much faster in python.

vi3k6i5 commented 6 years ago

@mayorovp Does Aho-Corasick do this?


>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple, and new delhi.')
>>> new_sentence
>>> # 'I love New York and NCR region.'```
mayorovp commented 6 years ago

Why not?

OK, in case of replace_keywords keywords cannot overlap. But in extract_keywords they can.

vi3k6i5 commented 6 years ago

No they can't in extract also Java script and java should be treated differently.. if both the terms are there in the vocab..

On Tue 5 Dec, 2017, 19:48 mayorovp, notifications@github.com wrote:

Why not?

OK, in case of replace_keywords keywords cannot overlap. But in extract_keywords they can.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/vi3k6i5/flashtext/issues/35#issuecomment-349317142, or mute the thread https://github.com/notifications/unsubscribe-auth/AC-Nwv0JXwKZMuQn7CMyZdDyZ0in6PLLks5s9VC4gaJpZM4Q2Qv8 .

vi3k6i5 commented 6 years ago

@mayorovp give a simple Aho-Corasick code (there are many libraries, use any you like) and solve this simple thing for me:

>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple, and new delhi.')
>>> new_sentence
>>> # 'I love New York and NCR region.'
mayorovp commented 6 years ago

No they can't in extract also Java script and java should be treated differently.. if both the terms are there in the vocab..

Yes, you can use Aho-Corasick for that.

give a simple Aho-Corasick code (there are many libraries, use any you like) and solve this simple thing for me:

No, i will not write your lib for you.

here you can find the code: https://neerc.ifmo.ru/wiki/index.php?title=%D0%90%D0%BB%D0%B3%D0%BE%D1%80%D0%B8%D1%82%D0%BC_%D0%90%D1%85%D0%BE-%D0%9A%D0%BE%D1%80%D0%B0%D1%81%D0%B8%D0%BA

vi3k6i5 commented 6 years ago

No, i will not write your lib for you.

So you agree there is no existing library which does this and I will have to write my own.

@mayorovp : Only if someone would have written a library which extracted keywords considering word boundaries. Then everyone would not have to modify and implement their own Aho-Corasick algorithm.

Such a library would be so useful.

Oh wait there is one. It's called FlashText.

mayorovp commented 6 years ago

You "modified" Aho-Corasick is just a trie. You library cannot handle overlapped keywords like original Aho-Corasick can.

Look again at my tests in first message.

mayorovp commented 6 years ago

If you cannot think in terms of a, b and c - imagine that you need to find "javascript programming" and "programming resources" in text "javascript programming resources":


kp = KeywordProcessor()
kp.add_keyword("javascript programming")
kp.add_keyword("programming resources")
print(kp.extract_keywords("javascript programming resources")) # oops, where is my second keyword?!
vi3k6i5 commented 6 years ago

@mayorovp Dude, did you even read the problem statement mentioned here: https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f

FlashText is designed to go for the first longest match, that's how it processes. That's what I wanted.. First longest match.. That's how I designed it..

mayorovp commented 6 years ago

Than stop naming it "modified Aho-Corasick".

vi3k6i5 commented 6 years ago

you named it that, no one else on earth..

This is the official statement: It's a custom algorithm based on Aho-Corasick algorithm and Trie Dictionary.

mayorovp commented 6 years ago

No, you custom algorithm based on trie and has no relations with Aho-Corasick.

vi3k6i5 commented 6 years ago

Of course it has. I read Aho-Corasick, I really liked how it processes character by character input instead of tokenised words. I used that approach in FlashText.

mayorovp commented 6 years ago

Aho-Corasick use suffix (prefix) links. Where is suffix links in your custom algorithm?

If you just "processes character by character input instead of tokenised words" than you uses common trie.

vi3k6i5 commented 6 years ago

trie is a data-structure for storing data. It's not an algorithm.

Aho-Corasick use suffix (prefix) links. Where is suffix links in your custom algorithm?

Is it like a rule in some constitution.. It's up to me what part of the algorithm I use in my library.

Dude I can't go at this all day long. I have better things to do in life.

I built a library, It helped me, I liked it and shared it. lot of other people are liking it.

I am closing this question as it's not a question anymore rather a battle of opinions. If you have any suggestions please leave them in comments, I will try my best to be unbiased and attend to them.

Or send pull request if you think something can be different, or write your own implementation. There is no one stopping you.