vi3k6i5 / flashtext

Extract Keywords from sentence or Replace keywords in sentences.
MIT License

can't search overlapped words? #63

Open xuexcy opened 5 years ago

xuexcy commented 5 years ago

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword("ABC DE")
kp.add_keyword("DE FGHI")
kp.extract_keywords("ABC DE FGHI")

This returns ['ABC DE']. Why not ['ABC DE', 'DE FGHI']?

jdclarke5 commented 5 years ago

Second this. Is this a limitation of the algorithm, or a simple bug? If it is the former, it should at least be documented in the usage notes.

aneeshvartakavi commented 5 years ago

I was stuck at this too, and I tweaked the algorithm to match overlapping patterns. I will try to submit a pull request soon!

mickeysjm commented 5 years ago

I suspect that if we reverse the document and conduct keyword matching in the reversed order, we can get both.

from flashtext import KeywordProcessor

document = "ABC DE FGHI"
keywords = ["ABC DE", "DE FGHI"]

def extract_overlapping_keywords(document, keywords):
    res = []

    # Forward pass: ordinary left-to-right extraction.
    kp = KeywordProcessor()
    kp.add_keywords_from_list(keywords)
    forward_extractions = kp.extract_keywords(document)
    print("Forward extraction:", forward_extractions)
    res.extend(forward_extractions)

    # Backward pass: reverse the word order of every keyword and of the document.
    reversed_keywords = [" ".join(keyword.split(" ")[::-1]) for keyword in keywords]
    reversed_kp = KeywordProcessor()
    reversed_kp.add_keywords_from_list(reversed_keywords)
    reversed_document = " ".join(document.split(" ")[::-1])
    tmp = reversed_kp.extract_keywords(reversed_document)
    # Restore the original word order of each match.
    reversed_extraction = [" ".join(keyword.split(" ")[::-1]) for keyword in tmp]
    print("Backward extraction:", reversed_extraction)
    res.extend(reversed_extraction)

    # Note: a keyword found in both passes appears twice in res.
    return res

print(extract_overlapping_keywords(document, keywords))

Vineeth-Mohan commented 5 years ago

Plus one on this.

wangpeipei90 commented 5 years ago

Keyword matching in the reversed order won't work once three or more keywords overlap in a chain. For example:

document = "ABC DEF GHI JKL"
keywords = ["ABC DEF", "DEF GHI", "GHI JKL"]

In both the forward and backward directions, we get only "ABC DEF" and "GHI JKL"; "DEF GHI" is never returned.
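
For anyone who wants to experiment with this outside the library, here is a minimal sketch of the difference between a scan that resumes after each match and one that retries at every word. It is not flashtext's implementation or API: `extract_keywords` and `allow_overlap` are hypothetical names, matching is case-sensitive, and words are split on whitespace only. The non-overlapping mode assumes the scanner skips past a match before continuing, which is consistent with the outputs reported in this thread.

```python
def extract_keywords(document, keywords, allow_overlap=False):
    """Word-level longest-match scan (hypothetical helper, not flashtext's API)."""
    tokens = document.split()
    # Map each keyword's word tuple back to the original keyword string.
    phrases = {tuple(k.split()): k for k in keywords}
    max_len = max((len(p) for p in phrases), default=0)

    found = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate first so longer keywords win at this start.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                found.append(phrases[candidate])
                match_len = n
                break
        # Non-overlapping mode jumps past the match; overlapping mode
        # advances one word so a later keyword can reuse matched words.
        i += 1 if (allow_overlap or match_len == 0) else match_len
    return found


document = "ABC DEF GHI JKL"
keywords = ["ABC DEF", "DEF GHI", "GHI JKL"]

print(extract_keywords(document, keywords))
# ['ABC DEF', 'GHI JKL']
print(extract_keywords(document, keywords, allow_overlap=True))
# ['ABC DEF', 'DEF GHI', 'GHI JKL']
```

Retrying from every word start costs more than a single pass, since each position may test up to `max_len` candidates, but it reports every keyword in a chain of overlaps regardless of scan direction.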