Open xuexcy opened 5 years ago
Second this. Is this a limitation of the algorithm, or a simple bug? If it is the former, it should at least be documented in the usage notes.
I was stuck at this too, and I tweaked the algorithm to match overlapping patterns. I will try to submit a pull request soon!
I suspect that if we reverse the document and conduct keyword matching in the reversed order, we can get both.
from flashtext import KeywordProcessor

document = "ABC DE FGHI"
keywords = ["ABC DE", "DE FGHI"]

def extract_overlapping_keywords(document, keywords):
    res = []

    # Forward pass: ordinary left-to-right extraction.
    kp = KeywordProcessor()
    kp.add_keywords_from_list(keywords)
    forward_extractions = kp.extract_keywords(document)
    print("Forward extraction:", forward_extractions)
    res.extend(forward_extractions)

    # Backward pass: reverse the word order of both the keywords and the document.
    reversed_keywords = [" ".join(keyword.split(" ")[::-1]) for keyword in keywords]
    reversed_kp = KeywordProcessor()
    reversed_kp.add_keywords_from_list(reversed_keywords)
    reversed_document = " ".join(document.split(" ")[::-1])
    tmp = reversed_kp.extract_keywords(reversed_document)

    # Restore the original word order of each match.
    reversed_extraction = [" ".join(keyword.split(" ")[::-1]) for keyword in tmp]
    print("Backward extraction:", reversed_extraction)
    res.extend(reversed_extraction)
    return res

extract_overlapping_keywords(document, keywords)
Plus one on this.
Keyword matching in the reversed order won't work once three or more keywords overlap in a chain. For example:
document = "ABC DEF GHI JKL"
keywords = ["ABC DEF", "DEF GHI", "GHI JKL"]
In both the forward and the backward direction, we get only "ABC DEF" and "GHI JKL"; "DEF GHI" is missed either way.
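For what it's worth, here is a minimal sketch of one way to get every overlapping match on a chained example like the one above. It does not use flashtext at all and is not how flashtext works internally: it just tries a match at every word position, so an earlier match never "consumes" words that a later keyword needs. The function name `extract_all_overlapping` is made up for illustration, and it assumes whitespace-separated, case-sensitive keywords.

```python
def extract_all_overlapping(document, keywords):
    """Return every keyword found in document, including overlapping ones.

    Assumption: words are separated by whitespace and matching is
    case-sensitive. This is a plain sliding-window search, not flashtext.
    """
    tokens = document.split()
    # Map each keyword's word tuple back to the original keyword string.
    keyword_tokens = {tuple(k.split()): k for k in keywords}
    max_len = max((len(t) for t in keyword_tokens), default=0)

    matches = []
    for start in range(len(tokens)):
        # Try every window length from this start position.
        for length in range(1, max_len + 1):
            window = tuple(tokens[start:start + length])
            if len(window) < length:
                break  # ran past the end of the document
            if window in keyword_tokens:
                matches.append(keyword_tokens[window])
    return matches


print(extract_all_overlapping(
    "ABC DEF GHI JKL", ["ABC DEF", "DEF GHI", "GHI JKL"]
))
# ['ABC DEF', 'DEF GHI', 'GHI JKL']
```

The trade-off is speed: this checks up to max_len windows per word, so it gives up the single-pass behaviour that makes flashtext fast on large keyword lists.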
from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword("ABC DE")
kp.add_keyword("DE FGHI")
kp.extract_keywords("ABC DE FGHI")