jingfelix / EasySearch

Apache License 2.0

Understanding the paragraph structure of novels #3

Closed Leizhenpeng closed 11 months ago

Leizhenpeng commented 11 months ago

Whoosh has built-in support for nested (hierarchical) content: https://whoosh.readthedocs.io/en/latest/api/query.html#special-queries https://whoosh.readthedocs.io/en/latest/nested.html#

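In Whoosh, word positions can be stored with the document. The parameters named here, store_positions=True and store_termvector=True, do not appear among the documented options of fields.TEXT; the closest documented equivalents are phrase=True (the default, which stores positions) and vector=True (which stores term vectors).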

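However, you can also add a field when adding each document that stores the full content of its paragraph, and then retrieve that field whenever you need it.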

import os

from whoosh import fields, index

schema = fields.Schema(type=fields.ID, text=fields.TEXT(stored=True), paragraph=fields.TEXT(stored=True))

# create_in() expects the index directory to already exist
os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)

with ix.writer() as w:
    # Store each chapter and each paragraph as its own document;
    # each group() block keeps a chapter and its paragraphs together
    with w.group():
        w.add_document(type="chap", text="Chapter 1")
        w.add_document(type="p", text="This is the first paragraph of chapter 1.", paragraph="chapter 1.")
        w.add_document(type="p", text="This is the second paragraph of chapter 1.", paragraph="chapter 1.")
    with w.group():
        w.add_document(type="chap", text="Chapter 2")
        w.add_document(type="p", text="This is the first paragraph of chapter 2.", paragraph="chapter 2.")
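The documents above are written by hand. As a minimal sketch of how such records might be produced from raw text, the helper below turns a chapter into one "chap" record plus one record per sentence, each carrying its full paragraph. The function name paragraph_records and the naive regex sentence splitter are assumptions made for illustration; they are not part of Whoosh or of this repository.

```python
import re


def paragraph_records(chapter_title, paragraphs):
    """Return a chapter record followed by one record per sentence.

    Hypothetical helper: field names mirror the schema used in this thread
    (type, text, paragraph).
    """
    records = [{"type": "chap", "text": chapter_title}]
    for para in paragraphs:
        # Naive split after '.', '!' or '?' followed by whitespace (an assumption;
        # real novels would need a proper sentence tokenizer)
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", para.strip()) if s]
        for sentence in sentences:
            # Each sentence keeps a copy of the paragraph it came from
            records.append({"type": "p", "text": sentence, "paragraph": para})
    return records


records = paragraph_records(
    "Chapter 1",
    ["It was late. Nobody spoke.", "Morning came."],
)
for record in records:
    print(record)
```

Each record could then be passed to the writer as w.add_document(**record) inside a single w.group() block, so that a chapter and its sentences stay in one group.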

Then, when fetching the best-matching sentence, retrieve the whole paragraph as well:

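from whoosh.qparser import QueryParser

with ix.searcher() as s:
    qp = QueryParser("text", schema=ix.schema)
    q = qp.parse("first")

    results = s.search(q)

    # Text of the top-ranked hit, i.e. the best sentence containing the search term
    best_hit_sentence = results[0]["text"]
    # The paragraph stored alongside that hit
    paragraph = results[0]["paragraph"]

print(best_hit_sentence)
print(paragraph)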