go-ego / riot

Go open source, distributed, simple and efficient search engine. Warning: this is V1, a beta version with high memory consumption; V2 will rewrite all the code.
Apache License 2.0
6.11k stars 473 forks

TokenLoc out of length #123

Open JabinGP opened 3 years ago

JabinGP commented 3 years ago
package main

import (
    "log"

    "github.com/go-ego/riot"
    "github.com/go-ego/riot/types"
)

var (
    searcher = riot.Engine{}
)

func init() {
    initSearcher()
    initIndex()
}

func initSearcher() {
    searcher.Init(types.EngineOpts{
        Using:   3,
        GseDict: "zh",
        IndexerOpts: &types.IndexerOpts{
            IndexType: types.LocsIndex,
        },
    })
}

func initIndex() {
    docID := "1"
    content := "验证账户权限 运行一些简单的指令来验证账户的有效性 > show dbs admin 0.000GB config 0.000GB local 0.000GB > show users { \"_id\" : \"admin.admin\", \"userId\" : UUID(\"dc5760ea-c8c1-4f40-af5b-7d9d53779842\"), \"user\" : \"admin\", \"db\" : \"admin\", \"roles\" : [ { \"role\" : \"userAdminAnyDatabase\", \"db\" : \"admin\" } ], \"mechanisms\" : [ \"SCRAM-SHA-1\", \"SCRAM-SHA-256\" ] } "
    searcher.Index(docID,
        types.DocData{Content: content},
    )
    searcher.Flush()
}

func main() {
    keyword := "t"

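    res := searcher.SearchDoc(types.SearchReq{Text: keyword})
    if len(res.Docs) == 0 {
        // Guard against an index-out-of-range panic when nothing matches.
        log.Fatal("no documents matched")
    }

    // Compare the reported token offsets with the byte length of the content.
    log.Println("TokenLocs = ", res.Docs[0].TokenLocs)
    log.Println("len(content) = ", len(res.Docs[0].Content))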
}

Description

The first TokenLoc is 495, which is greater than len(content).

stuchilde commented 3 years ago

Maybe this is caused by the difference between Chinese characters and English letters.
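To make that hypothesis concrete, here is a minimal sketch of how byte offsets and character (rune) offsets diverge in a Go string that mixes Chinese and ASCII text. It uses only the standard library and does not confirm how riot computes TokenLocs; the helper names byteAndRuneLen and runeToByteOffset are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen returns the UTF-8 byte length of s (what len(s) reports)
// and the number of runes (characters) in s.
func byteAndRuneLen(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

// runeToByteOffset converts a rune index into the byte offset where that
// rune starts. It returns len(s) for the position just past the last rune
// and -1 if runeIdx is out of range.
func runeToByteOffset(s string, runeIdx int) int {
	i := 0
	for off := range s { // ranging over a string yields byte offsets of runes
		if i == runeIdx {
			return off
		}
		i++
	}
	if i == runeIdx {
		return len(s)
	}
	return -1
}

func main() {
	s := "验证账户权限 show dbs"

	nBytes, nRunes := byteAndRuneLen(s)
	fmt.Println(nBytes, nRunes) // 27 15: each Chinese character takes 3 bytes

	// The 's' of "show" is the 8th character (rune index 7) but starts at byte 19.
	fmt.Println(runeToByteOffset(s, 7)) // 19
}
```

If an offset and a length are measured in different units (bytes versus runes), or against differently processed copies of the text, one can exceed the other. Whether that is what happens here would need to be checked against riot's indexing code.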