micpst / minisearch

Restful, in-memory, full-text search engine
MIT License

[Question] - How can I handle indexing millions of data with this library? #18

Closed sujit-baniya closed 1 year ago

sujit-baniya commented 1 year ago

I've been working on a small library on top of this package. At the moment, I have millions of records to index, and indexing fails with the following panic:

panic: runtime error: slice bounds out of range [5:4]

goroutine 90 [running]:
golang.org/x/text/transform.String({0x3173148, 0xc00142b710}, {0xc00239da70, 0x4})
        /home/sujit/go/pkg/mod/golang.org/x/text@v0.10.0/transform/transform.go:650 +0xbe5

The error occurs in the tokenizer:

func normalizeToken(params *normalizeParams, config *Config) string {
    token := params.token

    if _, ok := stopWords[params.language][token]; config.EnableStopWords && ok {
        return ""
    }

    if stem, ok := stems[params.language]; config.EnableStemming && ok {
        token = stem(token, false)
    }

    // The error comes from here
    if normToken, _, err := transform.String(normalizer, token); err == nil {
        return normToken
    }

    return token
}

Can you please suggest a solution?

micpst commented 1 year ago

I cannot reproduce this error. Please provide the input data.
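One way to narrow down the input is to catch the panic next to the failing call and log the token that triggered it. The following is a minimal sketch, not part of minisearch: `safeCall` and the `broken` stand-in are hypothetical names, and the wrapper only helps if it runs in the same goroutine as the call that panics (for example, around the `transform.String` call inside `normalizeToken`).

```go
package main

import "fmt"

// safeCall runs fn on token and turns a panic into an error that records
// the offending token. fn is a placeholder for the normalization step;
// it is not a minisearch function.
func safeCall(fn func(string) string, token string) (out string, err error) {
	defer func() {
		if r := recover(); r != nil {
			// Include the raw bytes so invisible or invalid characters show up.
			err = fmt.Errorf("panic while normalizing %q (bytes % x): %v", token, token, r)
		}
	}()
	return fn(token), nil
}

func main() {
	// Deliberately broken stand-in that panics on tokens shorter than 5 bytes.
	broken := func(s string) string { return s[:5] }

	for _, tok := range []string{"cholera", "a00"} {
		out, err := safeCall(broken, tok)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("ok:", out)
	}
}
```

The logged token (and its bytes) can then be attached to the issue as a small reproducing input instead of the full data set.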

sujit-baniya commented 1 year ago

@micpst Here is the code I tried:

package main

import (
    "encoding/json"
    "fmt"
    "os"

    "github.com/micpst/minisearch/pkg/store"
    "github.com/micpst/minisearch/pkg/tokenizer"
)

type ICD struct {
    Code string `json:"code"`
    Desc string `json:"desc"`
}

func readData() (icds []ICD) {
    jsonData, err := os.ReadFile("icd10_codes.json")
    if err != nil {
        fmt.Printf("failed to read json file, error: %v", err)
        return
    }

    if err := json.Unmarshal(jsonData, &icds); err != nil {
        fmt.Printf("failed to unmarshal json file, error: %v", err)
        return
    }
    return
}

func main() {
    data := readData()
    db := store.New[ICD](&store.Config{
        DefaultLanguage: tokenizer.ENGLISH,
        TokenizerConfig: &tokenizer.Config{
            EnableStemming:  true,
            EnableStopWords: true,
        },
    })
    p := store.InsertBatchParams[ICD]{
        Documents: data,
        BatchSize: 100,
    }
    errs := db.InsertBatch(&p)
    if len(errs) > 0 {
        panic(errs)
    }
}

icd10_codes.json.zip

micpst commented 1 year ago

You have an error in your store schema definition: you need to add an index tag to each field you want to query.

type ICD struct {
    Code string `json:"code" index:"code"`
    Desc string `json:"desc" index:"desc"`
}
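To illustrate why the tag matters, here is a sketch of how a reflection-based indexer could discover fields through an `index` struct tag. It assumes the fields are found via reflection; `indexedFields` is a hypothetical helper written for this example and is not the library's actual implementation. The sample values are illustrative.

```go
package main

import (
	"fmt"
	"reflect"
)

type ICD struct {
	Code string `json:"code" index:"code"`
	Desc string `json:"desc" index:"desc"`
}

// indexedFields returns a map from index-tag name to field value for every
// exported string field that carries a non-empty `index` tag.
// Hypothetical helper for illustration only.
func indexedFields(doc any) map[string]string {
	out := make(map[string]string)
	v := reflect.ValueOf(doc)
	if v.Kind() == reflect.Ptr {
		v = v.Elem()
	}
	if v.Kind() != reflect.Struct {
		return out
	}
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name, ok := f.Tag.Lookup("index")
		if !ok || name == "" || !f.IsExported() || f.Type.Kind() != reflect.String {
			continue // fields without an index tag are invisible to the indexer
		}
		out[name] = v.Field(i).String()
	}
	return out
}

func main() {
	fields := indexedFields(ICD{Code: "A00", Desc: "Cholera"})
	fmt.Println(len(fields), fields["code"], fields["desc"])
}
```

With only `json` tags on the struct, a helper like this would return an empty map, so nothing would be searchable even though insertion reports no error.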

I still cannot reproduce the error. Even without this modification, db.InsertBatch works correctly.

sujit-baniya commented 1 year ago

@micpst Weird. Maybe it's because of map data: I've modified this repo to support map data. Could you add support for maps?

micpst commented 1 year ago

If you have modified the lib on your own, I can't really help. Currently there is no need to add map support; you can easily nest properties using the current implementation.
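As a rough sketch of that approach, map data can be converted into a typed struct with a nested struct field before it is inserted. Everything below is an assumption for illustration: `ICDDoc`, `Chapter`, and `fromMap` are hypothetical names, the sample values are made up, and the exact `index` tag convention for nested fields should be checked against the library's documentation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Chapter is a hypothetical nested property.
type Chapter struct {
	Title string `json:"title" index:"title"`
}

// ICDDoc is a hypothetical schema with one nested struct field.
// The index tag on Chapter is an assumption about the nested convention.
type ICDDoc struct {
	Code    string  `json:"code" index:"code"`
	Desc    string  `json:"desc" index:"desc"`
	Chapter Chapter `json:"chapter" index:"chapter"`
}

// fromMap converts loosely typed map data into ICDDoc with a JSON round trip.
// Keys that do not match a json tag are silently dropped.
func fromMap(m map[string]any) (ICDDoc, error) {
	var doc ICDDoc
	raw, err := json.Marshal(m)
	if err != nil {
		return doc, fmt.Errorf("marshal map: %w", err)
	}
	if err := json.Unmarshal(raw, &doc); err != nil {
		return doc, fmt.Errorf("unmarshal into ICDDoc: %w", err)
	}
	return doc, nil
}

func main() {
	doc, err := fromMap(map[string]any{
		"code":    "A00",
		"desc":    "Cholera",
		"chapter": map[string]any{"title": "Infectious diseases"},
	})
	if err != nil {
		fmt.Println("conversion failed:", err)
		return
	}
	fmt.Println(doc.Code, doc.Desc, doc.Chapter.Title)
}
```

The resulting slice of structs can then be passed as `Documents` to `store.InsertBatchParams`, as in the code posted earlier in this thread. The JSON round trip costs extra allocations, so for millions of records a direct field-by-field conversion may be preferable.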