go-ego / gse

Go efficient multilingual NLP and text segmentation; support English, Chinese, Japanese and others.
Apache License 2.0
2.6k stars 215 forks source link

feature: any plan on implementing the bm25 algorithm ? #181

Open CocaineCong opened 1 year ago

CocaineCong commented 1 year ago

Description

I see the bm25.go file in path(hmm/bm25/bm25.go), so I wanna ask author any plan on bm25 ? 😃

If author had the plan on implementing the bm25, I want to make it. 🫡

image
vcaesar commented 1 year ago

Emm, A lot of features are planned, but my time is limited, and there any contributions welcome.

CocaineCong commented 1 year ago

Hey, @vcaesar there is my plan on the bm25 algorithm. 🧑🏻‍💻

image

First, in the file pathhmm/idf/tag_extracker.go, I think the struct of Segment should be extracted to the new file.

type Segment struct {
    text   string
    weight float64
}

because this structure of segment should be in the state of being called at all times. if in tag_extracker, maybe there will be have some cycle import trouble. 🥲

Secondly, still in the file hmm/idf/tag_extracker.go , the struct of TagExtracter , the field of Idf should be abstracted into a relevance computation to implementing more algorithm such as IDF, TFIDF, BM25 .

before:

type TagExtracter struct {
    seg gse.Segmenter

    Idf      *Idf
    stopWord *StopWord
}

after:

type TagExtracter struct {
    seg gse.Segmenter
    // calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
    Relevance relevance.Relevance
    stopWord *StopWord
}

what's more, the field of stopWord should in the struct of Relevance. since this field is used in the word splitting, it should be used in the struct of relevance.

before:

type TagExtracter struct {
    seg gse.Segmenter
    // calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
    Relevance relevance.Relevance
    stopWord *StopWord
}

after:

type IDF struct {
    median float64
    freqs []float64
    Base
}

type BM25 struct {
    K1 float64
    N float64
    Base
}

type Base struct {
    // loading some stop words
    StopWord *stop_word.StopWord

    // loading segmenter for cut word
    Seg gse.Segmenter
}

And then, I want to implementing the Relevance by Strategy Pattern.

such as :

// Relevance easily scalable Relevance calculations (for idf, tf-idf, bm25 and so on)
type Relevance interface {
    // AddToken add text, frequency, position on obj
    AddToken(text string, freq float64, pos ...string) error

    // LoadDict load file from incoming parameters,
    // if incoming params no exist, will load file from default file path
    LoadDict(files ...string) error

    // LoadDictStr loading dict file by file path
    LoadDictStr(pathStr string) error

    // LoadStopWord loading word file by filename
    LoadStopWord(fileName ...string) error

    // Freq find the frequency, position, existence information of the key
    Freq(key string) (float64, string, bool)

    // TotalFreq the total number of tokens in the dictionary
    TotalFreq() float64

    // FreqMap get frequency map
    // key: word, value: frequency
    FreqMap(text string) map[string]float64

    // ConstructSeg return the segment with weight
    ConstructSeg(text string) segment.Segments
}

default IDF:

func NewIDF() Relevance {
    idf := &IDF{
        freqs: make([]float64, 0),
    }
    idf.StopWord = stop_word.NewStopWord()
    return Relevance(idf)
}

implement the interface function

// AddToken add a new word with IDF into the dictionary.
func (i *IDF) AddToken(text string, freq float64, pos ...string) error {
    err := i.Seg.AddToken(text, freq, pos...)

    i.freqs = append(i.freqs, freq)
    sort.Float64s(i.freqs)
    i.median = i.freqs[len(i.freqs)/2]
    return err
}

// LoadDict load the idf dictionary
func (i *IDF) LoadDict(files ...string) error {
    if len(files) <= 0 {
        files = i.Seg.GetIdfPath(files...)
    }

    return i.Seg.LoadDict(files...)
}

// Freq return the IDF of the word
func (i *IDF) Freq(key string) (float64, string, bool) {
    return i.Seg.Find(key)
}

....
image

any problem about that? if no big problem, I'll pr bit by bit to make it possible. 🫡