Open CocaineCong opened 1 year ago
Emm, A lot of features are planned, but my time is limited, and there any contributions welcome.
Hey, @vcaesar there is my plan on the bm25 algorithm. 🧑🏻💻
First, in the file pathhmm/idf/tag_extracker.go
, I think the struct of Segment should be extracted to the new file.
type Segment struct {
text string
weight float64
}
because this structure of segment should be in the state of being called at all times. if in tag_extracker, maybe there will be have some cycle import trouble. 🥲
Secondly, still in the file hmm/idf/tag_extracker.go
, the struct of TagExtracter , the field of Idf should be abstracted into a relevance computation to implementing more algorithm such as IDF, TFIDF, BM25 .
before:
type TagExtracter struct {
seg gse.Segmenter
Idf *Idf
stopWord *StopWord
}
after:
type TagExtracter struct {
seg gse.Segmenter
// calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
Relevance relevance.Relevance
stopWord *StopWord
}
what's more, the field of stopWord should in the struct of Relevance. since this field is used in the word splitting, it should be used in the struct of relevance.
before:
type TagExtracter struct {
seg gse.Segmenter
// calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
Relevance relevance.Relevance
stopWord *StopWord
}
after:
type IDF struct {
median float64
freqs []float64
Base
}
type BM25 struct {
K1 float64
N float64
Base
}
type Base struct {
// loading some stop words
StopWord *stop_word.StopWord
// loading segmenter for cut word
Seg gse.Segmenter
}
And then, I want to implementing the Relevance by Strategy Pattern.
such as :
// Relevance easily scalable Relevance calculations (for idf, tf-idf, bm25 and so on)
type Relevance interface {
// AddToken add text, frequency, position on obj
AddToken(text string, freq float64, pos ...string) error
// LoadDict load file from incoming parameters,
// if incoming params no exist, will load file from default file path
LoadDict(files ...string) error
// LoadDictStr loading dict file by file path
LoadDictStr(pathStr string) error
// LoadStopWord loading word file by filename
LoadStopWord(fileName ...string) error
// Freq find the frequency, position, existence information of the key
Freq(key string) (float64, string, bool)
// TotalFreq the total number of tokens in the dictionary
TotalFreq() float64
// FreqMap get frequency map
// key: word, value: frequency
FreqMap(text string) map[string]float64
// ConstructSeg return the segment with weight
ConstructSeg(text string) segment.Segments
}
default IDF:
func NewIDF() Relevance {
idf := &IDF{
freqs: make([]float64, 0),
}
idf.StopWord = stop_word.NewStopWord()
return Relevance(idf)
}
implement the interface function
// AddToken add a new word with IDF into the dictionary.
func (i *IDF) AddToken(text string, freq float64, pos ...string) error {
err := i.Seg.AddToken(text, freq, pos...)
i.freqs = append(i.freqs, freq)
sort.Float64s(i.freqs)
i.median = i.freqs[len(i.freqs)/2]
return err
}
// LoadDict load the idf dictionary
func (i *IDF) LoadDict(files ...string) error {
if len(files) <= 0 {
files = i.Seg.GetIdfPath(files...)
}
return i.Seg.LoadDict(files...)
}
// Freq return the IDF of the word
func (i *IDF) Freq(key string) (float64, string, bool) {
return i.Seg.Find(key)
}
....
any problem about that? if no big problem, I'll pr bit by bit to make it possible. 🫡
Description
I see the bm25.go file in path
(hmm/bm25/bm25.go)
, so I wanna ask author any plan on bm25 ? 😃If author had the plan on implementing the bm25, I want to make it. 🫡