topicai / weakand

6 stars 2 forks source link

Index.go: question around line 75 #3

Open brjg opened 8 years ago

brjg commented 8 years ago

In function 'add', each time append a posting. In function 'Add', for each changed term, sort the postings according to their DID to maintain sorted.

In 'add' function, instead of append the new posting at the end, if use binary search to find the loc to insert, we maintain the sorted order by nature. Then we don't need the work in 'Add' function. It will become more efficient.

something like this around https://github.com/topicai/weakand/blob/develop/index.go#L75:

for term, tf := range d.Terms {
    insertPos = sort.Search(len(idx.Ivt[term]), func(i int) bool {return idx.Ivt[term][i].DocId >= did})
    idx.Ivt[term] = append(idx.Ivt[:insertPos], append(Posting{DocId: did, TF: tf}, idx.Ivt[insertPos:]...)...)
    // changed[term]++
}
yitopic commented 8 years ago

I agree that type PostingList []Posting has a lot to be optimized. The current simplest definition is only for keeping the code simple, so we can focus on more important topics like https://github.com/topicai/weakand/issues/1 and https://github.com/topicai/weakand/issues/2.

Indeed, as each call to append might cause memory allocation/deallocation, the double-append solution as posted above

idx.Ivt[term] = append(idx.Ivt[:insertPos], append(Posting{DocId: did, TF: tf}, idx.Ivt[insertPos:]...)...)

might not necessarily an optimal solution faster than the current sort-based one.

A solution to minimize allocation/deallocation as well as memory copy is a data structure called "chained vectors", which is a sorted vector of sorted vectors.

Anyway I agree that the data structure of posting list is a very important topic for search engines. How about we try to finish https://github.com/topicai/weakand/issues/1 and https://github.com/topicai/weakand/issues/2, then investigate as many choices we can do for this?