Open brjg opened 8 years ago
I agree that type PostingList []Posting
has a lot to be optimized. The current simplest definition is only for keeping the code simple, so we can focus on more important topics like https://github.com/topicai/weakand/issues/1 and https://github.com/topicai/weakand/issues/2.
Indeed, as each call to append
might cause memory allocation/deallocation, the double-append
solution as posted above
idx.Ivt[term] = append(idx.Ivt[:insertPos], append(Posting{DocId: did, TF: tf}, idx.Ivt[insertPos:]...)...)
might not necessarily an optimal solution faster than the current sort
-based one.
A solution to minimize allocation/deallocation as well as memory copy is a data structure called "chained vectors", which is a sorted vector of sorted vectors.
Anyway I agree that the data structure of posting list is a very important topic for search engines. How about we try to finish https://github.com/topicai/weakand/issues/1 and https://github.com/topicai/weakand/issues/2, then investigate as many choices we can do for this?
In function 'add', each time append a posting. In function 'Add', for each changed term, sort the postings according to their DID to maintain sorted.
In 'add' function, instead of append the new posting at the end, if use binary search to find the loc to insert, we maintain the sorted order by nature. Then we don't need the work in 'Add' function. It will become more efficient.
something like this around https://github.com/topicai/weakand/blob/develop/index.go#L75: