sotheanithsok / Habeas

A complete implementation of a large-scale search engine, including on-disk indexing, multiple query options, and user interfaces.
MIT License

Ranked Retrievals #58

Closed: jblacklock closed this issue 5 years ago

jblacklock commented 5 years ago

"This is the biggest new requirement. Your main program must operate in two modes: Boolean query mode, and ranked query mode. In ranked query mode, you must process a query without any Boolean operators and return the top K = 10 documents satisfying the query. Use the 'term at a time' algorithm as discussed in class:

  1. For each term $t$ in the query:
     (a) Calculate $w_{q,t} = \ln\left(1 + N/\mathrm{df}_t\right)$.
     (b) For each document $d$ in $t$'s postings list:
         i. Acquire an accumulator value $A_d$ (the design of this system is up to you).
         ii. Calculate $w_{d,t} = 1 + \ln\left(\mathrm{tf}_{t,d}\right)$.
         iii. Increase $A_d$ by $w_{d,t} \times w_{q,t}$.
  2. For each non-zero $A_d$, divide $A_d$ by $L_d$, where $L_d$ is read from the docWeights.bin file.
  3. Select and return the top K = 10 documents by largest $A_d$ value. (Use a binary heap priority queue to select the largest results; do not sort the accumulators.)
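
The three steps above can be sketched in Python roughly as follows. This is only an illustration, not the project's actual code: the function name `rank_documents`, the in-memory `index` layout (term mapped to a list of `(doc_id, tf)` pairs), and the `doc_weights` mapping standing in for docWeights.bin are all assumptions made for the example.

```python
import heapq
import math
from collections import defaultdict


def rank_documents(query_terms, index, doc_weights, num_docs, k=10):
    """Term-at-a-time ranked retrieval (illustrative sketch).

    index       -- hypothetical dict: term -> list of (doc_id, tf) postings
    doc_weights -- hypothetical mapping: doc_id -> L_d
    num_docs    -- N, the number of documents in the corpus
    Returns up to k (doc_id, score) pairs, highest score first.
    """
    # Accumulators A_d; Python floats are 8-byte IEEE 754 doubles.
    accumulators = defaultdict(float)

    # Step 1: process one query term at a time.
    for term in query_terms:
        postings = index.get(term, [])
        if not postings:
            continue  # term not in the index, contributes nothing
        df_t = len(postings)
        w_qt = math.log(1.0 + num_docs / df_t)
        for doc_id, tf in postings:
            w_dt = 1.0 + math.log(tf)
            accumulators[doc_id] += w_dt * w_qt

    # Step 2: divide each non-zero accumulator by L_d.
    # Scores are negated because heapq implements a min-heap.
    heap = [
        (-a_d / doc_weights[doc_id], doc_id)
        for doc_id, a_d in accumulators.items()
        if a_d != 0.0
    ]

    # Step 3: build a binary heap and pop the k largest scores,
    # rather than sorting every accumulator.
    heapq.heapify(heap)
    top = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)
        top.append((doc_id, -neg_score))
    return top
```

Building the heap is linear in the number of non-zero accumulators, and each of the K pops is logarithmic, so this avoids a full sort when only ten results are needed.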

Use 8-byte floating point numbers for all the calculations.

(print ranked retrieval results: Please print the title of each document returned from a ranked retrieval, as well as the final accumulator value for that document.)"
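
For the 8-byte floating point requirement and the $L_d$ lookup in step 2, one possible approach is sketched below. The issue does not specify the layout of docWeights.bin, so this example assumes a flat array of big-endian 8-byte doubles indexed by document ID; the helper names `read_doc_weight` and `format_result` are hypothetical.

```python
import struct

DOUBLE_SIZE = 8  # bytes per 8-byte floating point value


def read_doc_weight(path, doc_id):
    """Read L_d for one document.

    Assumed layout (not confirmed by the issue): the file holds one
    big-endian 8-byte double per document, in document-ID order.
    """
    with open(path, "rb") as f:
        f.seek(doc_id * DOUBLE_SIZE)
        data = f.read(DOUBLE_SIZE)
    if len(data) != DOUBLE_SIZE:
        raise IndexError(f"no weight stored for document {doc_id}")
    (weight,) = struct.unpack(">d", data)
    return weight


def format_result(title, score):
    """Format one ranked result: document title plus final accumulator value."""
    return f"{title} (score: {score:.6f})"
```

Seeking directly to `doc_id * 8` keeps the lookup constant-time without loading the whole file. If the real file uses a different byte order or includes a header, the format string and offset would need to change accordingly.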