Calculating Document Weights

"Ranked retrieval requires calculating a 'weight' of each document to use in normalization, so that long documents don't receive 'extra relevance' just because they are longer. The weight of each document is called Ld and can be calculated during the indexing process. Each Ld value is equal to the Euclidean normalization of the vector of wd;t weights for the document, where wd;t = 1 + ln (tft;d) Note that this formula only needs tft;d, which is the number of times a particular term occurs in the document. So to calculate Ld for a document that you just finished indexing, you need to know each term that occurred at least once in the document along with the number of times that term occurred; a HashMap from term to integer, updated as you read each token, can help track this. For each term in the final map, the integer it maps to is tft;d for that term and can be used to calculate wd;t for that term. To normalize those wd;t terms and find Ld, sum the squares of all the wd;t terms, and then take the square root: Ld = qPt (wd;t)2. Each Ld term needs to be written as an 8-byte double in document order to a file called docWeights.bin. Modify DiskIndexWriter so that it creates this file during the indexing process. Modify DiskPositionalIndex so it knows how to open this file and skip to an appropriate location to read a 8-byte double for Ld. Ld values will be used when calculating ranked retrieval scores."

sotheanithsok / Habeas

Calculating Document Weights #56