mcartright / julien

Toolkit for Information Retrieval research

unordered window background statistics are wrong #19

Closed daltonj closed 11 years ago

daltonj commented 11 years ago

Dirichlet: score = -14.850306955773318 count = 0 cf = 8.6503393E-7 len = 2152 num = 0.0012975509004517164 den = 3652.0

background stats collFreq = 218 collLength = 252013235 docFreq = 115270 numdocs = 528030

annotated node from galago:
score DirichletScoringIterator 699590 false -14.941497193837915 :dirichlet:collectionLength=252013235:documentCount=528155:mu=1500:nodeFrequency=199:w=0.025
lengths StreamLengthsIterator 699590 true 2152 document
extent UnorderedWindowIterator 699590 false ExtentArray:doc=699590:count=0:[]
extents TermExtentIterator 699590 true ExtentArray:doc=699590:count=5:[(84,85),(922,923),(1261,1262),(2019,2020),(2036,2037)] international
extents TermExtentIterator 699590 true ExtentArray:doc=699590:count=14:[(28,29),(75,76),(146,147),(210,211),(244,245),(299,300),(330,331),(354,355),(577,578),(611,612),(686,687),(710,711),(828,829),(958,959)] organized
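For reference, both scores above are consistent with the usual Dirichlet-smoothed log-likelihood, log((count + mu * cf) / (len + mu)), with mu = 1500 and the only difference being the collection frequency of the window (218 vs. 199). The following is a minimal sketch of that arithmetic; the function name is illustrative and is not taken from either codebase.

```python
import math

def dirichlet_score(count, doc_len, coll_freq, coll_len, mu=1500.0):
    """Log of the Dirichlet-smoothed probability of a term or window in a document."""
    cf = coll_freq / coll_len   # background probability of the window
    num = count + mu * cf       # smoothed count
    den = doc_len + mu          # smoothed document length
    return math.log(num / den)

# Values copied from the debug output above; mu = 1500 in both cases.
julien_score = dirichlet_score(count=0, doc_len=2152, coll_freq=218, coll_len=252013235)
galago_score = dirichlet_score(count=0, doc_len=2152, coll_freq=199, coll_len=252013235)
print(julien_score, galago_score)
```

With count = 0 in both systems, the gap between the two scores comes entirely from the 218 vs. 199 background counts.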

mcartright commented 11 years ago

Testing "new york" on AQUAINT. The offending doc, for now, is APW19980810.1040: Julien finds one more match there than Galago does.

mcartright commented 11 years ago

Apparently fixed. The UW operator in Galago computes the maximum position from the end of each extent (which is begin+1 if no end is specified). Julien only tracks begins, hence the "off by 1" issue. The current hack of using (begin+1) in the maximum position calculation fixes this, but it's a band-aid. We need proper extent/span support in Julien.

Fixed this to work with positions by changing the "<=" to a "<".
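To illustrate why the two fixes are equivalent, here is a simplified sketch of an unordered-window counter. It is not the actual Galago or Julien code: the function name, the `end_offset` parameter, and the "advance the list holding the smallest position" loop are assumptions made for the example. With `end_offset=1` each position is treated as the extent (begin, begin+1); with `end_offset=0` only begins are compared. For integer positions, `max_begin + 1 - min_begin <= width` is the same test as `max_begin - min_begin < width`.

```python
def count_unordered_windows(begins_per_term, width, end_offset):
    """Count alignments where every term falls inside a window of `width`.

    begins_per_term: one sorted list of begin positions per term.
    end_offset: 1 to measure from min begin to max end (end = begin + 1),
                0 to measure between begins only.
    """
    if any(len(begins) == 0 for begins in begins_per_term):
        return 0
    idx = [0] * len(begins_per_term)
    matches = 0
    while True:
        current = [b[i] for b, i in zip(begins_per_term, idx)]
        lo, hi = min(current), max(current)
        if (hi + end_offset) - lo <= width:
            matches += 1
        # Advance the term sitting at the smallest position; stop when it runs out.
        k = current.index(lo)
        idx[k] += 1
        if idx[k] == len(begins_per_term[k]):
            return matches

# Two terms at positions 3 and 5, window width 2 (toy data, not from the issue).
toy = [[3], [5]]
begins_only = count_unordered_windows(toy, 2, end_offset=0)
with_ends = count_unordered_windows(toy, 2, end_offset=1)
print(begins_only, with_ends)
```

On the toy input, comparing begins with "<=" reports one extra match relative to the extent-based comparison, which is the same kind of discrepancy described above.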