projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection
https://endings.uvic.ca/staticSearch/docs/index.html
Mozilla Public License 2.0
46 stars 21 forks source link

Consider more robust scoring mechanisms for relevance of multiple keywords #299

Open joeytakeda opened 2 months ago

joeytakeda commented 2 months ago

The issue of multiple term queries and ranking has come up a few times— E.g. if for a query for X and Y across two documents, A and B, if in Doc A, X = 40 and Y=0, but in Document B, X=15 and Y=10 , then Document A is ranked ahead of document B, even though it doesn't contain both query terms.

Of course, this can be solved by precise matching (+X +Y), but sometimes you don't necessarily want all of the query terms, but you would like to see the documents that have both of them come up first. This is the more intuitive people use searching, I'd argue — if I went to a library catalogue and looked up Keats Byron Shelley, I'd expect to see books about the Romantics above the ones about Keats or Byron or Shelley.

In part, TF-IDF was meant to help with that and it is useful for normalizing relevance in a collection of documents of varying lengths, but the pure sum of the TF-IDF measures doesn't really rectify the issue. So, doing some research has lead me to believe we might want to try implementing a third scoring algorithm—namely, BM25.

This is what ElasticSearch uses as their primary relevance mechanism and, thanks to the aforelinked guide, I was able to quickly put together the algorithm (based on the TF-IDF implementation we already had):

    <xsl:function name="hcmc:returnBM25" as="xs:double">
        <xsl:param name="rawScore" as="xs:integer"/>
        <xsl:param name="stemDocsCount" as="xs:integer"/>
        <xsl:param name="thisDocUri" as="xs:string"/>
        <!--Compute the average document length-->
        <!--TODO: While getTotalTermsInDoc is memoized, 
            this should be moved outside
            of the function-->
        <xsl:variable name="averageDocLength" 
            select="sum($tokenizedUris ! hcmc:getTotalTermsInDoc(.))
                        div 
                    count($tokenizedUris)" as="xs:double"/>

        <!--Get the total terms in the document-->
        <xsl:variable name="totalTermsInDoc" 
            select="hcmc:getTotalTermsInDoc($thisDocUri)" as="xs:integer"/>

        <!--The ratio of terms in this doc versus the average document length-->
        <xsl:variable name="lengthRatio" select="$totalTermsInDoc div $averageDocLength"/>

        <!--Now calculate the modified inverse document frequency —
            see https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
            If you had 10 documents and 3 in which this stem appeared, you'd get:
            ln(1 + ((10 - (3 + 0.5))/(3 + 0.5)))
            => ln(1 + (6.5 / 3.5))
            => ln(1 + 1.857) => ln(2.857) => 1.050

            [Note that math:log is, in XSLT, the natural log]
        -->
        <xsl:variable name="idf" 
            as="xs:double"
            select="math:log(1 + 
                (($tokenizedDocsCount - ($stemDocsCount + 0.5))
                    div 
                 ($stemDocsCount + 0.5)
                 ))"/>

        <!--For b and k, see          
            https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html.
            -->
        <!--"b" is a constant, which "controls to what degree document length normalizes tf values" -->
        <xsl:variable name="b" select="0.75"/>

        <!--"k1" is a constant, which "controls non-linear term frequency normalization (saturation)"-->
        <xsl:variable name="k1" select="1.2"/>

        <!--Splitting the numerator and denominator for readability-->
        <xsl:variable name="numerator" 
            select="$rawScore * ($k1 + 1)"/>
        <xsl:variable name="denominator" 
            select="$rawScore + $k1 * (1 - $b + $b * $lengthRatio)"/>

        <!--Now the final BM25 relevance score-->
        <xsl:variable name="BM25" select="$idf * ($numerator div $denominator)" as="xs:double"/>
        <xsl:message use-when="$verbose">Calculated BM25: <xsl:sequence select="$BM25"/></xsl:message>
        <xsl:sequence select="$BM25"/>
    </xsl:function>

I'm going to add this to a branch and start playing around with to see how it might improve things — some very quick tests yield good results (at least compared to tf-idf), but it's hard to test without running it on a larger document collection.