projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection
https://endings.uvic.ca/staticSearch/docs/index.html
Mozilla Public License 2.0

Cleaning step removes crucial info and needs to be rethought #246

Closed (joeytakeda closed this issue 4 months ago)

joeytakeda commented 2 years ago

As reported by @martindholmes, spans that are not directly addressed by the config file lose crucial information that is required for weighting and contextualizing. From email:

Given a config file containing this:

<context match="div[child::span[@data-class='address']]" label="Address field"/>

which (apparently correctly) generates config.xsl like this (simplified a bit):

    <xsl:variable name="ssContextMap" as="map(*)?">
       <xsl:map>
          [...]
          <xsl:map-entry key="'Address field'" select="'ssCtx1'"/>
       </xsl:map>
    </xsl:variable>
    <xsl:template match="div[child::span[@data-class='address']]"
                  priority="1"
                  mode="contextualize">
       <xsl:if test="$verbose">
          <xsl:message>Template #contextualize: Adding @ss-ctx flag to <xsl:value-of select="local-name(.)"/>
          </xsl:message>
       </xsl:if>
       <xsl:copy>
          <xsl:apply-templates select="@*" mode="#current"/>
          <xsl:if test="self::div[child::span[@data-class='address']]">
             <xsl:attribute name="ss-ctx" select="'true'"/>
             <xsl:attribute name="ss-ctx-id" select="'ssCtx1'"/>
          </xsl:if>
          <xsl:apply-templates select="node()" mode="#current"/>
       </xsl:copy>
    </xsl:template>

I get output like this in the tokenized file:

<div>
   <div ss-ctx="true"><a><span ss-pos="22" ss-fid="cr_1871_1054" ss-stem="fur">Fur</span> <span ss-pos="23" ss-fid="cr_1871_1054" ss-stem="dealer">Dealer</span></a> (from census),
      <a><span ss-pos="24" ss-fid="cr_1871_1054" ss-stem="furrier">Furrier</span></a> (from <span ss-pos="25" ss-fid="cr_1871_1054" ss-stem="street">street</span> <span ss-pos="26" ss-fid="cr_1871_1054" ss-stem="directori">directory</span>)</div>
</div>

In other words, neither the @ss-ctx nor the @ss-ctx-id attribute is realized on the <div> element in the tokenized file.

After some debugging, it's clear that this could be resolved if the cleaning step were moved to after the weighting and contextualizing steps. Originally, cleaning was necessary to keep the tree size minimal during tokenizing, but that is less of a concern now that we process the documents one by one.

While we still need to "clean" the document so that the collection is as small as possible when it moves to the JSON, we should consider moving the cleaning stage either to the end of the tokenizing process or to a separate step between tokenization and JSON creation. If we do that, we may also be able to retain more information by the time the document reaches the JSON stage, which could allow for better in-page hit finding: since we would no longer be modifying the document structure, only decorating it with attributes and adding spans, there may be ways to record better locations for each hit.
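
To make the ordering problem concrete, here is a minimal sketch in Python rather than the project's XSLT. It is a toy model, not the staticSearch pipeline: `clean`, `contextualize`, `flagged_divs`, and the sample markup are all hypothetical, and it assumes that cleaning discards the `@data-class` attribute that the context match relies on.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample input; not taken from the real collection.
SOURCE = """<body>
  <div><span data-class="address">12 Fort Street</span></div>
  <div><span data-class="note">Fur Dealer</span></div>
</body>"""

# Stand-ins for the config's context match and the generated context id.
CONTEXT_CHILD_CLASS = "address"
CONTEXT_ID = "ssCtx1"


def clean(root):
    """Toy cleaning pass: strip all attributes from every <span>."""
    root = copy.deepcopy(root)
    for span in root.iter("span"):
        span.attrib.clear()
    return root


def contextualize(root):
    """Toy contextualizing pass: flag each <div> that has a child
    <span> whose data-class equals CONTEXT_CHILD_CLASS."""
    root = copy.deepcopy(root)
    for div in root.iter("div"):
        has_match = any(
            child.tag == "span"
            and child.get("data-class") == CONTEXT_CHILD_CLASS
            for child in div
        )
        if has_match:
            div.set("ss-ctx", "true")
            div.set("ss-ctx-id", CONTEXT_ID)
    return root


def flagged_divs(root):
    """Return the <div> elements that received the context id."""
    return [d for d in root.iter("div") if d.get("ss-ctx-id") == CONTEXT_ID]


source = ET.fromstring(SOURCE)

# Cleaning first removes the attribute the context match depends on.
clean_first = contextualize(clean(source))
# Cleaning last keeps the decoration and still shrinks the spans.
clean_last = clean(contextualize(source))

print(len(flagged_divs(clean_first)))
print(len(flagged_divs(clean_last)))
```

In this model the clean-first order flags no `<div>`, while the clean-last order flags the address `<div>` and still ends up with attribute-free spans, which is the behaviour the reordering is meant to achieve.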

joeytakeda commented 5 months ago

I think this should now be resolved with #284, as things are cleaned only once it's been determined that they can in fact be ignored.