Cleaning step removes crucial info and needs to be rethought

As reported by @martindholmes, spans that are not directly addressed by the config file lose crucial information during cleaning, and that information is required for weighting and contextualizing. From the email:

Given a config file containing this:

<context match="div[child::span[@data-class='address']]" label="Address field"/>

which (apparently correctly) generates config.xsl like this (simplified a bit):

    <xsl:variable name="ssContextMap" as="map(*)?">
       <xsl:map>
          [...]
          <xsl:map-entry key="'Address field'" select="'ssCtx1'"/>
       </xsl:map>
    </xsl:variable>
    <xsl:template match="div[child::span[@data-class='address']]"
                  priority="1"
                  mode="contextualize">
       <xsl:if test="$verbose">
          <xsl:message>Template #contextualize: Adding @ss-ctx flag to <xsl:value-of select="local-name(.)"/>
          </xsl:message>
       </xsl:if>
       <xsl:copy>
          <xsl:apply-templates select="@*" mode="#current"/>
          <xsl:if test="self::div[child::span[@data-class='address']]">
             <xsl:attribute name="ss-ctx" select="'true'"/>
             <xsl:attribute name="ss-ctx-id" select="'ssCtx1'"/>
          </xsl:if>
          <xsl:apply-templates select="node()" mode="#current"/>
       </xsl:copy>
    </xsl:template>

I get output like this in the tokenized file:

<div>
   <div ss-ctx="true"><a><span ss-pos="22" ss-fid="cr_1871_1054" ss-stem="fur">Fur</span>
      <span ss-pos="23" ss-fid="cr_1871_1054" ss-stem="dealer">Dealer</span></a> (from census),
      <a><span ss-pos="24" ss-fid="cr_1871_1054" ss-stem="furrier">Furrier</span></a>
      (from <span ss-pos="25" ss-fid="cr_1871_1054" ss-stem="street">street</span>
      <span ss-pos="26" ss-fid="cr_1871_1054" ss-stem="directori">directory</span>)</div>
</div>

In other words, neither the @ss-ctx nor the @ss-ctx-id attributes are realized on the <div> element in the tokenized file.

After some debugging, it's clear that this could be resolved by moving the cleaning step to after weighting and contextualizing. Cleaning was originally necessary to keep the tree size minimal during tokenizing, but that's less of a concern now that we process the documents one by one.

We still need to "clean" each document so that the collection is as small as possible by the time it reaches the JSON stage, so we should consider moving cleaning either to the end of the tokenizing process or to a separate step between tokenization and JSON creation. If we do that, we may also be able to retain more information up to the JSON stage, which could allow for better in-page hit finding: since we would only be decorating the document with attributes and adding spans, rather than modifying its structure, there may be ways to record a better location for each hit.
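
To make the proposed reordering concrete, here is a minimal XSLT 3.0 sketch, not the actual staticSearch pipeline. It assumes the cleaning pass strips attributes such as `@data-class` from spans; that cleaning rule and the driver template are hypothetical, and only the context rule is taken from the generated config.xsl quoted in the email. Because contextualizing runs first, the `div` can still be matched through its child span before that span's attribute is dropped.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Both passes copy anything they do not explicitly match. -->
   <xsl:mode name="contextualize" on-no-match="shallow-copy"/>
   <xsl:mode name="clean" on-no-match="shallow-copy"/>

   <!-- Hypothetical driver: contextualize FIRST, clean LAST. -->
   <xsl:template match="/">
      <xsl:variable name="contextualized" as="document-node()">
         <xsl:document>
            <xsl:apply-templates select="node()" mode="contextualize"/>
         </xsl:document>
      </xsl:variable>
      <!-- Cleaning now sees a tree that already carries the ss-* flags. -->
      <xsl:apply-templates select="$contextualized/node()" mode="clean"/>
   </xsl:template>

   <!-- Context rule from the config (verbose messaging omitted). -->
   <xsl:template match="div[child::span[@data-class='address']]"
                 mode="contextualize">
      <xsl:copy>
         <xsl:apply-templates select="@*" mode="#current"/>
         <xsl:attribute name="ss-ctx" select="'true'"/>
         <xsl:attribute name="ss-ctx-id" select="'ssCtx1'"/>
         <xsl:apply-templates select="node()" mode="#current"/>
      </xsl:copy>
   </xsl:template>

   <!-- ASSUMED cleaning rule: drop an attribute not needed downstream.
        The real cleaning templates may remove different things. -->
   <xsl:template match="span/@data-class" mode="clean"/>

</xsl:stylesheet>
```

Given an input such as `<div><div><span data-class="address">…</span></div></div>`, this ordering should leave `@ss-ctx` and `@ss-ctx-id` on the inner `div` while still stripping `@data-class` from the span. If the two passes were swapped, the `div[child::span[@data-class='address']]` pattern would have nothing left to match, which is consistent with the behaviour reported above.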

projectEndings / staticSearch #246