As reported by @martindholmes, spans that are not directly addressed by the config file lose crucial information that is required for weighting and contextualizing. From email:
In other words, neither the @ss-ctx nor the @ss-ctx-id attributes are
realized on the <div> element in the tokenized file.
After some debugging, it's clear that this could be resolved if the cleaning step were moved to after the weighting and contextualizing. Originally, cleaning was necessary to keep the tree size minimal during tokenizing, but that's less of a concern now that we process the documents one by one.
While we still need to "clean" the document so that the collection is as small as possible when it moves to the JSON, we should consider moving the cleaning stage either to the end of the tokenizing process or to a separate step between tokenization and JSON creation. If we do that, we may also be able to retain more information by the time the document reaches the JSON stage, which could allow for better in-page hit finding: since we would only be decorating the document with attributes and adding spans, rather than modifying its structure, there may be ways to record better locations for each hit.
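To make the ordering problem concrete, here is a toy sketch in Python. It is not the project's actual code: the `clean` and `contextualize` functions, the `CONTEXT_CLASSES` mapping, and the assumption that cleaning strips the very attribute the context rule matches on are all hypothetical stand-ins. Only the `ss-ctx` attribute name comes from the report above.

```python
import xml.etree.ElementTree as ET

SOURCE = '<body><div class="note"><p>some text</p></div></body>'

# Hypothetical config: elements with class "note" define a context.
CONTEXT_CLASSES = {"note": "Notes"}


def clean(root):
    """Stand-in for the cleaning step: drop every attribute that is
    not one of our own ss-* decorations (an assumption about what
    cleaning removes)."""
    for el in root.iter():
        for name in list(el.attrib):
            if not name.startswith("ss-"):
                del el.attrib[name]
    return root


def contextualize(root):
    """Stand-in for the contextualizing step: add ss-ctx to any
    element whose class matches a configured context."""
    for el in root.iter():
        label = CONTEXT_CLASSES.get(el.get("class", ""))
        if label is not None:
            el.set("ss-ctx", label)
    return root


def run(steps):
    """Apply the steps in order and return the <div>'s attributes."""
    root = ET.fromstring(SOURCE)
    for step in steps:
        root = step(root)
    return dict(root.find("div").attrib)


cleaned_first = run([clean, contextualize])
cleaned_last = run([contextualize, clean])
```

Under these assumptions, `cleaned_first` has no `ss-ctx` attribute because the matching information is gone before the context rule runs, while `cleaned_last` keeps `ss-ctx` and still ends up with the smaller, cleaned element.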