projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection
https://endings.uvic.ca/staticSearch/docs/index.html
Mozilla Public License 2.0
46 stars 21 forks source link

We should be using XPath map/array handling rather than serializing to/from XML #239

Open joeytakeda opened 2 years ago

joeytakeda commented 2 years ago

In json.xsl, we create stash the created map in a variable that are then serialized using xml-to-json(). E.g.:

https://github.com/projectEndings/staticSearch/blob/c1dbd48281dfec376a3da6294a1469e384d489da/xsl/json.xsl#L204-L209

That map is not a map{} in the XPath sense, but an XML structure in the JSON namespace:

https://github.com/projectEndings/staticSearch/blob/c1dbd48281dfec376a3da6294a1469e384d489da/xsl/json.xsl#L322-L354

However, there's no need to do that since we could use <xsl:map> and set the output method to json:

 <xsl:result-document href="{$outDir}/stems/{$stem}{$versionString}.json" method="json">
       <xsl:call-template name="makeMap"/>
</xsl:result-document>
<xsl:map>
     <xsl:map-entry key="'docId'" select="string($thisDoc/@id)"/>
     <xsl:map-entry key="'docUri'" select="string($thisDoc/@data-staticSearch-relativeUri)"/>
<!--[...]-->
</xsl:map>

I'm not sure if there was a reason why we didn't do this in the first place, but I can't think of any reason not to do so at this point. It may improve performance slightly as well as possibly lessen memory use. (There's some indication that constructing sequences in the result-document allows Saxon more leverage when it comes to releasing memory; I don't imagine it would be significant, but worth testing anyway.) In any case, it's a bit cleaner and feels a bit better to use built-in functions.

The only annoying bit is that there is no such thing (yet) as an <xsl:array> instruction[1], so we would still need to stash all of the context arrays in a variable in order to create an array using the array{} constructor, but that's simple enough.

-- [1]: They are proposing such a thing for XSLT 4.0: https://www.saxonica.com/html/documentation11/xsl-elements/array.html

martindholmes commented 2 years ago

I'll be intrigued to see if this makes any difference. I must admit that I've never really understood what happens when you use method="json" and what the consequences of @json-node-output-method actually are. I would guess that by this time, xml-to-json() is very solid and well-optimized, and the construction of an in-memory XML tree precursor is also pretty optimal so there would be little to gain, but it would certainly be nice to reduce the memory requirement here if it works.

joeytakeda commented 2 years ago

I'm going to experiment with this in branch iss239-json

joeytakeda commented 2 years ago

I have now done this in the iss239-json branch, but my tests are so far inconclusive.

In my local copy, I've added the repeat option and the -t option to the JSON step:

    <java classpath="${ssSaxon}" classname="net.sf.saxon.Transform" failonerror="true">
      <arg value="-xsl:${ssBaseDir}/xsl/json.xsl"/>
      <arg value="-s:${ssBaseDir}/xsl/json.xsl"/>
      <arg value="-repeat:6"/>
      <arg value="-t"/>
      <arg value="--suppressXsltNamespaceCheck:on"/>
    </java>

I've attached the report for the test set to this ticket (with the respective branch names in the file). I'm going to test now with just the English TEI guidelines to see if those results are more conclusive.

json_dev.txt json_iss239-json.txt

martindholmes commented 2 years ago

It's not making much difference. Does it make for more human-readable, maintainable code in your view?

joeytakeda commented 2 years ago

I think it is, but at the same time, <j:map> has the advantage of having the j:array[@key], which is much nicer than stuffing a bunch of stuff in a variable to put in the array{} constructor. So I think it may be worthwhile to hold off on this (switching the code was fairly trivial, so it's fine to throw it away) until <xsl:array> is part of the spec.