obdurodon / dh_course

Digital Humanities course site

Variables and Distinct-Values Function #459

Closed pickettj closed 3 years ago

pickettj commented 4 years ago

I am updating some XSLT code listing all the locations mentioned in a document to list only the unique ones.

Here is the original code:

                        <td>
                            <ul>
                                <xsl:for-each select="//location">
                                    <li>
                                        <xsl:if test="./@id != ''"><xsl:text>id </xsl:text>
                                        <xsl:value-of select="./@id"/>
                                        <xsl:text>: </xsl:text></xsl:if>
                                        <xsl:value-of select="."/>
                                    </li>
                                </xsl:for-each>
                            </ul>
                        </td>

To list only the distinct values, I first assigned a variable:

 <xsl:variable name="locations" select="distinct-values(//location)"/>

Then modified the previous code to feed in the results from that variable:

                              <xsl:for-each select="$locations">
                                    <li>
                                        <xsl:if test="./@id != ''"><xsl:text>id </xsl:text>
                                        <xsl:value-of select="./@id"/>
                                        <xsl:text>: </xsl:text></xsl:if>
                                        <xsl:value-of select="."/>
                                    </li>
                                </xsl:for-each>

This returns the error: "The required item type of the first operand of '/' is node(); supplied expression (.) has item type xs:anyAtomicType."

I realize this is the same reason you can't just write distinct-values(//location)/@id, i.e. you can't move down a non-existent tree from a value sequence. Does this mean that in pulling the distinct-values() the associated tree data (e.g. attribute) information is stripped out? Perhaps there is a way to return a sequence of node locations to get around this issue?

@djbpitt , any hints? Hopefully this will be illustrative of some of the issues coming up at this stage in the course...

djbpitt commented 4 years ago

@pickettj Your diagnosis is correct: distinct-values() returns atomic values and atomic values don’t have attributes.
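One way to keep the nodes (and so their attributes) while still getting one entry per distinct string value is to group instead of calling distinct-values(). The following is only a minimal sketch of that swapped-in technique, not a drop-in replacement for the code above: it assumes XSLT 2.0 or later, untyped <location> elements anywhere in the document, and a bare <ul> as the output wrapper.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="3.0">
    <xsl:output method="xml" indent="yes"/>
    <xsl:template match="/">
        <ul>
            <!-- One group per distinct string value of <location> -->
            <xsl:for-each-group select="//location" group-by="string(.)">
                <!-- current-group() holds the element nodes themselves,
                     so @id is still reachable; empty @id values are ignored -->
                <xsl:variable name="ids" as="xs:string*"
                    select="distinct-values(current-group()/@id[. != '']/string())"/>
                <li>
                    <xsl:if test="exists($ids)">
                        <xsl:value-of select="$ids" separator=", "/>
                        <xsl:text>: </xsl:text>
                    </xsl:if>
                    <xsl:value-of select="current-grouping-key()"/>
                </li>
            </xsl:for-each-group>
        </ul>
    </xsl:template>
</xsl:stylesheet>
```

Inside the loop, current-grouping-key() is the atomic value that distinct-values() would have returned, while current-group() is the sequence of <location> nodes that share it.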

I don’t entirely understand the task, though. Your original code processes all <location> elements that have an @id attribute with a non-null value, and you can do that without the <xsl:if> by using //location[@id != '']. That doesn’t necessarily get unique <location> values, though; it just gets all <location> values that have a non-null @id. If there are unique <location> values that have null @id attributes or that omit that attribute, it will miss them. And if two <location> elements have the same @id value, it will get them both, even though they may not be unique (well, the @id value isn’t unique, but the string value of the element may or may not be).

Can you provide a bit of XML and the desired output?

pickettj commented 4 years ago

Let me clarify the task:

  1. I have a document with locations marked up, which may or may not specify the id no. E.g. <location>Pittsburgh</location> and <location id = "pitt">Pittsburgh</location> are both possibilities.
  2. At the top of my reading view of the document, I want to have a quick list of all the unique locations mentioned in the document. (This is easy enough, and accomplished in the code above.)
  3. If the <location> element has an id no entered, I would like to display that as well.

Here's a screen shot of how it looks now, which is pretty close to what I want, aside from the fact that right now it just lists all of the locations rather than just the unique ones.

[image: screenshot of the current locations list]

link to the code

djbpitt commented 4 years ago

@pickettj How about the following:

Input

<locations>
    <location>Pittsburgh</location>
    <location>Philadelphia</location>
    <location id="PIT">Pittsburgh</location>
    <location id="PHL">Philadelphia</location>
    <location>The Burgh</location>
    <location>Philly</location>
</locations>

XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math" exclude-result-prefixes="xs math"
    version="3.0" xmlns="http://www.w3.org/1999/xhtml">
    <xsl:output method="xml" indent="yes" doctype-system="about:legacy-compat"/>
    <xsl:variable name="locations" as="element(location)+" select="//location"/>
    <xsl:template match="/">
        <html>
            <head>
                <title>Locations</title>
            </head>
            <body>
                <ul>
                    <xsl:for-each select="distinct-values($locations)">
                        <xsl:variable name="id" select="$locations[. eq current()]/@id"
                            as="attribute(id)?"/>
                        <li>
                            <xsl:value-of
                                select="
                                    if ($id) then
                                        concat($id, ': ')
                                    else
                                        (),
                                    ."
                            />
                        </li>
                    </xsl:for-each>
                </ul>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>

Output

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  SYSTEM "about:legacy-compat">
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <title>Locations</title>
   </head>
   <body>
      <ul>
         <li>PIT: Pittsburgh</li>
         <li>PHL: Philadelphia</li>
         <li>The Burgh</li>
         <li>Philly</li>
      </ul>
   </body>
</html>

This assumes that no location name (the string value of a <location> element) is associated with more than one @id. If the same location content can be associated with more than one @id (whether it’s more than one instance of the same @id value or different @id values), the XSLT will require modification. Let me know if that’s the case.

pickettj commented 4 years ago

Thanks very much, @djbpitt . I have it working now (with some tweaks) as raw code, and almost have it working reconfigured as a function; you can see where I am here.


One problem I encountered was that if I had marked up more than one instance of a location, I got this error message: "A sequence of more than one item is not allowed as the value of variable $id (@id="68", @id="68")." I understood the issue to be that when <xsl:variable name="id" select="$locations[. eq current()]/@id" fired, there was more than one entry in the $locations variable that matched the current value in the for-each loop. I solved this by saying, "just look at the first one of those," since all matches would be the same: <xsl:variable name="id" select="$locations[. eq current()][1]/@id".

Is this how you would do it / do I understand that logic correctly?


The function version is close to working, and it works fine if my document contains at least one <location> element:

 <xsl:param name="input" as="element(*)*"/>
        <xsl:for-each select="distinct-values($input)">
            <!-- assign local variable -->
            <xsl:variable name="id" select="$locations[. eq current()][1]/@id"
                as="attribute(id)?"/>
            <li>
                <xsl:value-of
                    select="
                    if ($id) then
                    concat($id, ': ')
                    else
                    (),
                    ."
                />
            </li>
        </xsl:for-each>
    </xsl:function>

And then I just call the function in-line for locations, and can re-use it for other similar elements: <xsl:sequence select="prv:unique_elements($locations)"/>.

The main issue with my function is that, as currently written, it cannot accept empty input (e.g. if I have not marked up any <location> elements, I get the error message "An empty sequence is not allowed as the result of a call to prv:unique_elements#1"). I tried to remedy this using a strategy similar to what I did with the variables, by defining the parameter as <xsl:param name="input" as="element(*)*"/>. I thought that would mean "accept a sequence of 0 or more elements by any name", since as="element(location)*" worked for a similar problem with the global variable.

djbpitt commented 4 years ago

@pickettj I'll respond to the two questions (dealing with duplicate @id nodes; function) separately. This is about the duplicate @id attributes.

I can think of two reasons why selecting the first one might be risky:

  1. You are assuming that if there are multiple values, they will all be the same. If you are using Relax NG or Schematron validation to enforce that, it’s a safe assumption, although still perhaps an unnecessary one, so if there’s an error in your Relax NG or your Schematron, you wouldn’t be backing yourself up. And if you aren’t enforcing uniqueness in the schema, then there could be different values, and you’d probably want to know about that. Whatever you do with the XSLT, if this were me, I would ensure that I was enforcing uniqueness with a schema rule.
  2. Your XPath expression $locations[. eq current()][1]/@id finds the first <location> in document order that matches the current distinct value and gets its @id. On top of the issue of possibly different @id values, if you have, in document order, <location>Pittsburgh</location> and <location id="PIT">Pittsburgh</location>, your XPath expression will not retrieve the @id. The reason is that you’re taking only the first of possibly several <location> elements, and if the first one doesn’t happen to have an @id, but a later one does, you won’t see the later @id attributes.

You might want to try something like rounding up all the matching <location> elements for each distinct value, getting all of the @id values for those, and getting the distinct values of those. I think the three options will be that 1) there is only one, 2) there is none, 3) there is more than one (which would be an XML error). You can check for those and handle them differently, as necessary. XSLT 3.0 has a new try/catch mechanism for error trapping; see https://www.w3.org/TR/xslt-30/#try-catch. Or, since it’s just you, you could let bad data throw an error and stop the transformation, whereupon you fix it and rerun.
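The approach in the paragraph above could be sketched roughly as follows. This is a hedged illustration rather than tested course code: the $name and $ids variable names and the wording of the error message are made up for the example, empty @id values are treated the same as missing ones, and the "more than one" case simply stops the transformation with <xsl:message terminate="yes"> instead of using try/catch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xs" version="3.0">
    <xsl:output method="xml" indent="yes"/>
    <xsl:variable name="locations" as="element(location)*" select="//location"/>
    <xsl:template match="/">
        <ul>
            <xsl:for-each select="distinct-values($locations)">
                <xsl:variable name="name" select="."/>
                <!-- Distinct non-empty @id values across ALL matching <location>
                     elements, not just the first one in document order -->
                <xsl:variable name="ids" as="xs:string*"
                    select="distinct-values($locations[. eq $name]/@id[. != '']/string())"/>
                <xsl:choose>
                    <!-- Case 3: conflicting @id values; report and stop -->
                    <xsl:when test="count($ids) gt 1">
                        <xsl:message terminate="yes"
                            select="concat('Conflicting @id values for ', $name, ': ',
                                    string-join($ids, ', '))"/>
                    </xsl:when>
                    <!-- Cases 1 and 2: exactly one @id value, or none -->
                    <xsl:otherwise>
                        <li>
                            <xsl:if test="exists($ids)">
                                <xsl:value-of select="$ids"/>
                                <xsl:text>: </xsl:text>
                            </xsl:if>
                            <xsl:value-of select="$name"/>
                        </li>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:for-each>
        </ul>
    </xsl:template>
</xsl:stylesheet>
```

Because the @id values are collected from every matching element, a <location> without an @id that precedes one with an @id no longer hides the later attribute.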

Once your XSLT reaches a certain level of complexity, you might want to look into using XSpec, which is a framework for unit testing. The idea is that you write a bunch of tests that pass your functions or templates good and bad data to verify that you get what you expect, and as with our Schematron work in class, the hard part is anticipating all of the types of bad data and writing rules to trap them. <oXygen/> has plug-ins to support XSpec, or you can run it from the command line. To get started, see:

  1. https://www.oxygenxml.com/doc/versions/22.0/ug-editor/topics/xslt-unit-test-xspec.html
  2. https://www.oxygenxml.com/doc/versions/22.0/ug-editor/topics/xspec-helper-view-addon.html
  3. https://www.oxygenxml.com/events/2018/webinar_xspec_unit_testing_for_xslt_and_schematron.html

djbpitt commented 4 years ago

@pickettj This message is about your function, which looks good in general. Here is a thought about the issue you report:

You didn’t include the first line in your posting, so I can’t see whether you’ve specified a datatype for the value of the function, but if not, you’ll want to do that. Make it as restrictive as you can, since you want to raise an error if it returns something of a type other than what you’re expecting. If you make the return value optional, that should remove the error. For example:

<xsl:function name="" as="element(xhtml:li)*">

This lets it return zero or more <li> elements in the HTML namespace (assuming you’ve included a namespace declaration that binds the prefix xhtml: to the HTML namespace). With respect to the @as attribute on your <xsl:param>, I think element()* would have the same meaning as element(*)*, and if that’s the case, I'd use the briefer version.
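Putting those pieces together, a complete function might look something like the sketch below. Several things here are assumptions made only so that the example is self-contained: the namespace URI bound to prv: is a placeholder (the real one isn't shown in this thread), the @id lookup uses the $input parameter rather than the global $locations variable so that the function can be reused for other elements, and multiple @id values are simply joined with commas rather than treated as an error.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    xmlns:prv="http://example.com/ns/prv"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xs xhtml prv" version="3.0">
    <!-- The prv: namespace URI above is a hypothetical placeholder -->
    <xsl:output method="xml" indent="yes"/>

    <!-- Return type allows zero <li> elements, so empty input is not an error -->
    <xsl:function name="prv:unique_elements" as="element(xhtml:li)*">
        <xsl:param name="input" as="element()*"/>
        <xsl:for-each select="distinct-values($input)">
            <!-- Look up @id values among the elements passed in -->
            <xsl:variable name="ids" as="xs:string*"
                select="distinct-values($input[. eq current()]/@id[. != '']/string())"/>
            <!-- Literal <li> is in the XHTML namespace because of the default xmlns -->
            <li>
                <xsl:if test="exists($ids)">
                    <xsl:value-of select="$ids" separator=", "/>
                    <xsl:text>: </xsl:text>
                </xsl:if>
                <xsl:value-of select="."/>
            </li>
        </xsl:for-each>
    </xsl:function>

    <xsl:template match="/">
        <ul>
            <!-- With no <location> elements this yields an empty sequence, not an error -->
            <xsl:sequence select="prv:unique_elements(//location)"/>
        </ul>
    </xsl:template>
</xsl:stylesheet>
```

With no matching elements the call returns an empty sequence and the <ul> comes out empty, so you may prefer to wrap the list in a test such as <xsl:if test="exists(//location)">.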