obdurodon / dh_course

Digital Humanities course site

Using tokenize() with Schematron #177

zme1 closed this issue 5 years ago

zme1 commented 5 years ago

I started working on the second Schematron assignment and quickly ran into the problem of how to count hyphens and spaces. I'm not quite sure how to write an XPath expression that does exactly that, so I searched the internet for answers. I only found examples using XSLT, where one can declare variables. Currently, I am thinking that I need the count(), tokenize(), and string() functions. Any help would be sincerely appreciated.

Here is what I have so far:

<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
    xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern>
        <rule context="sentence">
            <assert test="count(tokenize(string('-')))">The number of hyphens must match in orth, translit and ilg.</assert>
        </rule>
    </pattern>
</sch:schema>

zme1 commented 5 years ago

count(), tokenize(), and string functions are all part of XPath, which means that they're all available inside Schematron, just as they are in XSLT or XQuery. We used the count() and sum() functions in class today; you can use any other XPath function the same way.

In your sample, you've wrapped string() around a one-character string that consists of a single hyphen, so the output of the string() function will be the same as the input: it will convert the string you already have into a string that's identical to it—that is, it doesn't do what you want. Your syntax for tokenize() is problematic because tokenize() requires two arguments: the string that you're breaking into pieces and the regex that you're using to identify where to break. You'll want to look this up in Michael Kay (concentrate on the examples) or w3schools, which is what we do when we use functions we haven't used before, too. We also use tokenize() in the Skyrim example, to which we've linked in the Schematron section of our main course page, so you can see it in action there, as well.

At the moment, your tokenize() function has only one argument, so you'll raise an error. But you're on the right track: if you divide "blah-blah-blah" into pieces by tokenizing on a hyphen and then count the pieces, you'll get a count of three. There are actually two hyphens, but you can get equally useful results from counting pieces as you get from counting hyphens, since you need a comparison, and not an exact count. That is, if each tier has three tokens, that's as good as verifying that each tier has two hyphens.
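
A minimal diagnostic sketch of the two-argument form of tokenize(), written as a small XSLT 2.0 stylesheet. The sample string is the one from the paragraph above; the stylesheet ignores its input, so it can be run against any XML document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <!-- First argument: the string to split.
             Second argument: the regex that marks where to split. -->
        <xsl:value-of select="count(tokenize('blah-blah-blah', '-'))"/>
        <!-- Outputs 3: two hyphens divide the string into three pieces. -->
    </xsl:template>
</xsl:stylesheet>
```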

And comparison is the next step: your test can't just fix the tokenize() syntax to count the number of hyphens in one of the tiers; it has to compare that to a count of the hyphens in another tier, and test whether the counts are equal. There are several ways to approach this problem (since there are multiple tiers), but one way to think about it is that if you want to verify that the number of hyphens in the tiers is consistent, you need to compare those counts to one another.

You did some comparisons of stooge data in class today, so you have some ideas about how to think about translating the human expression of this type of problem into Schematron and XPath. But this is a harder problem than the one with the stooges, so we'd suggest turning it into a bunch of smaller, easier problems and approaching it step by step. The first step is making sure you can get tokenize() to return the type of information you need, and you can do that by experimenting with it in the XPath browser box in the upper left corner of the <oXygen/> interface or by writing a short XSLT stylesheet to output some diagnostic information. Once you can tokenize one of the tiers, you'll want to work on comparing the counts of just two tiers, whether for hyphens or for spaces. Once you have that working, you can expand the task to include all of the comparisons. For what it's worth, that's how we develop our own Schematron in our own projects: we get the XPath working first and then we incorporate it into the Schematron shell.
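
As a rough sketch of the "just two tiers" step, the rule below compares the number of hyphen-delimited pieces in <orth> and <translit>. It assumes each <sentence> has exactly one <orth> child and one <translit> child. It covers only hyphens and only one pair of tiers, so it is a starting point and not a complete solution.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <sch:rule context="sentence">
            <!-- Relative paths: orth and translit are children of the
                 sentence currently being checked. -->
            <sch:assert test="count(tokenize(orth, '-')) eq count(tokenize(translit, '-'))">
                The number of hyphens in orth must match the number of hyphens in translit.
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

The test compares piece counts instead of hyphen counts, which is enough here because only equality between the tiers matters.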

Note that we've simplified the problem by doing the comparisons among the entire text of the tiers. You don't have to compare word by word; that's possible, but very difficult, and if you can do the comparison just on the entire text of the tiers, that's a good outcome.

zme1 commented 5 years ago

I am having similar problems, but this is what I have down for my code.

<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
    xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern>
        <rule context="sentence">
            <assert test="//orth[count(string('-'))] eq //translit[count(string('-'))] and //igl[count(string('-'))]"> The number of hyphens in orth must match the number of hyphens in translit and in igl.</assert>
        </rule>
    </pattern>
</sch:schema>

zme1 commented 5 years ago

When we have to develop a complex XPath, we find it easiest to write the parts separately and then glue them together. For this task, for example, we'd start by figuring out (experimenting in the XPath browser box in the upper left corner of the <oXygen/> interface) how to count the hyphens or spaces in one of the tiers. There are a few different ways to do that. Once you can count one of the tiers, you can build on that to count and compare two of them, and then more, as needed.

In the example above, we'd also specify the <sentence> element as the value of @context, but we'd look only at the children of the one sentence, that is, at the one <orth> child, the one <translit> child, and the one <ilg> child (and you have a typo in that last one; you've written "igl" instead of "ilg"—it stands for "interlinear gloss"). You've begun your paths with double-slashes, which means that you're ignoring the current context and rounding up all of the elements of those names in the entire document. That will do what you want as long as you have only one sentence, but if you have more, it will break in two ways. First, you'll be trying to compare the hyphens or spaces across completely different sentences. Second, you've used value comparison ("eq"), which is fine as long as you're comparing only one thing to one thing, but if you have multiple sentences, you'll have multiple elements in the document of each of these types, and you'll throw an error as a result.
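
One way to see the difference between the two kinds of path is a diagnostic rule that reports both counts for every sentence. This is only a sketch for experimenting: the always-true test and the message wording are illustrative and not part of the assignment.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <sch:rule context="sentence">
            <!-- Diagnostic only: fires once for every sentence. -->
            <sch:report test="true()">
                Relative path (orth): <sch:value-of select="count(orth)"/> element(s) in this sentence.
                Absolute path (//orth): <sch:value-of select="count(//orth)"/> element(s) in the whole document.
            </sch:report>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

In a document with several sentences, the second number grows with the document while the first stays tied to the sentence being checked.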

Let's look at a piece of the XPath closely:

//orth[count(string('-'))]

Here you're rounding up all of the <orth> elements in the entire document and filtering them with a predicate. Inside the predicate you start with a one-character string, which consists of a single hyphen. You wrap that in the string() function, which converts it to a string, but it's already a string, so the string() function has no effect here (and therefore shouldn't be used). You then wrap that in count(), which counts the number of items in the argument, but you know there's only one item, the one-character string you started with. The predicate, then, evaluates to the number 1. A numeric predicate is treated as a position test, so [1] keeps each <orth> that is the first <orth> child of its parent. Since every <sentence> has only one <orth>, that keeps all of them, and your filtering doesn't do anything.
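
The inner pieces of that predicate can be checked in isolation. The sketch below is a throwaway XSLT 2.0 stylesheet that ignores its input; the second sample string is made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <!-- string() applied to a string returns the same string: outputs true -->
        <xsl:value-of select="string('-') eq '-'"/>
        <xsl:text>&#10;</xsl:text>
        <!-- count() counts items, not characters: outputs 1 -->
        <xsl:value-of select="count(string('-'))"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Still 1, however many hyphens the string contains -->
        <xsl:value-of select="count('a-b-c-d')"/>
    </xsl:template>
</xsl:stylesheet>
```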

So what should you do? You're right to start with the <orth> element (except that you should omit the double slashes at the beginning, so that if you have more than one sentence, you'll get only the <orth> child of the sentence you're processing at the moment). You then need to find the number of hyphens in that element. One way to do that is with the count() function, so you're on the right track there, but you can't just count hyphens directly. The reason is that the individual characters of a string aren't countable; the string is a single item (if you count the string, the value of the count will be 1), and you have to break it up into pieces in order to count specific characters in it. Here are three strategies:

  1. You can get the length of the original string with string-length(). You can then use translate() to replace all of the hyphens in the string with nothing and get the length of the result. When you subtract the latter from the former, the difference will be the number of hyphens.
  2. You can use tokenize() to split the string into pieces by dividing at hyphens. The number of pieces will be one more than the number of hyphens.
  3. XPath doesn't have a function that makes it easy to break a string into a sequence of individual characters. But there is a string-to-codepoints() function (see Michael Kay for details) that splits a string into a sequence of numerical values that correspond to the individual characters, and there's a matching codepoints-to-string() function that does the reverse. You can combine these to get a sequence of individual characters, use a predicate to keep only the ones that are hyphens, and then count them.

We usually use 1 and sometimes 2. 3 is the fussiest of the three, and therefore harder to develop.
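
The three strategies can be sketched side by side in a small XSLT 2.0 stylesheet. The $sample variable and its value are assumptions for illustration; in the Schematron you would apply the same expressions to a tier such as orth. The stylesheet ignores its input document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <!-- Hypothetical sample string containing two hyphens -->
    <xsl:variable name="sample" select="'blah-blah-blah'"/>
    <xsl:template match="/">
        <!-- Strategy 1: length with hyphens minus length without them -->
        <xsl:value-of
            select="string-length($sample) - string-length(translate($sample, '-', ''))"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Strategy 2: number of pieces minus one -->
        <xsl:value-of select="count(tokenize($sample, '-')) - 1"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Strategy 3: split into single characters, keep the hyphens, count them -->
        <xsl:value-of
            select="count((for $c in string-to-codepoints($sample)
                           return codepoints-to-string($c))[. eq '-'])"/>
    </xsl:template>
</xsl:stylesheet>
```

Each of the three lines of output is 2 for this sample.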