Question about generating Contents from n values on pb

JanelleJenstad commented 1 year ago

Question for @martindholmes (when you are back):

In the hamburger menu for semi-diplomatic (OS) texts, the items in the Contents list are generated from the @n value on the <pb> element.

Why are these not being generated for some files, including the Douai files (such as emdDouai_JC)? Is it because of the catDescs? The encoding of <pb/> is exactly the same in emdFV_Q1 and in emdDouai_JC. But the former has lbfQuarto and the latter has lbfManuscript.

Example:

<pb n="A1v" /> in the XML of emdFV_Q1 yields the following HTML.

<div data-el="list" role="list">
  <div data-el="item" role="listitem"><a data-el="ref" data-target="#emdFV_Q1_A1r" href="#emdFV_Q1_A1r">A1r</a></div>
  <div data-el="item" role="listitem"><a data-el="ref" data-target="#emdFV_Q1_A1v" href="#emdFV_Q1_A1v">A1v</a></div>

martindholmes commented 1 year ago

It looks like the TOC is only generated from the pbs when the text qualifies as diplomatic:

<xsl:function name="hcmc:isDiplomatic" as="xs:boolean">
    <xsl:param name="root" as="element(tei:TEI)"/>
    <xsl:choose>
      <!--First check whether or not the thing is primary-->
      <xsl:when test="hcmc:isPrimary($root) and not($root/descendant::tei:catRef[matches(@target,'Modern')])">
        <xsl:value-of select="true()"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="false()"/>
      </xsl:otherwise>
    </xsl:choose>
</xsl:function>

To qualify, it needs to be NOT Modern, and to be primary:

<xsl:function name="hcmc:isPrimary" as="xs:boolean">
    <xsl:param name="root" as="element(tei:TEI)"/>
    <xsl:choose>
      <xsl:when test="$root/descendant::tei:catRef[matches(@target,'ldtPrimary')]">
        <xsl:value-of select="true()"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="false()"/>
      </xsl:otherwise>
    </xsl:choose>
</xsl:function>

So as far as I can see, emdDouai_JC should qualify, so this is a bit puzzling. I'll look into it further.

martindholmes commented 1 year ago

Ah -- pbs only qualify if their @n attribute matches the regex for a signature, because the TOC is supposed to be constructed based on signatures:

<xsl:variable name="sigRegex" select="'\d*[A-Z]+\d+(r|v)'" as="xs:string"/>

i.e. zero-or-more digits, one or more capital letters, one or more digits, and r or v.

But emdDouai_JC does not have signatures; it has page numbers of the form /^\d+[rv]$/. So there's your problem. Do we want to modify the definition of a signature to include anything that occurs in a page number? We could do that by making the upper-case letter(s) optional. If we do that, will there now be other texts which suddenly get TOCs listing every page when they didn't before?
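
To make the mismatch concrete, here is a minimal Python sketch (not the project's XSLT) that tries the quoted pattern against a signature and a folio number. It assumes the stylesheet applies the pattern with an unanchored `matches()`, so `re.search` is used here. The second pattern, with the capital letters made optional, only illustrates the change proposed above; the name `SIG_OR_FOLIO_REGEX` is hypothetical.

```python
import re

# Pattern quoted in the comment above.
SIG_REGEX = r"\d*[A-Z]+\d+(r|v)"
# Hypothetical variant: capital letters made optional so folio numbers pass.
SIG_OR_FOLIO_REGEX = r"\d*[A-Z]*\d+(r|v)"


def matches(pattern: str, value: str) -> bool:
    """Unanchored match, mirroring the behaviour assumed for XPath matches()."""
    return re.search(pattern, value) is not None


if __name__ == "__main__":
    # "167" stands for a folio number that has lost its r/v suffix.
    for n in ["A1v", "2B3r", "167r", "167"]:
        print(n, matches(SIG_REGEX, n), matches(SIG_OR_FOLIO_REGEX, n))
```

With these inputs the original pattern accepts `A1v` and `2B3r` but rejects `167r`, while the relaxed variant accepts all three; a value with no `r` or `v` suffix fails both.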

martindholmes commented 1 year ago

Incidentally 167r is missing its r.

JanelleJenstad commented 1 year ago

Got it ... and found two more. Merci!

JanelleJenstad commented 1 year ago

Yes, modifying the regex for a signature would be a nice solution and fix a whole bunch of other files at the same time. Please go ahead.

JanelleJenstad commented 1 year ago

Rationale: we will have nearly as many manuscript as printed plays, if I can get enough editors lined up! So we need to accommodate foliation as well as signature sequences.

martindholmes commented 1 year ago

As of rev 12937, sigRegex is now sigFolRegex, and foliation numbers will be matched and treated in the same way as signatures. @JanelleJenstad bear in mind that the way the code is written to generate a TOC from pb/@n values, if it finds a single pb/@n that does not match the regex, it will not build the TOC, so these values have to be right. If the build works as hoped, this can be closed.
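
The all-or-nothing behaviour described here can be sketched in Python (again, not the project's XSLT). The actual sigFolRegex is not quoted in this thread, so `SIG_FOL_REGEX` below is a hypothetical stand-in, and `toc_from_pb_values` is an illustrative name.

```python
import re
from typing import List, Optional

# Hypothetical stand-in for sigFolRegex; the real pattern is not shown in the thread.
SIG_FOL_REGEX = re.compile(r"\d*[A-Z]*\d+(r|v)")


def toc_from_pb_values(pb_values: List[str]) -> Optional[List[str]]:
    """Return the TOC entries, or None if any single value fails the pattern."""
    if not pb_values:
        return None
    if not all(SIG_FOL_REGEX.search(n) for n in pb_values):
        # One bad value is enough to suppress the whole TOC.
        return None
    return list(pb_values)


if __name__ == "__main__":
    print(toc_from_pb_values(["166v", "167r"]))  # every value matches
    print(toc_from_pb_values(["166v", "167"]))   # one value lacks r/v
```

The second call returns `None` because of the single value without an `r` or `v`, which is why one mistyped pb/@n is enough to make the whole TOC disappear.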

martindholmes commented 1 year ago

This ended up proving very messy and not working at all, so I've now rolled back to the original rev of the XSLT before I did anything. We'll need a comprehensive list of all the different forms that can appear in pb/@n and how each of them should be handled when it comes to constructing, or deciding not to construct, a table of contents.

martindholmes commented 1 year ago

Final decision on how to proceed:

Leading instances of /\d+;\s*/ in pb/@n have been removed per issue #145.
The $sigRegex is far too restrictive, since all sorts of things occur in genuine sigs, as well as folio numeration, so the only restriction should be that there are no spaces.
pb/@n should always be used to construct ids, and therefore Schematron should mandate that all pb/@n values are unique within the document.
When constructing a TOC, check to see if there are divs with heads; if so, use those. Failing that, if there are pb/@ns, then use those.
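
As a rough Python illustration of the second and third points: the HTML example earlier in this thread shows targets such as `#emdFV_Q1_A1r`, so the sketch assumes ids are formed as document id, underscore, pb/@n. That is an inference from that one example, not a statement of how the build does it, and `pb_id` is a hypothetical helper.

```python
import re


def pb_id(doc_id: str, n: str) -> str:
    """Build an anchor id such as 'emdFV_Q1_A1r' from a document id and a pb/@n value."""
    # The only restriction proposed above: no whitespace in pb/@n.
    if n == "" or re.search(r"\s", n):
        raise ValueError(f"pb/@n must be non-empty and contain no whitespace: {n!r}")
    return f"{doc_id}_{n}"


if __name__ == "__main__":
    print(pb_id("emdFV_Q1", "A1r"))
    print(pb_id("emdDouai_JC", "167r"))
```

Because the id is derived directly from pb/@n, two page breaks with the same @n would produce the same id, which is the reason for the uniqueness requirement.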

martindholmes commented 1 year ago

There are six instances of pb/@n which are not unique, found with this XPath: //pb[@n = following::pb/@n]. These will need to be fixed before we proceed.
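
For anyone checking a file outside an XPath tool, the same test can be approximated in Python with the standard library. This is only a sketch: it assumes the pb elements are in the TEI namespace, reports the duplicated values rather than the elements themselves, and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def duplicate_pb_values(xml_text: str) -> List[str]:
    """Return every pb/@n value that occurs more than once in the document."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        pb.get("n") for pb in root.iter(f"{TEI_NS}pb") if pb.get("n") is not None
    )
    return sorted(n for n, count in counts.items() if count > 1)


if __name__ == "__main__":
    # Invented sample: "1v" appears twice.
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        '<pb n="1r"/><pb n="1v"/><pb n="1v"/><pb n="2r"/>'
        "</body></text></TEI>"
    )
    print(duplicate_pb_values(sample))
```

The XPath selects the earlier pb of each duplicated pair, whereas this function returns the repeated @n values, which is usually enough to locate them.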

JanelleJenstad commented 1 year ago

Thanks for the XPath. We have now found and fixed all six instances. Please proceed at your convenience.

martindholmes commented 1 year ago

There's one more thing I think we should consider before we proceed. A TOC which consists of every pb in the document is going to be read out by a screen reader at painful length, and will be a horrible thing for someone trying to navigate the document using a keyboard. Are we absolutely sure it's a good idea to create an index to every page?

martindholmes commented 1 year ago

In the case of emdDouai_JC, there are labels throughout the text. Wouldn't it be better, in the absence of div/heads, to use any label elements to create a more meaningful TOC? @JanelleJenstad @LEMDO-PM don't you think this would be more reader-friendly than a list of all the pages?

martindholmes commented 1 year ago

Whatever is decided here should wait on issue #146.

martindholmes commented 1 year ago

In rev 13154, I've committed changes to build the in-page TOC as follows:

If there are div/head elements in the text, use those.
If there are no div/heads, but there are more than four label elements, use those.
Otherwise, if there are signatures, use those.

This seems a decent compromise all round, and works nicely for e.g. emdDouai_JC. Waiting to see if it builds successfully.
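
The order of precedence can be summarised in a short Python sketch. It is not the XSLT committed in rev 13154; the function name, the tuple it returns, and the `min_labels` parameter are all hypothetical, and the inputs are plain lists of strings standing in for the elements found in a text.

```python
from typing import List, Tuple


def toc_source(
    heads: List[str],
    labels: List[str],
    signatures: List[str],
    min_labels: int = 4,
) -> Tuple[str, List[str]]:
    """Pick the TOC source: div/head first, then labels, then pb/@n signatures."""
    if heads:
        return ("head", heads)
    # Labels are used only when there are MORE than min_labels of them.
    if len(labels) > min_labels:
        return ("label", labels)
    if signatures:
        return ("pb", signatures)
    return ("none", [])


if __name__ == "__main__":
    print(toc_source([], ["a", "b", "c", "d", "e"], ["1r", "1v"]))
    print(toc_source([], ["a", "b"], ["1r", "1v"]))
```

Keeping the threshold as a parameter makes it easy to change the cut-off without touching the order of the checks.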

martindholmes commented 1 year ago

This is working, but there is an extra feature we might want to allow: for cases where the actual text of the label is perhaps not ideal as a TOC entry, we could provide an option to put @n on a label, to provide an edited or canonical caption to use in the TOC. Then I would first check to see if there are more than four labels with @n, and if so, use only labels with @n; if there are four or fewer with @n, but still more than four labels in total, then I would use all labels but substitute @n where it appears; and if there are four or fewer labels, I would use pb/@n. Waiting to see if the Douai folks would like to use this feature before adding it to the implementation.
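
A Python sketch of this proposal, under stated assumptions: it covers only the label step (div/head precedence is left out), it treats exactly four as below the threshold, and the `Label` type and `label_toc` function are hypothetical names rather than anything in the codebase.

```python
from typing import List, NamedTuple, Optional


class Label(NamedTuple):
    text: str                # the label's own text content
    n: Optional[str] = None  # optional @n override caption


def label_toc(labels: List[Label], pb_values: List[str], threshold: int = 4) -> List[str]:
    """Choose TOC captions from labels (preferring @n), else fall back to pb/@n."""
    with_n = [label for label in labels if label.n is not None]
    if len(with_n) > threshold:
        # Enough labels carry @n: use only those, ignoring the rest.
        return [label.n for label in with_n]
    if len(labels) > threshold:
        # Use every label, substituting @n where it is present.
        return [label.n if label.n is not None else label.text for label in labels]
    # Too few labels: fall back to the page breaks.
    return list(pb_values)


if __name__ == "__main__":
    labels = [Label("Actus primus"), Label("Sc. 2", n="Scene 2")] + [
        Label(f"Label {i}") for i in range(3)
    ]
    print(label_toc(labels, ["1r", "1v"]))
```

In the sample, only one of the five labels has @n, so all five are used and the one override caption replaces that label's text.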

martindholmes commented 1 year ago

No response from the Douai folks on this, so I'm closing it.

JanelleJenstad commented 1 year ago

I only just asked the Douai folks in a Zoom consultation. Yes, the Douai folks would like to be able to add an @n value to <label> so that they can (in a few cases) override the text node of the <label> element. I am re-opening this ticket and asking @martindholmes to work on it after his vacation.

martindholmes commented 1 year ago

@JanelleJenstad Let's nail down the specification, then. The options are:

1. If there are more than four labels, we use labels, and whenever one has @n we use that, but for all others we use their content. This means all labels would be used whenever there are more than four in the document.
2. If there are more than four labels with @n, then we use only labels with @n, ignoring those which don't have @n. This enables the encoders to distinguish between labels which they want to use in a TOC and those they don't. If there are four or fewer with @n, then we proceed as in no. 1 above.
3. If there are any labels at all, we use labels for the TOC and ignore pbs; we use @n where it exists, and the label content where it doesn't.
4. Something else?

Four is just an arbitrary number I thought might make sense for a decent-sized text, but that number can be anything you think is appropriate.

JanelleJenstad commented 1 year ago

Changing the milestone on this one because Douai wants it.

This specification is great. Four seems reasonable to me. We can tweak the number if necessary.

Please go for it.

martindholmes commented 1 year ago

The label "Question" threw me on this -- it should be implemented, if in fact it hasn't already been done. I'll go back and look at commits.

martindholmes commented 1 year ago

Final implementation added in rev 16361. I'll now add some documentation.

martindholmes commented 1 year ago

Documentation added in rev 16362. If builds are all OK, this can be closed.

martindholmes commented 9 months ago

Note before closing: At JJ's request, the threshold for using labels has been changed from four labels to twenty labels. This does not affect current Douai texts, which all have more than twenty, but if/when the Douai team start tagging new texts, the labels will not be used until there are twenty of them.

projectLEMDO / lemdoIssues

Question about generating Contents from n values on pb #144