mbakeranalecta / sam

Semantic Authoring Markdown

Dereferencing compound identifiers #155

Closed mbakeranalecta closed 6 years ago

mbakeranalecta commented 7 years ago

By compound identifiers I mean ones that reference one resource inside another, such as a figure inside a particular chapter, as opposed to just referencing the figure itself. SAM does not have any syntax support for this, and it is not clear that it should. Below is an email exchange that explores the issue, which I am capturing here so it can be thought about more:

Hi Richard,

That is an interesting use case that I have not really thought through until now.

SAM does not provide any facility for dereferencing compound identifiers (#chapter.separate/ex.separate-1), but then again, neither does XML. That is purely an application layer thing.

But unlike XML, SAM provides specific syntax for dereferencing identifiers in the form of citation markup. And SAM also makes an explicit distinction between names and IDs at the authoring level, whereas in XML the distinction between ID/IDREF and all other forms of identifier creation and dereferencing that you may decide to invent at the application layer is not expressed in the markup, but only in the schema. The author does not know, when creating an attribute id="foo", if the id attribute is of type "ID" or not. Similarly, when they create a reference idref="foo", they do not know if the attribute idref is of type IDREF or not (and therefore if it must be resolved locally or not).

The decision in SAM to force the author to choose between an ID with a * and a name with a # was based on my distaste for hidden semantics. If I am forcing the author to create things that have different rules, then they should have a different form so that the distinction is clear.

But when I look at the case you propose, it is clear that this approach makes dereferencing of compound identifiers more problematic. If the compound identifier is in the form

[#chapter.separate/ex.separate-1]

Is ex.separate-1 an ID or a name? And based on the principle that the distinction between names and ids is explicit in markup, shouldn't it be:

[ #chapter.separate/#ex.separate-1]

Or:

[#chapter.separate/*ex.separate-1]

The current XML serialization of SAM would render the above as:

<citation type="nameref" value="chapter.separate/*ex.separate-1"/>

That is by no means impossible to process. It requires the processor to break apart the nameref value, but any compound identifier is going to require that. On the other hand, the semantics are a bit wonky. This is not really a nameref anymore. The final resource being identified has an ID rather than a name. Still, there is nothing (currently) to prevent the application designer from implementing this in their markup language and its processors.

However, this will only work for names as the first identified resource. The following would produce a parser error:

[*section.separate/*ex.separate-1]

This is an error because the parser will consider the entire string as an ID, which means it won't match.

In an XML vocabulary it is possible to create a reference to an ID without running into this issue by not declaring the reference to be of type IDREF. In SAM, the citation of an ID is always an IDREF.

Of course, when you are designing your markup language, nothing forces you to use IDs. You can use names for all references, and the current processing will give you free rein to create whatever conventions you like for dereferencing names at the application layer. So I am not sure that there is anything to be gained by creating more explicit support for compound identifiers in SAM.

In any case, compound identifiers are all about namespaces. In SAM (and XML) the namespace of an ID is the current file. In XML, any other identifier is an invention of the application layer and the application layer can make the scope of its namespace anything it likes. It can make names global in scope but specific to different types of objects, so that a footnote named #foo is different from a figure named #foo or a page named #foo. But this assumes that you have different constructs for dereferencing the names, so that the reference to a footnote is different from a reference to a figure or a page.

SAM's names do not work like that. The dereferencing of a name by a citation makes no statement about what is being dereferenced. Thus [#foo] becomes a reference to a footnote because the application layer looks up what type of thing has that name and formats it accordingly. In other words, the format is determined based on the type of the object named, not the type of the reference. This means that SAM names have a global namespace with respect to types. The application layer could decide to restrict their namespace to the file in which they occur, or indeed to any subset of the docset it chooses, but it can't restrict it by type.
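
The point that the format follows the type of the object named, not the type of the reference, can be sketched roughly as follows. This is only an illustration of what an application layer might do; the registry, the labels, and the numbers are made up and are not part of SAM.

```python
# Hypothetical registry mapping SAM names to the objects they identify.
# The names, types, and numbers here are invented for illustration.
NAMED_OBJECTS = {
    "ex.separate-1": {"type": "figure", "number": "8.1"},
    "chapter.separate": {"type": "chapter", "number": "8"},
    "note.scope": {"type": "footnote", "number": "3"},
}

# One label per target type; the citation itself carries no type.
LABELS = {"figure": "Figure", "chapter": "Chapter", "footnote": "Note"}


def format_citation(name: str) -> str:
    """Format a [#name] citation based on the type of the thing named."""
    target = NAMED_OBJECTS.get(name)
    if target is None:
        raise KeyError(f"no object named #{name} in the docset")
    return f"{LABELS[target['type']]} {target['number']}"


print(format_citation("ex.separate-1"))     # Figure 8.1
print(format_citation("chapter.separate"))  # Chapter 8
```

Because the lookup table is keyed by name alone, two objects of different types cannot share a name, which is the sense in which the namespace is global with respect to types.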

All of which suggests that it is best to treat SAM names as global in scope in all senses. That global scoping is implicit in the naming scheme that you use for names (and which I have attempted to follow), which is that the first part of the name is a type identifier (#figure.foo).

And if names are global in scope, you don't need to do:

 [#chapter.separate/#ex.separate-1]

Because the name #ex.separate-1 is global in scope anyway. So

 [#ex.separate-1]    

is all you need to identify that resource.

Because we are working one step back from DocBook, we could fairly construct "Figure 8.1 in Chapter 8" by looking back from the element with the name #ex.separate-1 to its parent chapter then constructing a reference in the form:

 <db:xref linkend="{$nameref}"/> in <db:xref linkend="{$chapter-name}"/> 

The only complication here is that we would have to look through all the files that make up the book to find chapter[.//*[name=$nameref]]. Easy enough to do in XSLT2 (or in SPFE). But this is one of those cases when you have to ask the design question of whether you want the author to do this lookup when they create the reference (and possibly get it wrong or have it grow stale) or whether you want the build to look it up at build time (and thus find it even if it has since moved to another chapter).

It is certainly true, though, that SAM's name facility is not as flexible as some of the naming conventions you could invent in an XML-based language where you can use arbitrary attributes to create arbitrary addressing and dereferencing schemes. This is deliberate, because relationships based on arbitrary names don't scale well without big iron CMSs that add complexity and reduce functional lucidity. SAM wants you to manage relationships based on subject annotations as far as possible, not arbitrary names or IDs.
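
For what it is worth, the build-time version of that lookup might look something like this in Python rather than XSLT. It is a sketch under assumptions: the chapter sources are invented, and it assumes names are serialized as a `name` attribute and that chapters are `<chapter>` elements.

```python
import xml.etree.ElementTree as ET

# Two made-up chapter files, already converted from SAM to XML.
# Serializing names as a `name` attribute is an assumption of this sketch.
CHAPTER_FILES = [
    '<chapter name="chapter.intro"><p>Intro</p></chapter>',
    '<chapter name="chapter.separate">'
    '<section><figure name="ex.separate-1"/></section>'
    '</chapter>',
]


def find_chapter_name(nameref, sources):
    """Return the name of the chapter that contains the element named nameref."""
    for source in sources:
        root = ET.fromstring(source)
        # iter() includes the root itself, so a root <chapter> is covered.
        for chapter in root.iter("chapter"):
            for element in chapter.iter():
                if element is not chapter and element.get("name") == nameref:
                    return chapter.get("name")
    return None


def compound_xref(nameref, sources):
    """Build the 'figure in chapter' pair of DocBook xrefs at build time."""
    chapter_name = find_chapter_name(nameref, sources)
    if chapter_name is None:
        raise LookupError(f"no chapter contains #{nameref}")
    return f'<db:xref linkend="{nameref}"/> in <db:xref linkend="{chapter_name}"/>'


print(compound_xref("ex.separate-1", CHAPTER_FILES))
```

The author only ever writes the simple name; the chapter is discovered when the book is built, so the reference does not go stale if the figure moves.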

SAM does have one other identifier dereferencing mechanism, however, and that is key citations.

 [%foo]

SAM does not provide a mechanism for creating keys or any rules for managing them. That is entirely up to the application layer. One way it might do this would be:

key:
      name: foo
      chapter: chapter.separate
      example: ex.separate-1
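
A rough sketch of how an application layer might resolve such a key, assuming the key records have already been loaded into a dictionary. The table layout and the function name are hypothetical; SAM defines neither.

```python
# Hypothetical key table, as an application layer might load it from
# records like the one above. SAM itself defines none of this.
KEYS = {
    "foo": {"chapter": "chapter.separate", "example": "ex.separate-1"},
}


def resolve_key(key):
    """Resolve a [%key] citation to a (chapter name, example name) pair."""
    try:
        record = KEYS[key]
    except KeyError:
        raise LookupError(f"unknown key %{key}") from None
    return record["chapter"], record["example"]


print(resolve_key("foo"))  # ('chapter.separate', 'ex.separate-1')
```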

If DITA has taught us anything it is that when you are creating compound identifiers, things will get out of hand as the scale increases and keys will help restore some order (or at least some management potential). Subject annotation is still infinitely preferable, where practical, but keys are sometimes a necessary fallback.

Sorry, that is very rambling, and I don't blame you if you skipped most of it. It is really just me thinking through the issues off the top of my head.

Mark

Hi Mark,

DocBook provides some complex ways to handle links that are external to a particular file (e.g., olink). I don’t use those methods; I just use standard IDs and don’t worry if that makes an individual file technically invalid, since the target doesn’t exist. Once the book is assembled using xinclude, that’s when it matters.

So, what you’re doing right now works fine. When I need an external link, I’ll create one.

That said, it might be interesting to be able to link to something like chapter.separate/ex.separate-1. What might make that interesting is that you could imagine more ways to format links, such as: Figure 8.1 in Chapter 8, which is not easy to do in DocBook automatically.

But, that’s all for future contemplation:-).

FYI, I’m digging into composition.sam now, though I won’t have much before tomorrow.

Richard

On Aug 6, 2017, at 16:29, Mark Baker wrote:

The difference between * and # is that * is an ID that is local to the current file and # is a name that is not required to be local to the current file (though it could be). Whether or not they are used to create links depends on where they are used. In the case of a citation, however, they are used to cite something internal to the file or (potentially) external to it, respectively. Technically there is no guarantee that the name referenced by # is external to the current file. It just means that the parser does not verify the reference. But if you want to reference something outside the current file you have to use #. Not sure if DocBook requires the target of an external link to be external. If so, then technically sam2docbook.xsl should check the destination and create the appropriate link type.
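
To make the verification difference concrete, here is a minimal sketch of a check that validates ID citations against the current file and leaves name citations alone. The two regular expressions are simplifications assumed for this example only; they are not the SAM parser's actual patterns.

```python
import re

# Assumed, simplified patterns for this sketch only: "(*id)" in a block
# header declares an ID, and "[*id]" or "[#name]" cites one.
ID_DECLARATION = re.compile(r"\(\*([\w.-]+)\)")
CITATION = re.compile(r"\[([*#])([\w.-]+)\]")


def unresolved_ids(sam_text):
    """Return cited IDs that are not declared in the same file.

    Name citations ([#name]) are deliberately not checked, since the
    thing they name may live in another file.
    """
    declared = set(ID_DECLARATION.findall(sam_text))
    return [
        value
        for marker, value in CITATION.findall(sam_text)
        if marker == "*" and value not in declared
    ]


sample = """
figure:(*ex.local)
See [*ex.local], [*ex.missing] and [#ex.separate-1].
"""
print(unresolved_ids(sample))  # ['ex.missing']
```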

Mark

On Aug 6, 2017, at 7:16 PM, Richard Hamilton wrote:

Hi Mark,

I just did a pull request for processing.sam, along with a small change to separatecontentfromformatting.sam.

The small change was to change an internal link into an external link (that is, *ex.separate-1 to #ex.separate-1). This brings up a question. Is the only difference between the * and the # whether the link is external to the current file, or is there some other difference? I needed to refer to this figure from processing.sam, and it looks like using the # makes this an external link. Is that all that I needed to do?

mbakeranalecta commented 7 years ago

I think the answer to this is to leave it alone, but it is probably worth noting somewhere the limits on names and IDs relative to what you can create in XML

mbakeranalecta commented 6 years ago

Addressed in cfe90dce9f780f90c7665d3c081cc904660f8c2b by adding to docs.

mbakeranalecta commented 6 years ago

Further note to add to the docs. Spaces are legal in names. This would allow you to treat nameref as NMTOKENS in XML and access compound names that way.

mbakeranalecta commented 6 years ago

So, per #168, spaces are not legal in names, but text after a space is supposed to parse into an extra attribute, which is just as useful. It says, here is the name to reference and here is some extra data to identify the thing to insert. Of course, this gives extra a different semantic than it has in a plain citation, but citations used in inserts have a different semantic anyway. And this is an application layer semantic. SAM is not saying that this is what you must use extra for. That is up to you.

mbakeranalecta commented 6 years ago

The extra attribute, mentioned in the last entry, is an artifact of using citation syntax for inserts by reference. But if we reverse that decision per #161 then there is no longer an extra by default unless we decide to add one. The only reason to do so would be to support this compound identifier case, but it is not a complete or obvious solution to this problem, and therefore not the way we should support it if we are going to support it.

If we wanted to support compound identifiers, the obvious way to do this is to use / as a separator:

[ #chapter.separate/#ex.separate-1]

That is clear and easy to understand. The issue is how to serialize it.

The serialization of a simple identifier is:

<citation nameref="chapter.separate"/>

This serialization breaks the type identifier # from the name chapter.separate so there is no indication in the value attribute of what type of value it is. Thus we can't simply do:

<citation nameref="chapter.separate/ex.separate-1"/>

Because this obscures the type of ex.separate-1. To serialize a compound identifier, we need to maintain the type of each part of the identifier. That means we have to find a way to represent each step of the name in the citation element or retain the leading type indicator.

One way to do this would be:

<citation compound-ref="#chapter.separate/*ex.separate-1"/>

Here the processing application has to break apart the value to get the type and values when it sees compound-ref. This is not hard to do in XSLT using tokenize() and substring methods. Tokenize is not part of XSLT1 but is usually available as an extension function in most environments (Python, for instance).
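
Outside XSLT, the same split is only a few lines. A minimal Python sketch, assuming `/` as the separator and only `#` and `*` as type markers (the function name and the type labels are just for illustration):

```python
# Map each leading type marker to the reference type it stands for.
TYPE_MARKERS = {"#": "nameref", "*": "idref"}


def split_compound_ref(value):
    """Split '#chapter.separate/*ex.separate-1' into (type, value) steps."""
    steps = []
    for token in value.split("/"):
        marker, name = token[:1], token[1:]
        if marker not in TYPE_MARKERS or not name:
            raise ValueError(
                f"invalid step {token!r} in compound reference {value!r}"
            )
        steps.append((TYPE_MARKERS[marker], name))
    return steps


print(split_compound_ref("#chapter.separate/*ex.separate-1"))
# [('nameref', 'chapter.separate'), ('idref', 'ex.separate-1')]
```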

The alternative to avoid the need to tokenize is to use nested citations. One could do:

<citation nameref="chapter.separate">
      <citation idref="ex.separate-1"/>
</citation>

(And similarly for inserts.)

This avoids the need for the processing application to do any parsing. Does it make the processing logic any easier, though? It means that when you see a citation, you have to look for child citations to resolve the reference fully. Is that problematic?
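
As a rough gauge of the processing cost, here is what flattening the nested form might look like in Python. It assumes at most one child citation per level and the `nameref` and `idref` attributes shown above.

```python
import xml.etree.ElementTree as ET


def reference_steps(citation):
    """Flatten a nested <citation> into an ordered list of (type, value)."""
    steps = []
    node = citation
    # Compare against None explicitly: an Element with no children is falsy.
    while node is not None:
        for ref_type in ("nameref", "idref"):
            if ref_type in node.attrib:
                steps.append((ref_type, node.attrib[ref_type]))
        node = node.find("citation")  # first child citation, or None
    return steps


nested = ET.fromstring(
    '<citation nameref="chapter.separate">'
    '<citation idref="ex.separate-1"/>'
    "</citation>"
)
print(reference_steps(nested))
# [('nameref', 'chapter.separate'), ('idref', 'ex.separate-1')]
```

So the child lookup is a short loop, not much harder than the tokenizing approach, at least in a general-purpose language.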

mbakeranalecta commented 6 years ago

Should note that this is not just about citations. Identifiers are used in inserts as well. How would a compound insert work?

<insert nameref="chapter.separate">
      <insert idref="ex.separate-1"/>
</insert>

This seems weird compared to the citation case. (Citations can be "this book on this page".)

Another way to express this would be:

<insert>
     <nameref value="chapter.separate">
          <idref value="ex.separate-1"/>
    </nameref>
</insert>

Which would translate to citations as:

<citation>
     <nameref value="chapter.separate">
          <idref value="ex.separate-1"/>
    </nameref>
</citation>

This seems much more semantically clear. But then the question becomes whether you serialize all inserts and citations like this or if you leave the single citation cases as they are now.

Actually, making them different might make processing easier, since you can write separate rules for inserts and citations that don't have a ref or a value attribute.

mbakeranalecta commented 6 years ago

From a schema point of view you have to ask if the schema can allow or disallow compound identifiers.

mbakeranalecta commented 6 years ago

If we are supporting compound inserts and citations, this raises the question of whether we should support sub-document URLs.

>>>(text foobar.sam/*bonk)

This would mean defining a new controlled type text or else types for formats like sam and xml, though the definition of include types can be left to the application layer as well. (Actually, of course, it is the type of the object, not the type of the file, that really matters. That could be confusing, but it can be left to the application layer.)

Adding an include type for this seems the obvious course for includes. What about citations? There is no obvious extension of current syntax to support this.

[foo.com/index.html#bar]

Already means something.

[foo.com/index.xml/#bar]

Is technically different, but not different enough to be lucid.

We would need a different type of citation completely to support this. But this is difficult when one of the forms of the citation is plain text. Introducing another symbol just for this would be messy and not lucid.

Another way to look at this is simply to say that inserting or citing a URL is just that, and that URLs already have syntax for sub-resource identification. So, we don't extend SAM compound identifier syntax to external files because URLs already do all that is needed. Interpreting them is, of course, up to the application layer.

mbakeranalecta commented 6 years ago

Should the serialization be:

<insert>
     <nameref value="chapter.separate">
          <idref value="ex.separate-1"/>
    </nameref>
</insert>

or

<insert>
     <nameref value="chapter.separate"/>
     <idref value="ex.separate-1"/>
</insert>

The nested form is certainly more expressive, but which is easier to process? The nested form requires iterative processing. The list form requires you to look at each item in the list in turn to determine where the list ends.

We should also consider that when it comes to citations, they can have things nested inside them. Thus

<citation>
     <nameref value="chapter.separate">
          <idref value="ex.separate-1"/>
    </nameref>
</citation>

Can become:

<phrase>
<citation>
     <nameref value="chapter.separate">
          <idref value="ex.separate-1">
                Foo bar
          </idref>
    </nameref>
</citation>
</phrase>

which would make the list form:

<phrase>
<citation>
     <nameref value="chapter.separate"/>
     <idref value="ex.separate-1"/>
    Foo bar
</citation>
</phrase>

Which of these is easier to process? Or should we come up with another form where the processing problem is more constrained? Maybe:

<phrase>
<citation>
     <citation-elements>
          <nameref value="chapter.separate"/>
          <idref value="ex.separate-1"/>
     </citation-elements>
     Foo bar
</citation>
</phrase>

That would be much cleaner to process. Could do the nested format this way as well, but is there a benefit to it?

Maybe it would be easier still to do it this way:

<phrase>
<citation>
     <citation-elements>
          <citation-element type="nameref" value="chapter.separate"/>
          <citation-element type="idref" value="ex.separate-1"/>
     </citation-elements>
Foo bar
</citation>
</phrase>

That way you don't have to deal with different element types to extract all the elements of the reference.
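
A minimal sketch of reading this last form in Python, to show how little the processor has to do. The element and attribute names are the ones proposed above; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<phrase>
<citation>
     <citation-elements>
          <citation-element type="nameref" value="chapter.separate"/>
          <citation-element type="idref" value="ex.separate-1"/>
     </citation-elements>
Foo bar
</citation>
</phrase>
"""


def read_citation(citation):
    """Return the reference steps and the cited text of a <citation>."""
    steps = [
        (element.get("type"), element.get("value"))
        for element in citation.findall("citation-elements/citation-element")
    ]
    # The cited text follows the wrapper element, so it is that element's tail.
    wrapper = citation.find("citation-elements")
    text = (wrapper.tail or "").strip() if wrapper is not None else ""
    return steps, text


phrase = ET.fromstring(SAMPLE.strip())
steps, text = read_citation(phrase.find("citation"))
print(steps)  # [('nameref', 'chapter.separate'), ('idref', 'ex.separate-1')]
print(text)   # Foo bar
```

One path expression collects every step regardless of its type, which is the advantage of the single element name.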

mbakeranalecta commented 6 years ago

It shouldn't be <citation-elements> and <citation-element> as these apply to inserts as well. It should be <reference-elements> and <reference-element>.

mbakeranalecta commented 6 years ago

Can variables be inserted using compound identifiers?

>(#foo/$bar)

I don't see any reason why not at the syntax level. It would be up to the application layer to implement it accordingly.

mbakeranalecta commented 6 years ago

Related question: should variables be allowed in citations?

[$moby page 12]

Currently this is just seen as a value citation. But if you do this:

[#Melville/$bananas page 12]

You get:

SAM parser ERROR: Invalid compound identifier at: #Melville/$bananas

That is fine in itself, but the consistency may not be apparent to the average user.
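
One way to picture the rule behind that error is the sketch below. The grammar is assumed for illustration (one or more `/`-separated steps, each a `#name` or a `*id`) and is not the parser's actual regular expression; whether an ID may be the first step is left open here, and the sketch simply accepts either marker in any position.

```python
import re

# Assumed grammar for this sketch: steps separated by "/", each step
# being "#name" or "*id". A "$variable" step is therefore rejected.
COMPOUND_IDENTIFIER = re.compile(r"[#*][\w.-]+(?:/[#*][\w.-]+)*")


def check_compound_identifier(text):
    """Return text unchanged if it is a valid compound identifier."""
    if COMPOUND_IDENTIFIER.fullmatch(text) is None:
        raise ValueError(f"Invalid compound identifier at: {text}")
    return text


for candidate in ("#chapter.separate/*ex.separate-1", "#Melville/$bananas"):
    try:
        print("ok:", check_compound_identifier(candidate))
    except ValueError as error:
        print("error:", error)
```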

Another question about variables in compound identifiers is what positions they are allowed in. What would:

 >($foo/*bar) 

mean, for instance? Technically it should mean the object with the id bar inside the variable foo. But if you are doing those kinds of tricks I would seriously question the functional lucidity of your system.

You can't ban every possible piece of nonsensical markup, of course, but nor should we try to support cases whose semantics are as dubious as this.

The question remains, then, whether we should detect and raise an error for

[$moby page 12]

We should note that the semantics of this are very different from

[#moby page 12]

The latter is a citation by reference to an object. The former is citation by value with the value filled in by a variable. It is effectively the illegal:

[>($moby) page 12]

We don't allow nested markup elsewhere, so why make a special case for this? The no-nesting rule is one of the fundamental simplifying features of SAM markup, and violating it for minor cases just because the syntax would work does not seem appropriate.

mbakeranalecta commented 6 years ago

While it makes sense that annotations should be wrapped around the phrase text for ease of processing, it does not make as much sense for citations. Rather than:

<phrase>
<citation>
     <citation-elements>
          <citation-element type="nameref" value="chapter.separate"/>
          <citation-element type="idref" value="ex.separate-1"/>
     </citation-elements>
Foo bar
</citation>
</phrase>

It would seem to make more sense to do:

<phrase>
<citation>
     <citation-elements>
          <citation-element type="nameref" value="chapter.separate"/>
          <citation-element type="idref" value="ex.separate-1"/>
     </citation-elements>
</citation>
Foo bar
</phrase>

This would match how citations work on block quotes.

mbakeranalecta commented 6 years ago

Addressed in 1f20902624d29dab002353df8374952c63fff81d