Closed mbakeranalecta closed 6 years ago
I think the answer to this is to leave it alone, but it is probably worth noting somewhere the limits on names and IDs relative to what you can create in XML.
Addressed in cfe90dce9f780f90c7665d3c081cc904660f8c2b by adding to docs.
Further note to add to the docs. Spaces are legal in names. This would allow you to treat nameref as NMTOKENS in XML and access compound names that way.
So, per #168, spaces are not legal in names, but text after a space is supposed to parse into an extra attribute, which is just as useful. It says: here is the name to reference, and here is some extra data to identify the thing to insert. Of course, this gives extra a different semantic than it has in a plain citation, but citations used in inserts have a different semantic anyway. And this is an application layer semantic. SAM is not saying that this is what you must use extra for. That is up to you.
The extra attribute, mentioned in the last entry, is an artifact of using citation syntax for inserts by reference. But if we reverse that decision per #161, then there is no longer an extra by default unless we decide to add one. The only reason to do so would be to support this compound identifier case, but it is not a complete or obvious solution to this problem, and therefore not the way we should support it if we are going to support it.
If we wanted to support compound identifiers, the obvious way to do this is to use / as a separator:
[ #chapter.separate/#ex.separate-1]
That is clear and easy to understand. The issue is how to serialize it.
The serialization of a simple identifier is:
<citation nameref="chapter.separate"/>
This serialization separates the type identifier # from the name chapter.separate, so there is no indication in the attribute value of what type of value it is. Thus we can't simply do:
<citation nameref="chapter.separate/ex.separate-1"/>
Because this obscures the type of ex.separate-1. To serialize a compound identifier, we need to maintain the type of each part of the identifier. That means we have to find a way to represent each step of the name in the citation element or retain the leading type indicator.
One way to do this would be:
<citation compound-ref="#chapter.separate/*ex.separate-1"/>
Here the processing application has to break apart the value to get the types and values when it sees compound-ref. This is not hard to do in XSLT using tokenize() and the substring functions. tokenize() is not part of XSLT 1.0, but it is usually available as an extension function in most environments (Python, for instance).
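For processors written outside XSLT, the same break-apart step is a few lines of code. The following is a minimal Python sketch, not the SAM parser's own logic. The function name split_compound_ref and the SIGIL_TYPES table are hypothetical, and the sketch assumes only the # (name) and * (ID) type indicators used in the compound-ref example above.

```python
# Hypothetical mapping from leading type indicator to reference type.
# Only the two indicators used in the compound-ref example are assumed here.
SIGIL_TYPES = {"#": "nameref", "*": "idref"}


def split_compound_ref(value):
    """Split a compound-ref value into (type, value) pairs, one per step."""
    parts = []
    for step in value.split("/"):
        # Each step must start with a known type indicator and have a body.
        if not step or step[0] not in SIGIL_TYPES or len(step) < 2:
            raise ValueError(f"Invalid compound identifier step: {step!r}")
        parts.append((SIGIL_TYPES[step[0]], step[1:]))
    return parts


steps = split_compound_ref("#chapter.separate/*ex.separate-1")
```

A step without a type indicator, such as the second step of #chapter.separate/ex.separate-1, raises ValueError in this sketch, which reflects the point that each part of the identifier must carry its own type.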
The alternative to avoid the need to tokenize is to use nested citations. One could do:
<citation nameref="chapter.separate">
<citation idref="ex.separate-1"/>
</citation>
(And similarly for inserts.)
This avoids the need for the processing application to do any parsing. Does it make the processing logic any easier, though? It means that when you see a citation, you have to look for child citations to resolve the reference fully. Is that problematic?
Should note that this is not just about citations. Identifiers are used in inserts as well. How would a compound insert work?
<insert nameref="chapter.separate">
<insert idref="ex.separate-1"/>
</insert>
This seems weird compared to the citation case. (Citations can be "this book on this page".)
Another way to express this would be:
<insert>
<nameref value="chapter.separate">
<idref value="ex.separate-1"/>
</nameref>
</insert>
Which would translate to citations as:
<citation>
<nameref value="chapter.separate">
<idref value="ex.separate-1"/>
</nameref>
</citation>
This seems much more semantically clear. But then the question becomes whether you serialize all inserts and citations like this or if you leave the single citation cases as they are now.
Actually, making them different might make processing easier, since you can write separate rules for inserts and citations that don't have a ref or a value attribute.
From a schema point of view you have to ask if the schema can allow or disallow compound identifiers.
If we are supporting compound inserts and citations, this raises the question of whether we should support sub-document URLs.
>>>(text foobar.sam/*bonk)
This would mean defining a new controlled type text, or else types for formats like sam and xml, though the definition of include types can be left to the application layer as well. (Actually, of course, it is the type of the object, not the type of the file, that really matters. That could be confusing, but it can be left to the application layer.)
Adding an include type for this seems the obvious course for includes. What about citations? There is no obvious extension of current syntax to support this.
[foo.com/index.html#bar]
Already means something.
[foo.com/index.xml/#bar]
Is technically different, but not different enough to be lucid.
We would need a different type of citation completely to support this. But this is difficult when one of the forms of the citation is plain text. Introducing another symbol just for this would be messy and not lucid.
Another way to look at this is simply to say that inserting or citing a URL is just that, and that URLs already have syntax for sub-resource identification. So, we don't extend SAM compound identifier syntax to external files because URLs already do all that is needed. Interpreting them is, of course, up to the application layer.
Should the serialization be:
<insert>
<nameref value="chapter.separate">
<idref value="ex.separate-1"/>
</nameref>
</insert>
or
<insert>
<nameref value="chapter.separate"/>
<idref value="ex.separate-1"/>
</insert>
The nested form is certainly more expressive, but which is easier to process? The nested form requires iterative processing. The list form requires you to look at each item in the list in turn to determine where the list ends.
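To make the processing trade-off concrete, here is a small Python sketch using the standard xml.etree.ElementTree module that extracts the reference steps from each of the two candidate serializations above. The helper names steps_from_nested and steps_from_list are hypothetical, and the sketch assumes that only nameref and idref elements take part in a reference.

```python
import xml.etree.ElementTree as ET

NESTED = """<insert>
  <nameref value="chapter.separate">
    <idref value="ex.separate-1"/>
  </nameref>
</insert>"""

LIST = """<insert>
  <nameref value="chapter.separate"/>
  <idref value="ex.separate-1"/>
</insert>"""

# Assumption: these are the only element types that form reference steps.
REF_TAGS = {"nameref", "idref"}


def steps_from_nested(ref):
    """Walk down the chain of nested reference elements."""
    steps = []
    node = next((c for c in ref if c.tag in REF_TAGS), None)
    while node is not None:
        steps.append((node.tag, node.get("value")))
        node = next((c for c in node if c.tag in REF_TAGS), None)
    return steps


def steps_from_list(ref):
    """Read sibling reference elements until a non-reference child appears."""
    steps = []
    for child in ref:
        if child.tag not in REF_TAGS:
            break
        steps.append((child.tag, child.get("value")))
    return steps


nested_steps = steps_from_nested(ET.fromstring(NESTED))
list_steps = steps_from_list(ET.fromstring(LIST))
```

Both helpers return the same list of (type, value) pairs for these inputs. The nested form needs a descent loop, while the list form needs a rule for where the list ends.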
We should also consider that when it comes to citations, they can have things nested inside them. Thus
<citation>
<nameref value="chapter.separate">
<idref value="ex.separate-1"/>
</nameref>
</citation>
Can become:
<phrase>
<citation>
<nameref value="chapter.separate">
<idref value="ex.separate-1">
Foo bar
</idref>
</nameref>
</citation>
</phrase>
which would make the list form:
<phrase>
<citation>
<nameref value="chapter.separate"/>
<idref value="ex.separate-1"/>
Foo bar
</citation>
</phrase>
Which of these is easier to process? Or should we come up with another form where the processing problem is more constrained? Maybe:
<phrase>
<citation>
<citation-elements>
<nameref value="chapter.separate"/>
<idref value="ex.separate-1"/>
</citation-elements>
Foo bar
</citation>
</phrase>
That would be much cleaner to process. Could do the nested format this way as well, but is there a benefit to it?
Maybe it would be easier still to do it this way:
<phrase>
<citation>
<citation-elements>
<citation-element type="nameref" value="chapter.separate"/>
<citation-element type="idref" value="ex.separate-1"/>
</citation-elements>
Foo bar
</citation>
</phrase>
That way you don't have to deal with different element types to extract all the elements of the reference.
It shouldn't be <citation-elements> and <citation-element>, as these apply to inserts as well. It should be <reference-elements> and <reference-element>.
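With a single element type per step, extraction no longer depends on the kind of reference. A minimal Python sketch, assuming the <reference-elements> and <reference-element> names proposed above and the type and value attributes from the earlier example (the function name reference_steps is hypothetical):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<phrase>
  <citation>
    <reference-elements>
      <reference-element type="nameref" value="chapter.separate"/>
      <reference-element type="idref" value="ex.separate-1"/>
    </reference-elements>
    Foo bar
  </citation>
</phrase>"""


def reference_steps(ref):
    """Return (type, value) pairs for a citation or insert element."""
    path = "./reference-elements/reference-element"
    return [(e.get("type"), e.get("value")) for e in ref.findall(path)]


citation = ET.fromstring(SAMPLE).find("citation")
steps = reference_steps(citation)
```

The same function would work unchanged on an insert element, since it only looks at the wrapper and its children.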
Can variables be inserted using compound identifiers?
>(#foo/$bar)
I don't see any reason why not at the syntax level. It would be up to the application layer to implement it accordingly.
Related question: should variables be allowed in citations?
[$moby page 12]
Currently this is just seen as a value citation. But if you do this:
[#Melville/$bananas page 12]
You get:
SAM parser ERROR: Invalid compound identifier at: #Melville/$bananas
That is fine in itself, but the consistency may not be apparent to the average user.
Another question about variables in compound identifiers is what positions they are allowed in. What would:
>($foo/*bar)
mean, for instance? Technically it should mean the object with the id bar inside the variable foo. But if you are doing those kinds of tricks I would seriously question the functional lucidity of your system.
You can't ban every possible piece of nonsensical markup, of course, but nor should we try to support cases whose semantics are as dubious as this.
The question remains, then, whether we should detect and raise an error for
[$moby page 12]
We should note that the semantics of this are very different from
[#moby page 12]
The latter is a citation by reference to an object. The former is citation by value with the value filled in by a variable. It is effectively the illegal:
[>($moby) page 12]
We don't allow nested markup elsewhere, so why make a special case for this? The no-nesting rule is one of the fundamental simplifying features of SAM markup, and violating it for minor cases just because the syntax would work does not seem appropriate.
While it makes sense that annotations should be wrapped around the phrase text for ease of processing, it does not make as much sense for citations. Rather than:
<phrase>
<citation>
<citation-elements>
<citation-element type="nameref" value="chapter.separate"/>
<citation-element type="idref" value="ex.separate-1"/>
</citation-elements>
Foo bar
</citation>
</phrase>
It would seem to make more sense to do:
<phrase>
<citation>
<citation-elements>
<citation-element type="nameref" value="chapter.separate"/>
<citation-element type="idref" value="ex.separate-1"/>
</citation-elements>
</citation>
Foo bar
</phrase>
This would match how citations work on block quotes.
Addressed in 1f20902624d29dab002353df8374952c63fff81d
By compound identifiers I mean ones that reference one resource inside another, such as a figure inside a particular chapter, as opposed to just referencing the figure itself. SAM does not have any syntax support for this, and it is not clear that it should. Below is an email exchange that explores the issue, which I am capturing here so it can be thought about more:
Hi Richard,
That is an interesting use case that I have not really thought through until now.
SAM does not provide any facility for dereferencing compound identifiers (#chapter.separate/ex.separate-1), but then again, neither does XML. That is purely an application layer thing.
But unlike XML, SAM provides specific syntax for dereferencing identifiers in the form of citation markup. And SAM also makes an explicit distinction between names and IDs at the authoring level, whereas in XML the distinction between ID/IDREF and all other forms of identifier creation and dereferencing that you may decide to invent at the application layer is not expressed in the markup, but only in the schema. The author does not know, when creating an attribute id="foo", if the id attribute is of type "ID" or not. Similarly, when they create a reference idref="foo", they do not know if the attribute idref is of type IDREF or not (and therefore if it must be resolved locally or not).
The decision in SAM to force the author to choose between an ID with a * and a name with a # was based on my distaste for hidden semantics. If I am forcing the author to create things that have different rules, then they should have a different form so that the distinction is clear.
But when I look at the case you propose, it is clear that this approach makes dereferencing of compound identifiers more problematic. If the compound identifier is in the form
Is ex.separate-1 an ID or a name? And based on the principle that the distinction between names and ids is explicit in markup, shouldn't it be:
[ #chapter.separate/#ex.separate-1]
Or:
The current XML serialization of SAM would render the above as:
That is by no means impossible to process. It requires the processor to break apart the nameref value, but any compound identifier is going to require that. On the other hand, the semantics are a bit wonky. This is not really a nameref anymore. The final resource being identified has an ID rather than a name. Still, there is nothing (currently) to prevent the application designer from implementing this in their markup language and its processors.
However, this will only work for names as the first identified resource. The following would produce a parser error:
This is an error because the parser will consider the entire string as an ID, which means it won't match.
In an XML vocabulary it is possible to create a reference to an ID without running into this issue by not declaring the reference to be of type IDREF. In SAM, the citation of an ID is always an IDREF.
Of course, when you are designing your markup language, nothing forces you to use IDs. You can use names for all references, and the current processing will give you free rein to create whatever conventions you like for dereferencing names at the application layer. So I am not sure that there is anything to be gained by creating more explicit support for compound identifiers in SAM.
In any case, compound identifiers are all about namespaces. In SAM (and XML) the namespace of an ID is the current file. In XML, any other identifier is an invention of the application layer, and the application layer can make the scope of its namespace anything it likes. It can make names global in scope but specific to different types of objects, so that a footnote named #foo is different from a figure named #foo or a page named #foo. But this assumes that you have different constructs for dereferencing the names, so that the reference to a footnote is different from a reference to a figure or a page.
SAM's names do not work like that. The dereferencing of a name by a citation makes no statement about what is being dereferenced. Thus
[#foo]
becomes a reference to a footnote because the application layer looks up what type of thing has that name and formats it accordingly. In other words, the format is determined based on the type of the object named, not the type of the reference. This means that SAM names have a global namespace with respect to types. The application layer could decide to restrict their namespace to the file in which they occur, or indeed to any subset of the docset it chooses, but it can't restrict it by type.
All of which suggests that it is best to treat SAM names as global in scope in all senses. That global scoping is implicit in the naming scheme that you use for names (and which I have attempted to follow), which is that the first part of the name is a type identifier (#figure.foo).
And if names are global in scope, you don't need to do:
Because the name #ex.separate-1 is global in scope anyway. So a reference to that name alone is all you need to identify that resource.
Because we are working one step back from DocBook, we could fairly construct "Figure 8.1 in Chapter 8" by looking back from the element with the name #ex.separate-1 to its parent chapter then constructing a reference in the form:
The only complication here is that we would have to look through all the files that make up the book to find chapter[.//*[name=$nameref]]. Easy enough to do in XSLT2 (or in SPFE). But this is one of those cases when you have to ask the design question of whether you want the author to do this lookup when they create the reference (and possibly get it wrong or have it grow stale) or whether you want the build to look it up at build time (and thus find it even if it has since moved to another chapter).
It is certainly true, though, that SAM's name facility is not as flexible as some of the naming conventions you could invent in an XML-based language, where you can use arbitrary attributes to create arbitrary addressing and dereferencing schemes. This is deliberate, because relationships based on arbitrary names don't scale well without big iron CMSs that add complexity and reduce functional lucidity. SAM wants you to manage relationships based on subject annotations as far as possible, not arbitrary names or IDs.
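The build-time lookup described above (finding the chapter that contains a named element, across all the files of a book) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not SPFE or SAM code: the file names, the <book> and <chapter> element names, and the use of a name attribute on elements are all hypothetical, and the files are represented as in-memory strings to keep the sketch self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialized source files for one book.
FILES = {
    "ch1.xml": '<book><chapter name="chapter.intro">'
               '<figure name="fig.overview"/></chapter></book>',
    "ch2.xml": '<book><chapter name="chapter.separate">'
               '<example name="ex.separate-1"/></chapter></book>',
}


def find_containing_chapter(files, nameref):
    """Return (filename, chapter name) for the chapter holding nameref."""
    for filename, xml_text in files.items():
        root = ET.fromstring(xml_text)
        for chapter in root.iter("chapter"):
            # Search every descendant of the chapter, excluding itself.
            for elem in chapter.iter():
                if elem is not chapter and elem.get("name") == nameref:
                    return filename, chapter.get("name")
    return None


result = find_containing_chapter(FILES, "ex.separate-1")
```

Because the lookup runs at build time, the author only writes the global name, and the chapter is found wherever the named element currently lives.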
SAM does have one other identifier dereferencing mechanism, however, and that is key citations.
SAM does not provide a mechanism for creating keys or any rules for managing them. That is entirely up to the application layer. One way it might do this would be:
If DITA has taught us anything it is that when you are creating compound identifiers, things will get out of hand as the scale increases and keys will help restore some order (or at least some management potential). Subject annotation is still infinitely preferable, where practical, but keys are sometimes a necessary fallback.
Sorry, that is very rambling, and I don't blame you if you skipped most of it. It is really just me thinking through the issues off the top of my head.
Mark