Add short form encoding alternative for highly predictable and regular ruby

martindholmes / rubyForTEI

Temporary workspace for fleshing out proposal for Ruby in TEI.

Apache License 2.0


Add short form encoding alternative for highly predictable and regular ruby #4

Open duncdrum opened 3 years ago

duncdrum commented 3 years ago

Import from main repo

Add a lightweight alternative means of encoding ruby for cases where complexity is not needed and a standoff approach could save roughly 50% of the markup.

See HZ04-004-01.pdf: 800+ pages, ruby on every character, no irregularities or special cases. Based on the proposal:

```
<ab>
  <r:ruby>
    <r:rb>This<anchor/>be<anchor/>text</r:rb>
    <r:rt place="left">Zis<anchor/><anchor/>tekst</r:rt>
    <r:rt place="right">Kindly<anchor/>regard this<anchor/>letter</r:rt>
  </r:ruby>
  for you.
</ab>
```

This switches the markup logic of the original proposal around: anchors and segments are not specified in the ruby base. Instead, a one-to-one relationship between the sequences in rb and rt is assumed by default, and markup is added only where this is not the case (<supplied>, <group>).

This would greatly reduce the markup load on long, regular documents. It is not intended to replace the full-fledged nested example, but to serve as an alternative for cases where more lightweight markup is desirable so as not to interfere with other markup.

```
<ab>
  <r:ruby>
    <r:rb>This be text</r:rb>
    <r:rt place="left">Zis <supplied reason="symmetry">bee</supplied> tekst</r:rt>
    <r:rt place="right">Kindly <group type="ruby">regard this</group> letter</r:rt>
  </r:ruby>
  for you.
</ab>
```

Question 1: Is there anything speaking against this, given that there is a full-fledged means of dealing with tricky cases, e.g. nested <ruby> elements?
Question 2: Chunk length. Technically, the whole 900-page PDF could be captured by something like this:
```
<body>
<r:ruby>
<r:rb>Here be 900 pages of the pdf</r:rb>
<r:rt place="left">and their ruby annotations</r:rt>
</r:ruby> 
</body>
```
We can leave it up to encoders to decide on acceptable chunk lengths, make a suggestion, or even limit the max length of <rb> via schema (not my preferred option, but it exists).
Question 3: Where to put <pb/>, <lb/>, etc.? Can or should they go into rb? Do we want to exclude that possibility?
Question 4 (aka Martin's nightmare): Should we point to the possibility of using &ZeroWidthSpace; (U+200B) to introduce otherwise invisible word separation?

see ab9b7d17cbbd9bca4dae2ab78e86498438a06b14

martindholmes commented 3 years ago

I have to say I really don't like this. I don't see why you shouldn't do it in your project if you want to, but it doesn't look like something I would want to recommend in the Guidelines.

duncdrum commented 3 years ago

Well, there are two questions about &ZeroWidthSpace;: 1) should the Guidelines point to available Unicode tools for dealing with certain problems (I'd say yes), and 2) do the Guidelines take a stand on if/how/when these should be used? I can readily be convinced that &ZeroWidthSpace; is a horrible idea for most projects and should be warned against in the Guidelines. You and I likely have different preferences here, but I still think that, in response to 1), the Guidelines should provide a pointer, even if it's a discouraging one.

sydb commented 3 years ago

On contents of Rubies (and thus @duncdrum’s Q2 & Q3)

I would really like to keep <ruby> as a phrase level element, such that <rb> and <rt> by definition contain only text, comments, processing instructions, and other phrase level elements. While we have a precedent with <app>, <rdg>, and <wit>, I submit we do not have enough experience with that to know whether or not allowing paragraph level or division level constructs inside an element that itself can be a phrase level element is overall helpful or harmful.

That said, I can see no reason to limit the length of a string in <rb> or <rt>. (If within <text>, it will likely be the case that the length of the strings is limited by the length of the paragraph level encoding; if within <sourceDoc> it will likely be the case that the length of the strings is limited by the number of characters on the surface encoded with <surface>.)

I should make clear, though, that I see absolutely no reason to exclude <lb>, <pb>, etc. from <rb> or <rt>.

On ZeroWidthSpace (U+200B, ZWSP; and thus @duncdrum’s Q4)

I don’t see any strong reason to include reference to Unicode characters that should not be used in marked up documents. I suppose U+200B is not shunned the way, say, U+E0001 is; nonetheless, seems to me it is in the same general category of “a character-level way to do something when you do not have access to markup”.

That said, I do not see any strong reason not to include reference to it, either. (Although I would be in favor of getting Ruby into the Guidelines first, and adding a “roads not taken” discussion of U+200B later, if discussion of it were to slow us down. And it is quite reasonable to think such a discussion should be in WD with discussion of all the various other Unicode characters that should be avoided in an XML document.)

On @duncdrum’s Q1

I don’t get it. This “force <rt> to be parallel to <rb>” was not part of the original proposal, was it?

But more importantly, I am not sure I see how this works. How does a processor know that the <group> in the <rt> corresponds to 1 word in the <rb> as opposed to 0 or 3? (I guess if it were 0 it would be a <supplied>, not a <group>?)

(BTW, <group> is not acceptable as the name for this element, we already have a <group> element that is for something very different — but that is just a detail to be haggled out later, if needed.)

duncdrum commented 3 years ago

Thank you @sydb, I'll prepare another PR, using the contents of the PDF as marked-up examples.

If everybody but me thinks U+200B (ZWSP) should be warned against, I would welcome the inclusion of this warning in the Guidelines, if only to diminish the likelihood of Martin having to deal with nightmarish documents. ;-)

> I don’t get it. This “force <rt> to be parallel to <rb>” was not part of the original proposal, was it?

No it was not, that's the point. Bopomofo was included in the original proposal; it can use whitespace characters to delimit word boundaries. CJK does not, not even punctuation marks. This doesn't make much of a difference if you only look at a single term like Billiard Hall, but it does if you want to encode the whole document of this PDF (you can step back in the commit history to get the full 900 pages). My proposal cuts the necessary markup by about half.

The max line length is 15 characters, so if we use <ruby> to capture typographic lines, each rb has 1-15 characters, and each rt has 1-15 (whitespace-delimited) annotations. They need to be parallel with respect to position in the sequence. I used group and supplied (I also wondered about the gap in the original post), which all seem suboptimal, but close enough for now to communicate the intention. Naming these, e.g. r:rg, r:rs, etc., is trivial, even with self-imposed deadlines looming. The point is that encoders can express unambiguous parallelism without inserting separators or xml:id in the base, and without <anchor/> with @from/@to constructs.

This is not suitable for complex cases, but again, here are 900 pages where complexity doesn't occur (and there are many more like it). Even within a complex document, I would like to restrict the use of complex markup to where it is absolutely necessary (such as multiple ruby streams annotating overlaying hierarchies of semantic units).

And a shout-out to @747 in the other thread: this document actually doesn't use emphasis markers; instead, just the punctuation marks appear next to the characters. One could argue over whether the punctuation marks are actually part of the rt stream or should be encoded as part of the base.

martindholmes commented 3 years ago

Re <lb/>, <cb/>, <pb/>: yes, these are surely required, since ruby-glossed bases do run across these boundaries. I would say that they should appear in both the rb and the rt, but I'm not 100% sure.

martindholmes commented 3 years ago

> If everybody but me thinks U+200B (ZWSP) should be warned against, I would welcome the inclusion of this warning in the Guidelines, if only to diminish the likelihood of Martin having to deal with nightmarish documents. ;-)

I don't think it should be warned against; I think it should not be mentioned at all. Certainly not in the initial simple implementation/guidance intended for the coming release of the Guidelines.

duncdrum commented 3 years ago

> Re <lb/>, <cb/>, <pb/>: yes, these are surely required, since ruby-glossed bases do run across these boundaries. I would say that they should appear in both the rb and the rt, but I'm not 100% sure.

Thank you. In the spirit of encouraging consistent markup, I would suggest including a code snippet that shows this. Currently the breaks in the various samples here appear outside of <p>, <ab>, etc. If they can appear inside rb and rt, I think we should show this.
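A minimal sketch of what such a snippet might look like, reusing the sample text from the opening post. Placing <lb/> in both rb and rt follows Martin's tentative suggestion and is an assumption, not settled practice. The r: prefix is the draft namespace used in the earlier examples and is left undeclared here, as it is there.

```xml
<ab>
  <r:ruby>
    <!-- Hypothetical: a line break falls between "be" and "text" in the base. -->
    <r:rb>This be <lb/>text</r:rb>
    <!-- The same break is repeated at the parallel position in the ruby text. -->
    <r:rt place="left">Zis bee <lb/>tekst</r:rt>
  </r:ruby>
  for you.
</ab>
```

Whether the break in rt should be recorded at all, or inferred from the base, is exactly the point that remains open.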

As for my Q1, this is significant. I think that the possibility of not using <anchor> elements as in the original proposal will rarely occur in Japanese documents but, as I've tried to show, it will be the norm in non-Japanese documents. So this partially hinges on our answer to #6.

@knagasaki what do you think about the option to assume a one-to-one relationship by default? If you see the value, I think we can come up with better elements rather quickly. <supplied> seems to work well, but <grouping> and <gap> do not. I'm not 100% sure we need something for <gap>, but we do need something to group strings to map them to a single character. See for example the section on Erhua on Wikipedia.

knagasaki commented 3 years ago

> > Re <lb/>, <cb/>, <pb/>: yes, these are surely required, since ruby-glossed bases do run across these boundaries. I would say that they should appear in both the rb and the rt, but I'm not 100% sure.

> Thank you. In the spirit of encouraging consistent markup, I would suggest including a code snippet that shows this. Currently the breaks in the various samples here appear outside of <p>, <ab>, etc. If they can appear inside rb and rt, I think we should show this.

> As for my Q1, this is significant. I think that the possibility of not using <anchor> elements as in the original proposal will rarely occur in Japanese documents but, as I've tried to show, it will be the norm in non-Japanese documents. So this partially hinges on our answer to #6.

> @knagasaki what do you think about the option to assume a one-to-one relationship by default? If you see the value, I think we can come up with better elements rather quickly. <supplied> seems to work well, but <grouping> and <gap> do not. I'm not 100% sure we need something for <gap>, but we do need something to group strings to map them to a single character. See for example the section on Erhua on Wikipedia.

The best solution depends on trends in programming languages (and, recently, their libraries) and on the skill sets of the people hired for a text encoding project. The mechanism of the Guidelines also provides several solutions implicitly. At least in the initial phase, the Guidelines should not mention such deep markup based on a few small examples. After initial publication, once some projects have actually adopted the elements, it should be discussed if necessary in the SIG, on TEI-L, on GitHub, and so on.

martindholmes commented 3 years ago

> The best solution depends on trends in programming languages (and, recently, their libraries) and on the skill sets of the people hired for a text encoding project. The mechanism of the Guidelines also provides several solutions implicitly. At least in the initial phase, the Guidelines should not mention such deep markup based on a few small examples. After initial publication, once some projects have actually adopted the elements, it should be discussed if necessary in the SIG, on TEI-L, on GitHub, and so on.

@knagasaki This is exactly how I feel; I think we need to get a simple implementation and a basic prose introduction done soon -- you have all been waiting long enough for this -- and then we can start raising tickets for specific issues which are more complicated. If the current suggested prose is acceptable as far as it goes, I would like to introduce one more example which shows a longer block of text with two or three ruby instances -- there's a nice example in the original proposal (Figure 1), although someone will have to provide the transcription for me because I can't decipher the calligraphy. I think the only thing we should consider adding to the schema at this point is <layout>/@rubyPlace, if that's what we think should be used to specify the default value for <rt>/@place.

knagasaki commented 3 years ago

> > The best solution depends on trends in programming languages (and, recently, their libraries) and on the skill sets of the people hired for a text encoding project. The mechanism of the Guidelines also provides several solutions implicitly. At least in the initial phase, the Guidelines should not mention such deep markup based on a few small examples. After initial publication, once some projects have actually adopted the elements, it should be discussed if necessary in the SIG, on TEI-L, on GitHub, and so on.

> @knagasaki This is exactly how I feel; I think we need to get a simple implementation and a basic prose introduction done soon -- you have all been waiting long enough for this -- and then we can start raising tickets for specific issues which are more complicated. If the current suggested prose is acceptable as far as it goes, I would like to introduce one more example which shows a longer block of text with two or three ruby instances -- there's a nice example in the original proposal (Figure 1), although someone will have to provide the transcription for me because I can't decipher the calligraphy. I think the only thing we should consider adding to the schema at this point is <layout>/@rubyPlace, if that's what we think should be used to specify the default value for <rt>/@place.

The default value of <rt>/@place should be above (or top) for horizontal text and right for vertical text. If it is difficult to switch it according to writing-mode or something like that, please use above (or top). As far as I know from the usage of other elements, above seems better.
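To make the two cases concrete, a hedged sketch. The unprefixed element names, the placeholder content ("base", "gloss"), and the use of the CSS writing-mode property inside @style are assumptions for illustration only.

```xml
<!-- Horizontal text: the ruby text sits above the base. -->
<p style="writing-mode: horizontal-tb">
  <ruby><rb>base</rb><rt place="above">gloss</rt></ruby>
</p>

<!-- Vertical text: the ruby text sits to the right of the base. -->
<p style="writing-mode: vertical-rl">
  <ruby><rb>base</rb><rt place="right">gloss</rt></ruby>
</p>
```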

martindholmes commented 3 years ago

Because the text-directionality is established by the use of @style, we would have to make the default assumption, in the absence of @place, dependent on parsing the @style attribute of the element or the nearest ancestor which has a defined text-directionality. We could describe this in the prose, but the other option would be to add Schematron which requires that /either/ every <rt> have @place /or/ there is a <layout>/@rubyPlace in the header. Of course, we can (and probably should) do both.
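The Schematron option could be sketched roughly as follows. This is only an illustration under stated assumptions: it supposes that <rt> and <layout> end up in the TEI namespace, and that the proposed (not yet existing) @rubyPlace attribute is added to <layout>.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- Assumption: rt and layout live in the TEI namespace. -->
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <sch:pattern>
    <sch:rule context="tei:rt">
      <!-- Passes if this rt carries @place, or if any layout element in the
           document supplies the (proposed, hypothetical) @rubyPlace default. -->
      <sch:assert test="@place or //tei:layout/@rubyPlace">
        An rt element without @place requires a default given by
        layout/@rubyPlace in the header.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

The rule only enforces that a value is available somewhere; deriving a default from the text-directionality in @style would still have to be described in the prose.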

knagasaki commented 3 years ago

> Because the text-directionality is established by the use of @style, we would have to make the default assumption, in the absence of @place, dependent on parsing the @style attribute of the element or the nearest ancestor which has a defined text-directionality. We could describe this in the prose, but the other option would be to add Schematron which requires that /either/ every <rt> have @place /or/ there is a <layout>/@rubyPlace in the header. Of course, we can (and probably should) do both.

Thank you for your explanation. I hope both would be implemented.