Fix USJ schema and the way altnumber and pub number are handled

kavitharaju commented 7 months ago

This PR includes

Minor updates in the USJ schema (as per discussion in #62)
- [x] Make the version number there same as that in USJ testsuite samples
- [x] Set 'marker' field as 'required'
- [x] Keep ca, cp, va, vp objects as separate elements in USJ, as they are in USFM (https://github.com/usfm-bible/tcdocs/commit/866440e493491504649358bf87d9ea5e5cd6e2c9)
  - [x] The code changes for this are copied to python/lib/usjproc.py as well
  - Reasoning behind it
    - They are separate char and para type markers in USFM, though not so in USX
    - They are sometimes used in between chapters and verse text, making it hard to tie them properly to a chapter or verse object. In this context, USX also seems to handle them as separate objects
- [x] Version bump for USJ to 0.2.4 in schema, script and testsuite samples, including the following changes:
  - White space handling in previous PR #65
  - Make 'marker' field required in USJ schema
  - Keep altnumber and pubnumber fields as separate objects
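
To illustrate what the 'marker' requirement means for a consumer, here is a minimal Python sketch. It is not the schema itself: the helper name `find_missing_markers` is hypothetical, and it assumes USJ content items are either plain strings or objects that may carry a nested `content` list. The root object is left out, since the change concerns content objects.

```python
from typing import Any, List


def find_missing_markers(content: List[Any], path: str = "content") -> List[str]:
    """Return the paths of content objects that lack a 'marker' field."""
    missing: List[str] = []
    for i, item in enumerate(content):
        if isinstance(item, str):
            continue  # plain text runs carry no marker
        here = f"{path}[{i}]"
        if "marker" not in item:
            missing.append(here)
        # Recurse into nested content (assumed to be a list when present).
        missing.extend(find_missing_markers(item.get("content", []), f"{here}.content"))
    return missing


# Illustrative data only; the char object deliberately has no marker.
sample = [
    {"type": "para", "marker": "p", "content": [
        {"type": "verse", "marker": "v", "number": "1"},
        "In the beginning ",
        {"type": "char", "content": ["no marker here"]},
    ]},
]
print(find_missing_markers(sample))
```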

kavitharaju commented 7 months ago

@mhosken While working on this, I couldn't find the code portion corresponding to this function https://github.com/usfm-bible/tcdocs/blob/9203ce01c89b410ff78b6a4683255ef655340480/python/scripts/usx2usj.py#L60C1-L67C23 in the python/lib module. Is there a place where the root object of USJ is formed and version number is set there?
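
For context, the step I am looking for would be roughly the following. This is only a sketch: `make_usj_root` is a hypothetical name, not the function in usx2usj.py, and the root shape (`type`, `version`, `content`) is my assumption based on the testsuite samples.

```python
import json
from typing import Any, Dict, List

USJ_VERSION = "0.2.4"  # the version this PR bumps to


def make_usj_root(content: List[Any], version: str = USJ_VERSION) -> Dict[str, Any]:
    """Wrap converted content in a root object that carries the version."""
    # Assumed root shape; check against the schema before relying on it.
    return {"type": "USJ", "version": version, "content": content}


root = make_usj_root([{"type": "book", "marker": "id", "code": "SIR", "content": []}])
print(json.dumps(root, ensure_ascii=False))
```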

mhosken commented 6 months ago

In response to the question of missing code in usjproc.py: guilty as charged. Sorry that got dropped accidentally. Do you want me to add it back in or do you want to do it? I would also refactor usx2usj to use usjproc rather than repeating code. In fact I would suggest refactoring the use case for usx2usj to use usfconv (which does any serialization to any other serialization) and do away with usx2usj completely.

Looking at this PR, I would suggest that this is not a good way to go regarding \vp and \va. Yes \vp is ambiguous in that it can occur as a way of tagging the published form of a verse and also it may occur as a simple character style. I would suggest that USX is stronger here and keep the information as attributes of the verse. This doesn't preclude also having character runs of type vp.

Another reason for not wanting to do this is that the simpler you can keep the mapping between USJ and USX, the better. Every special case costs more than a few lines of code: you have to document it, and every implementation has to track it. It's why I work so hard to keep special cases out of the USFM parser/generator and keep it all in the grammar file.

If you still feel strongly that you do want to follow USFM here, you also need to write the corresponding code to parse the sequence in USJ back into attributes in the USX data model.
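
A minimal sketch of what that reverse step could look like, in Python. The function name `fold_vp_into_verse` is hypothetical, and it assumes a `vp` char object counts as the published number only when it directly follows a verse object in the same content list; anything else is left alone as an ordinary character style.

```python
from typing import Any, List


def fold_vp_into_verse(content: List[Any]) -> List[Any]:
    """Merge a 'vp' char that directly follows a verse into verse['pubnumber']."""
    out: List[Any] = []
    for item in content:
        prev = out[-1] if out else None
        if (
            isinstance(item, dict)
            and item.get("type") == "char"
            and item.get("marker") == "vp"
            and isinstance(prev, dict)
            and prev.get("type") == "verse"
            and "pubnumber" not in prev
        ):
            # Attribute use: fold the text of the vp run into the verse.
            prev["pubnumber"] = "".join(
                s for s in item.get("content", []) if isinstance(s, str)
            )
        else:
            # Copy dicts so the caller's objects are not modified.
            out.append(dict(item) if isinstance(item, dict) else item)
    return out


usj_content = [
    {"type": "verse", "marker": "v", "number": "1a"},
    {"type": "char", "marker": "vp", "content": ["(1)"]},
    "Les livres de la Loi ",
    {"type": "char", "marker": "vp", "content": ["(2)"]},
    "de même que les autres Écrits",
]
folded = fold_vp_into_verse(usj_content)
print(folded[0])
```

Note that even this small function has to decide what "directly follows" means (for instance, whether intervening whitespace is allowed), which is the kind of special case every implementation would then have to track.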

kavitharaju commented 6 months ago

Our motivation for treating va and vp this way was to avoid "the special case" already present in USX, which keeps them as an attribute in one situation and as a new object in another. Wouldn't that be expensive for a tool working on USJ independently of USX?

mhosken commented 6 months ago

I think you have a special case whichever way you approach it. The advantage of keeping the attributes is that you are closer to the content model and the 'other' case is also a normal case (just another character style). I.e. the model and conversion is simpler. If you go with the USFM model for these, you have the same pain that the USFM processing has of explicitly handling these during conversion.

I don't see a value in users of USJ having a single way to handle vp whether it is being a published verse or merely a character style. The two contexts are dissimilar enough to warrant separate handling. (Why do we allow vp as a character style anyway?)

mvahowe commented 6 months ago

Why do we allow vp as a character style anyway?

Because it is impossible to typeset many ecumenical Bibles without it.

mvahowe commented 6 months ago

Preface to Sirach, NFC:

\p \vp (1)\vp* Les livres de la \w Loi\w* et des \w Prophètes|Prophète, prophétesse, prophétie, prophétiser\w* nous transmettent de nombreuses grandes leçons, \vp (2)\vp* de même que les autres Écrits qui les suivent

There are 35 "verses" before chapter 1. How would you like to do this without a vp character style? Or, alternatively, are you going to hold the 1.0 spec pending a discussion with the Vatican?

Below, printed examples of French Bible Society NFC and TOL (the official French catholic translation).

[Images: IMG_20240328_155547943, IMG_20240328_155514766]

mhosken commented 6 months ago

Thanks for the examples. Don't worry there are no plans to do away with the char style. I was merely sharing my ignorance. BTW even if we did decide to do away with the vp char style WHICH WE ARE NOT, we would keep it supported, if deprecated, until it really isn't around any more. IOW, Don't Panic.

mvahowe commented 6 months ago

French TOB (a major ecumenical Bible), showing why you can't sidestep the issue by treating the prologue as a USFM introduction: there's an actual introduction as well. (And, also, the prologue is (deutero)canonical text, so putting it in an introduction is akin to mistaking \s for \d and also a recipe for annoying non-Protestants.)

[Image: IMG_20240328_191816719]

mvahowe commented 6 months ago

IOW, Don't Panic.

Sorry, cross-posted comment.

I'm not panicking because, one way or another, everyone will keep doing the right thing with Bibles. The only risk is that they ignore any standard that makes that harder, or that they prefer any de facto standard that makes that easier.

If, today, you proposed to a room full of technicians some new standard with two completely different ways to represent exactly the same thing, the response would be somewhere between laughter and derision. That's precisely what USX does in this case. It's just one example of decisions with USX 3.0 in particular that, starting from scratch, would look like an obfuscated code joke. I get that it's hard to roll back those decisions for USX. But insisting on backwards compatibility with stupid, for eternity is... not guaranteed to drive adoption.

jonathanrobie commented 6 months ago

USX and USJ are both serializations of a single data model. If we need to change this in USJ, we should change it in USX at the same time. I do not yet have a strong feeling whether we should make this change, but I feel strongly that we should either make the change in both USX and USJ or not make the change.

jonathanrobie commented 6 months ago

How does the TOB currently do this? Was the TOB written in Paratext? Was it published via USX? What does the markup look like in USFM / USX?

mhosken commented 6 months ago

If I am understanding correctly, the concern is about the two different uses of \vp in USFM. I will call the first the parameter usage; it is modeled in USX by an attribute. The other I will call the character style use; it is modeled in USX by a <char> element with @style="vp". What is immediately noticeable is that in USX the two uses are clearly distinguished, with one using an attribute and one a styled element. Since this is the core data model for the standard, we start from that position and consider how each of these is serialised in the various formats.

In USX the two uses are simple and clear: @pubnumber (even though it doesn't have to be a number) and <char style="vp">. In USJ we are recommending that the same model be used. USFM, on the other hand, has difficulties because we don't want to add attributes to \v and \c. Instead we use magic character styles in a fixed position relative to the \v and \c to serialise the attribute. If we were starting afresh on USFM, we would do something different, but that magic character style is labelled \vp.

If we decided that we really didn't want to follow the USX model, then we would need to change the USX model to use <char style="vp"> in both cases, to more directly represent the USFM representation. We don't think this is the best solution and recommend sticking with the existing USX model. Hence the request for USJ to directly represent the USX model rather than the USFM serialisation model.
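
To make the "fixed position" point concrete, here is a rough Python sketch that classifies each \vp run in a USFM string. It is not how the grammar-driven parser works: the regular expressions and the name `classify_vp` are illustrative assumptions, and "fixed position" is simplified to "immediately after \v and its number".

```python
import re
from typing import List, Tuple

# A \vp ... \vp* run; group 1 is the published text.
VP_RUN = re.compile(r"\\vp\s+(.*?)\\vp\*")
# True when the text before a run ends with "\v <number> ".
PRECEDED_BY_VERSE = re.compile(r"\\v\s+\S+\s+$")


def classify_vp(usfm: str) -> List[Tuple[str, str]]:
    """Label each \\vp run as the 'attribute' use or the 'char' use."""
    results = []
    for m in VP_RUN.finditer(usfm):
        before = usfm[: m.start()]
        kind = "attribute" if PRECEDED_BY_VERSE.search(before) else "char"
        results.append((m.group(1).strip(), kind))
    return results


usfm = r"\p \v 1a \vp (1)\vp* Les livres de la Loi \vp (2)\vp* de même que les autres Écrits"
print(classify_vp(usfm))
```

The same marker ends up with two roles purely because of where it sits, which is the difficulty on the USFM side.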

mvahowe commented 6 months ago

And hence the request for USJ to directly represent the USX model rather than the USFM serialisation model.

I think it's ambitious (putting things politely) to call this issue "the USX model". It's an accident of XML syntax and of the Paratext internal processing model, before USX became a standard, a history which the committee claims to have put behind it.

I'm not seeing the two uses. In one case you are overriding the underlying versification and in the other case you are too. If I need to I can find you examples where both these forms happen in the same paragraph and for the same reason.

Last time we went around this, the conversation ended with use of vp to reorder partial verses in Zechariah, and with the committee's answer that Bible scholars needed to change their translation to fit the committee's markup. I still think that's not how things are supposed to work. vp is used in all sorts of ways in huge numbers of documents. You can't retrofit constrained semantics to the world's existing documents and expect those documents to still work as they did before. This is epistemology meets Tenet the film.

How does the TOB currently do this? Was the TOB written in Paratext? Was it published via USX? What does the markup look like in USFM / USX?

That was my own copy of TOB, which I believe to be the most recent edition. It certainly exists in Paratext. I'm not sure if it was translated or originated that way, but given UBS involvement I would think that it was at least translated that way. I don't think it's in DBL, so it probably hasn't been published in USX. I don't have access to the markup. Isn't there someone from UBS on the committee?

mvahowe commented 6 months ago

I'm not seeing the two uses. In one case you are overriding the underlying versification and in the other case you are too.

From French NFC (UBS):

<chapter number="1" style="c" sid="SIR 1" />
  <para style="ms1">PRÉFACE DU TRADUCTEUR GREC</para>
  <para style="p">
    <verse number="1a" style="v" pubnumber="(1)" sid="SIR 1:1a" /> Les livres de la <char style="w">Loi</char> et des <char style="w" lemma="Prophète">Prophètes</char> nous transmettent de nombreuses grandes leçons, <char style="vp">(2)</char> de même que les autres Écrits qui les suivent<note caller="+" style="f"><char style="fr" closed="false">PRÉFACE (1-2) </char><char style="fq" closed="false">Les livres de la Loi… les suivent </char><char style="ft" closed="false">: ou </char><char style="fqa" closed="false">La Loi, les Prophètes et les autres auteurs qui les ont suivis nous transmettent… </char><char style="ft" closed="false">– Le traducteur grec du <char style="bk">Siracide</char> mentionne ici les trois grandes parties de l'Ancien Testament hébreu ; voir </char><char style="em">La Bible, son unité, sa formation, son texte</char>.</note>. <char style="vp">(3)</char> Il faut donc féliciter le peuple d'Israël pour son instruction et sa sagesse. <char style="vp">(4)</char> Mais on ne doit pas seulement lire ces écrits pour devenir savant. <char style="vp">(5)</char> Ceux qui aiment s'instruire doivent être également capables d'en faire profiter les non-initiés, <char style="vp">(6)</char> et cela aussi bien par leurs paroles que par leurs écrits.</para>

@mhosken @jonathanrobie What different use cases are you seeing between

<verse number="1a" style="v" pubnumber="(1)" sid="SIR 1:1a" />

and

<char style="vp">(2)</char>

? What deep semantics am I missing here? In the first case we're printing a number in brackets and in the second case we are too. In the first case we also make 30 or so verses a whole partial verse, which is a horrible kludge of which the Bible tech world should repent but, regardless, on what logical basis does that kludge need to be syntactically connected to one of 30 or so places where we want to add a number in brackets?
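
To make the comparison concrete: a consumer that only wants the printed numbers has to look in both places. A minimal Python sketch using the standard library's ElementTree, over an abridged version of the NFC snippet above (the function name `published_numbers` is just for illustration):

```python
import xml.etree.ElementTree as ET

# Abridged from the NFC example; notes and word-level markup removed.
usx = (
    '<para style="p">'
    '<verse number="1a" style="v" pubnumber="(1)" sid="SIR 1:1a" /> Les livres de la Loi '
    '<char style="vp">(2)</char> de même que les autres Écrits '
    '<char style="vp">(3)</char> Il faut donc féliciter le peuple'
    '</para>'
)


def published_numbers(para: ET.Element) -> list:
    """Collect published verse numbers from both USX representations."""
    numbers = []
    for el in para.iter():  # document order
        if el.tag == "verse" and "pubnumber" in el.attrib:
            numbers.append(el.attrib["pubnumber"])  # attribute form
        elif el.tag == "char" and el.get("style") == "vp":
            numbers.append(el.text or "")  # character-style form
    return numbers


print(published_numbers(ET.fromstring(usx)))
```

Two branches, one kind of result.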

jonathanrobie commented 6 months ago

And hence the request for USJ to directly represent the USX model rather than the USFM serialisation model.

I think it's ambitious (putting things politely) to call this issue "the USX model". It's an accident of XML syntax and of the Paratext internal processing model, before USX became a standard, a history which the committee claims to have put behind it.

Actually, we are creating a formal model of the language, something which did not exist previously. For the first time, we have:

- A formal definition of the language
- A reference parser
- Test suites
- Railroad diagrams in the specification to explain the mappings to users
- Serializations to USFM, USX, and USJ, using the same underlying data model for interoperability across serialization formats

That's something we care about, one of the main reasons we are doing this work in the first place.

We can change the USX representation if that's the right thing to do. I don't think it makes sense for USJ and USX to be gratuitously different. I think we would do well to focus on what the internal model should be for this USFM markup and reflect our answer in both the internal model and serialization to USX and USJ.

jonathanrobie commented 6 months ago

@mhosken @jonathanrobie What different use cases are you seeing between

<verse number="1a" style="v" pubnumber="(1)" sid="SIR 1:1a" />

and

<char style="vp">(2)</char>

This is what I care about most: USFM can express both of these things, so we have to give them each an interpretation in our model. USX and USJ should each follow that interpretation.

But I think there's a significant difference between:

- A verse marker, and
- A character style marker

In the first case we're printing a number in brackets and in the second case we are too.

The print formatting does not define the semantics of the underlying markup. I am not (yet) sure that I know whether anything needs changing in our model, but I would resist any change that was based on print formatting rather than well-defined semantics for each marker.

I think you are proposing a change to our semantics. Can you be more clear about what that change is?

KentSpiel commented 6 months ago

I am somewhat confused by @mvahowe's objections, since I do not work in USX much or USJ at all. But the fact that \vp ...\vp* can be either a character style or an attribute on a verse does seem strange to me. Why can't vp always be an attribute on a <verse> object? This would obviate the need for a verse 1a to hang the first vp on.

<verse number="" style="v" pubnumber="(1)" sid="SIR 1:0" />

and likewise for all the rest

<verse number="" style="v" pubnumber="(2)" sid="SIR 1:0" />

That said (and I am just speaking from what seems logical to me) I would not put the Prologue in Chapter 1. I feel it should be either explicitly or implicitly in Chapter 0.

Implicit

<para style="ms1">PRÉFACE DU TRADUCTEUR GREC</para> <para style="p"><verse number="" style="v" pubnumber="(1)" sid="SIR 0:0" /> Les livres de la <char style="w">Loi</char> et des <char style="w" lemma="Prophète">Prophètes</char> nous transmettent de nombreuses grandes leçons, <verse number="" style="v" pubnumber="(2)" sid="SIR 0:0" /> de même que les autres Écrits qui les suivent, . . . </para> <chapter number="1" style="c" sid="SIR 1" />

The USFM would be:

\ms1 PRÉFACE DU TRADUCTEUR GREC \p \vp (1)\vp* Les livres de la \w Loi\w* et des \w Prophètes|Prophète\w* nous transmettent de nombreuses grandes leçons, \vp (2)\vp* de même que les autres Écrits qui les suivent, . . . \c 1

Explicit

<chapter number="0" style="c" sid="SIR 1" />
PRÉFACE DU TRADUCTEUR GREC
Les livres de la Loi et des Prophètes nous transmettent de nombreuses grandes leçons, de même que les autres Écrits qui les suivent, . . .

The USFM would be:

\c 0
\ms1 PRÉFACE DU TRADUCTEUR GREC
\p
\v 1 \vp (1)\vp* Les livres de la \w Loi\w* et des \w Prophètes|Prophète\w* nous transmettent de nombreuses grandes leçons,
\v 2 \vp (2)\vp* de même que les autres Écrits qui les suivent, . . .
\c 1

mvahowe commented 6 months ago

@jonathanrobie

This is what I care about most: USFM can express both of these things

Which two things? In terms of output and in terms of any user-comprehensible semantics I can think of, the two things are

- Arbitrary verse-like text output, e.g. '(1)'
- Um, arbitrary verse-like text output, e.g. '(2)'

USX has two ways to describe exactly the same thing. There's no extra expressivity that I can see. If you marked up v1 with a character style it would mean exactly the same thing. Also, does the schema stop me from doing exactly that?

@KentSpiel Is \c 0 legal? It's an honest question. I'm almost certain that it wasn't a decade ago because Paratext expects chapters and verses to count up from 1. The more extreme case is Greek Esther where UBS translations often have chapter 1 before chapter 1 and two chapter 3s separated by chapter B. A common use case for \cp and \vp is printing creative versification while allowing Paratext to pretend that every Bible in the world looks a lot like KJV.

(There's an equivalent potential issue with v0, but that "just works" since English speakers care about this. So, in Psalms, you can have canonical text, typically canonical titles, before v1. Several deuterocanonical books need that functionality, but for chapters.)

KentSpiel commented 6 months ago

No I don't think \c 0 is valid USFM. At least it would not work in Paratext, but that does not mean it couldn't be Valid. One would need to allow a chapter 0 in the project's versification. In other words it's a question of data integrity not structural integrity. That said, I don't think chapter 0 needs to be explicit. Like verse 0 it can be implied.

mvahowe commented 5 months ago

No I don't think \c 0 is valid USFM. At least it would not work in Paratext, but that does not mean it couldn't be Valid. One would need to allow a chapter 0 in the project's versification. In other words it's a question of data integrity not structural integrity. That said, I don't think chapter 0 needs to be explicit. Like verse 0 it can be implied.

We're way off the PR now, and I don't think there's an easy fix for the wider non-protestant versification issues. Chapter 0 probably should be "implied" since, like verse 0, no-one wants to print a zero in their Bible. The difference is that chapters contain verses, and many things break if you start typing verses before any chapter number. Off the top of my head you'd end up with all your ch0 content as part of mt1 or something.

Really, my only point here is that the USX way of representing the same vp information in different ways looks like an error, probably is an error, and therefore shouldn't be propagated into new standards such as USJ.

jonathanrobie commented 5 months ago

We're way off the PR now, and I don't think there's an easy fix for the wider non-protestant versification issues.

I agree that a PR is the wrong form for discussing this. Perhaps a shared doc would be better?

Really, my only point here is that the USX way of representing the same vp information in different ways looks like an error, probably is an error, and therefore shouldn't be propagated into new standards such as USJ.

If there is an error that we need to fix, I think we need to fix it in both USX and USJ. A pull request that changes just USJ does not do that.

But I think that starts with a clear shared understanding of the problem that needs to be fixed. I don't think we are there yet. I think a shared document would help:

- What is the problem? At the very least, it should be illustrated with marked-up examples.
- What solution do you propose? I think we need the same solution for USX and USJ.
- What other solutions should we consider together?

If we agree there is a problem, we should find a solution to it. It may or may not be this one, but I think it should be the same for both USX and USJ.

kavitharaju commented 5 months ago

Have sent a new PR with changes other than the vp related ones.

This PR could be kept as WIP until we make the required decision about it in USX (or the underlying data model).

jonathanrobie commented 5 months ago

I have created a shared document to help us understand the use cases and requirements that Mark and Kavitha have mentioned:

https://docs.google.com/document/d/1tBsihIxD8WBR6nFTmR9xPd98CepOPuK1U2b6leZgXiY/edit?usp=sharing

Can we discuss it there? I'm not convinced I understand the issues yet.

usfm-bible / tcdocs

Fix USJ schema and the way altnumber and pub number are handled #66
