USFM Spec Syntax and Semantics

mvahowe commented 1 year ago

This feels to me like an important issue. Syntactically, the spec has 'types' including 'paragraph' and 'character'. But I think that's a syntactic thing and, since JSON has its own syntax, I don't think we need to worry much about exposing that syntax via JSON.

For semantics, I largely took my cues from the spec chapters. So, eg, why did I treat \toc differently to \mt?

\toc is defined in the "Identification" section. In other words, the spec treats it as metadata. I think that the example for \toc makes this clear: they show USFM with three \toc values and then show how only two of them are rendered. (I think that example is quite contrived as, in almost every case, people pick one.) And since there's absolutely no consistency as to which toc levels, if any, are provided, or whether a \h is provided, robust code ends up iterating over all the options to pick one according to some arbitrary order. For that, a key-value structure seems like the natural JSON representation.
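A minimal sketch of that "iterate over the options" fallback, assuming the headers have already been parsed into a plain key-value dict. The sample values, the preference order, and the `pick_book_name` helper are all hypothetical; the spec does not prescribe any of them.

```python
# Hypothetical key-value headers for one book (illustrative values only).
headers = {"toc2": "Genesis", "toc3": "Gen", "h": "Genesis"}

# An arbitrary preference order, chosen for this sketch; the spec mandates none.
PREFERENCE = ("toc2", "toc1", "h", "toc3")


def pick_book_name(headers, preference=PREFERENCE):
    """Return the first non-empty header value in preference order, or None."""
    for key in preference:
        value = headers.get(key, "").strip()
        if value:
            return value
    return None


book_name = pick_book_name(headers)
```

With a key-value structure the fallback is one loop over dictionary lookups; no scan through paragraph-like objects is needed.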

\mt is in the "titles, headings and labels" section. (Note that the spec makes a distinction between titles and headings.) It says

Major title.
The key components in the title of a biblical book.
The variable # represents a portion of the title, with the lesser emphasis (relative weighting) being on the higher numbers.

The spec states quite explicitly that the mt tags are portions or components of a whole, and that "the title" is all of them together. That's why Proskomma and SOFRIA represent all the mt tags as multiple blocks in one sequence. The example does render the mt tags literally, unlike the toc example. So it makes no sense whatsoever to look up "all the \mt2 tags" unless you are intentionally trying to make word salad. According to the spec, the mt tags form a multi-block whole, and in SOFRIA that's a sequence of type "title".
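To illustrate the difference, here is a sketch using a simplified, hypothetical stand-in for a "title" sequence. This is not the actual SOFRIA schema; the `blocks`, `tag`, and `text` keys and both helper functions are assumptions made for the example.

```python
# Hypothetical, simplified "title" sequences: ordered blocks that keep their tags.
matthew = {
    "type": "title",
    "blocks": [
        {"tag": "mt2", "text": "The Gospel according to"},
        {"tag": "mt1", "text": "Matthew"},
    ],
}
mark = {
    "type": "title",
    "blocks": [
        {"tag": "mt2", "text": "The Gospel according to"},
        {"tag": "mt1", "text": "Mark"},
    ],
}


def render_title(sequence):
    """Join the blocks in document order, treating the title as one whole."""
    return " ".join(block["text"] for block in sequence["blocks"])


def all_with_tag(sequences, tag):
    """Collect every block with a given tag across sequences (the fragment lookup)."""
    return [b["text"] for s in sequences for b in s["blocks"] if b["tag"] == tag]


whole = render_title(matthew)
fragments = all_with_tag([matthew, mark], "mt2")
```

Reading the sequence in order yields a usable title, whereas collecting every mt2 block yields only repeated fragments that mean nothing without their mt1 partners.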

SOFRIA has different markup for character/word-level tags (wrappers) vs milestones (milestones) because the spec considers these to be different classes of tags, by putting them in different chapters of the spec.

In general, I think that we should be aiming for semantic equivalence to USFM, not paragraph/character equivalence. And, also, I think we need to keep reminding ourselves that the original semantics of USFM are about dead-tree publishing, not using in-memory trees as a cheap database.

mvahowe commented 1 year ago

(It is true that the USFM spec has no equivalent to my mark, ie an empty milestone. However the spec does say

Currently, USFM does not formally provide any standalone milestones. This may change with future updates to USFM 3.x, as use of milestones highlights specific needs.

so my story is that I'm just ahead of the curve here :-) Regardless, it isn't hard to turn the marks into start/end milestone pairs or a wrapper. And they do unclutter the markup.)

kavitharaju commented 1 year ago

Even though there are semantic differences between markers, when proposing a data structure we could try to bring some uniformity to how we form the objects. Simple key-values would definitely make lighter JSON. But we cannot use that throughout, as we have more structural and semantic information to be conveyed by our representation. If all components follow some similarity in terms of object keys and value types, it will serve as a better general-purpose model.

If we consider some components as metadata and others as more important data, I would say that is very use-case specific. For a different use case, even id or c could be metadata. Take AI, say: there, text in an introduction, peripheral, or sidebar could be as important as text in a chapter. So forming data models differently for different components will not be a good approach. The more we focus on use cases, the more JSONs we will end up building, which moves further away from the concept of a standard.

About introducing more concepts and terminology like mark, grafts, sequence etc.: we need to make things simpler for the user rather than expecting them to know all about USFM and more. So I lean more towards abstracting the finer features of USFM markers as much as possible through generalizations and common formats, and trying not to bring in more concepts than are necessary for a basic data representation model.

I can definitely see in the nature of USFM markers that they were meant for typesetting. But I see that USFM is not just used for that original purpose; it has been adopted as a format for data representation and even interchange. And that, I think, is where alternate formats like XML and JSON become relevant. So while defining a standard we could focus not just on what the original use case was but also on how it is and will be used.

mvahowe commented 1 year ago

If all components follow some similarity in terms of object keys and value types, it will serve as a better general purpose model.

I think we need to decide what "general purpose" means here. I designed SOFRIA with Scripture in mind. If that's the aim, I think it's ok to have Scripture-specific features. Yes, you could make everything work in a more generic way, just like you can build entire websites using just div and span. But it turns out that there's real value in knowing that your div is a paragraph in the "real" sense, or in knowing that a list item is a list item rather than a div with a particular collection of CSS attached to it. I think that the same goes for Scripture markup.

I do want to support other types of content, but I plan to do that by adding other structures and then constraining Scripture to the USFM subset of those content types. If USFM really has to handle anything that anyone has ever wanted to type into Paratext, I think we should just pick some other completely generic document format and have done with it.

kavitharaju commented 1 year ago

The purpose of the JSON, as I understand it, would be to faithfully represent the contents of USFM. We can try to represent the inherent semantics already present in USFM, but not add to or change them. Nor should we prioritize or de-prioritize USFM components based on whether they are to be printed or not.

By general purpose, I don't mean to go beyond the scope of what USFM has. For that we can resort to keeping the schema extendable, as you said.

mvahowe commented 1 year ago

@kavitharaju I think this is the fundamental point on which we need to either find agreement or realise that we're trying to do incompatible things.

You seem to be taking "the content of USFM" to mean something at a syntactic level. So USFM paragraphs should be paragraphs and USFM non-paragraphs should be something else. I think all that is an artefact of a long-obsolete syntax that no longer even exists once we move to JSON. If someone needs, basically, USFM with curly parentheses, that's fine with me. But I honestly don't know who outside of the standards committee is going to want to use it, since that approach is incomprehensible unless you've read the entire USFM spec and understand that, eg, "UTF8" is a paragraph when, for the rest of the world, it clearly isn't.

The semantics I'm interested in are the ones expressed, informally, in the words and examples of the spec.

So, eg, when the spec says "An optional character encoding specification" for \ide, I don't see any reference to paragraphs, or any rendering example, and this text is in a chapter called "identification", so I assume the semantics are that this is metadata that helps to identify the document rather than part of the document body, and that we therefore want a representation that is convenient for metadata-type usage.

When I see "normal paragraph" for \p I assume it's a "real" paragraph, especially as it's in the chapter called "paragraphs" (which shows that the creators of the spec draw a clear distinction between "quirk of USFM syntax" paragraphs and "real" paragraphs, since many syntactically-paragraph tags are not in the "paragraphs" chapter of the spec). And so on.

Transferring all the weirdness of USFM syntax into a completely different syntax isn't going to give us JSON that anyone wants to use, in my opinion. Maybe I'm wrong but, regardless, I'm trying to produce JSON that works for devs who have never read the USFM spec. If there's a middle ground so that one JSON representation works for more people, that's great! But I don't think I could sell the kind of approach you're suggesting, undiluted, to my existing user base.

As for which JSON is easier to process, I think we could settle this empirically with a hacking competition :-)

kavitharaju commented 1 year ago

Let me clarify what I mean when I say paragraph.

It's not the paragraph-type in the sense in which USFM classifies all markers as paragraph-type, character-type etc.

It is the collection of markers that carry text contents as listed here, "normal paragraph" as you put it. Besides those, lists, tables and poetry will have to be treated the same way, but not markers like ide.

mvahowe commented 1 year ago

@kavitharaju Above you say

Simple key-values would definitely make lighter JSON. But we cannot use that throughout, as we have more structural and semantic information to be conveyed by our representation. If all components follow some similarity in terms of object keys and value types, it will serve as a better general-purpose model.

I read that as an argument for treating ide the same way as "real" paragraphs, rather than providing metadata as key-value (which is how any dev would do it if they were starting from scratch).

kavitharaju commented 1 year ago

Yes, it is an argument to treat "all" components alike (whether paragraph-type, normal paragraph, character-type or wrappers) with similar JSON objects, so that a developer knows what keys to expect when he gets an object. In our JSON, for example, ide will be like this:

{"tag": "ide", "type": "header", "value": "utf-8"}

and a paragraph p will be like this:

{"tag": "p", "type": "paragraph", "children": [...]}

The semantic difference between them is denoted by the value of the type / category field and by the presence of a nesting structure, using children, in one of them. There are only a handful of keys we use throughout: tag, type, value, and children (and additionally ref). So the level of expertise required to get started is very low for a newbie. For those willing to learn and understand the detailed semantics of markers, we do preserve all the info the USFM had.
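As a rough sketch of what that uniformity allows, the following walks a small document using only the tag, type, and children keys. The sample data, the "character" type for nd, the assumption that children may mix strings with nested objects, and the `walk` helper are all hypothetical.

```python
# Illustrative document in the tag/type/value/children shape described above.
doc = [
    {"tag": "ide", "type": "header", "value": "utf-8"},
    {"tag": "p", "type": "paragraph", "children": [
        "In the beginning ",
        # The "character" type for a nested marker is an assumption for this sketch.
        {"tag": "nd", "type": "character", "children": ["God"]},
    ]},
]


def walk(nodes):
    """Yield every object node depth-first, skipping bare text strings."""
    for node in nodes:
        if isinstance(node, dict):
            yield node
            yield from walk(node.get("children", []))


tags_and_types = [(node["tag"], node["type"]) for node in walk(doc)]
```

One traversal function covers every component because each object answers to the same keys; what to do with each type is left to the caller.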

mvahowe commented 1 year ago

I think we're disagreeing about what is hard. I'm hearing your argument like the argument that C is the easiest language to learn because there are so few basic commands and so everything is orthogonal. But the experience of most people with C is that everything is very hard (maybe in an equitable way).

I don't think that the difficulty is at the level of one attribute or three. The difficulty is knowing how to process that information, and making everything look like everything else provides no help with this. I saw a real case recently where someone was publishing \rem as canonical content in an app. Syntactically, with your approach, how does someone know not to do that?

kavitharaju commented 1 year ago

Regarding C, given a choice between sophistication and simplicity, I think I will pick simplicity. Maybe it's in my personality. I believe in the power of simplicity.

It will give others the freedom and flexibility to build on top of our format and solve their problems the way they want. They may not use the exact format, but extended or even converted formats. It does not prevent people from being as complex as they want or from treating the same data in diverse ways, depending on their use case. But the difficulty level will not scare them away, and rigidity will not prevent customization.

Regarding \rem (and \sts and \lit too), the type/category, being "comment", will help the user get rid of them if he doesn't need them. But for a translation project management app, they would be important. So let's not downgrade them in our representation just because they are not important in publishing.
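A minimal sketch of that filtering, assuming the tag/type/value/children shape shown earlier and that \rem carries its text in value. The `drop_types` helper and the sample data are hypothetical.

```python
def drop_types(nodes, unwanted=frozenset({"comment"})):
    """Return a filtered copy of nodes, omitting objects whose type is unwanted."""
    kept = []
    for node in nodes:
        if isinstance(node, dict):
            if node.get("type") in unwanted:
                continue  # e.g. rem, sts, lit when their type is "comment"
            node = dict(node)  # shallow copy so the input is left unmodified
            if "children" in node:
                node["children"] = drop_types(node["children"], unwanted)
        kept.append(node)
    return kept


# Illustrative input: one remark followed by one paragraph.
nodes = [
    {"tag": "rem", "type": "comment", "value": "Check spelling of names"},
    {"tag": "p", "type": "paragraph", "children": ["In the beginning"]},
]
publishable = drop_types(nodes)
```

A publishing app would call the filter; a project management app would simply skip it and keep the remarks.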

mvahowe commented 1 year ago

I haven't downgraded rem, I've put it in a different place to canonical scripture. By my definition of "simple", that's simpler than having everything in one long laundry list of paragraphs, half of which are not paragraphs.
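For comparison, a sketch of the "different place" idea: the same flat list regrouped so that headers become key-value metadata, remarks sit in their own list, and only the rest stays in the body. The output keys (metadata, remarks, body) and the `split_document` helper are hypothetical and are not the SOFRIA schema.

```python
def split_document(nodes):
    """Regroup a flat list of tag/type objects into metadata, remarks, and body."""
    doc = {"metadata": {}, "remarks": [], "body": []}
    for node in nodes:
        if node["type"] == "header":
            doc["metadata"][node["tag"]] = node["value"]  # key-value metadata
        elif node["type"] == "comment":
            doc["remarks"].append(node["value"])  # kept, but outside the body
        else:
            doc["body"].append(node)
    return doc


# Illustrative flat input in the uniform shape discussed in this thread.
flat = [
    {"tag": "ide", "type": "header", "value": "utf-8"},
    {"tag": "rem", "type": "comment", "value": "Draft 2"},
    {"tag": "p", "type": "paragraph", "children": ["In the beginning"]},
]
grouped = split_document(flat)
```

Nothing is discarded here; the remark is still available, but code that renders the body cannot publish it by accident.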

(I'm thinking now of an ancient computer magazine April Fools language called C+-. One of the features was "All the code goes on one line because that way you know which line has an error.")

alilland commented 1 year ago

Should the original USX (XML) version be preserved in the final USJ output when the source comes from parsed XML?

mvahowe commented 1 year ago

As long as the original USJ shall be preserved in the USX and indeed the USFM when we convert in the other direction.

mvh-solutions / nice-usfm-json

USFM Spec Syntax and Semantics #2