mvh-solutions / nice-usfm-json

Discussions, schema and other artefacts around the dream of a JSON format that respects the USFM spec and that developers don't hate. Too much. Most of the time.
MIT License

A direct conversion from USX to JSON #4

Closed kavitharaju closed 1 year ago

kavitharaju commented 1 year ago

Includes

kavitharaju commented 1 year ago

This JSON isn't what I am proposing or anything. Just started with the very basic USX to JSON conversion to get the discussion started. There are a few suggestions I have on this structure to begin with...

mvahowe commented 1 year ago

@kavitharaju Do we have an example with \cp and/or \ca? I was answering questions about this over the weekend. The USX way of doing this seems terrible to me, because this information ends up in completely different places depending on whether or not the \cp occurs just after the \c.

Incidentally, and contra the current position of the USFM committee as I understand it, people do use \cp to structure documents, instead of \c, which is not a surprise since the majority of Christians alive think that the \c divisions are wrong! The weekend conversation involved lectionaries, and almost everyone who is interested in producing lectionaries is close to the Catholic or Orthodox traditions. If our JSON is just for lexing, we can probably ignore this, but we still need a consistent way to represent \cp, \ca, \vp and \va. In app-facing models I think we need to nail down the semantics and support operations like "everything within \cp 3b". I'm going to do that in Proskomma but it would be better to agree the semantics more widely, rather than proceeding via de facto standards.
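As a starting point for that discussion, here is a minimal sketch of one possible reading of "everything within \cp 3b": everything after the \cp marker carrying that label, up to the next \cp marker. The flat item list, the `marker` and `value` keys and the `within_cp` name are all assumptions made for illustration; they are not Proskomma's model or anything agreed in this thread.

```python
def within_cp(items, label):
    """Return the items that follow the cp marker with this label,
    stopping at the next cp marker (one possible semantics only)."""
    selected = []
    inside = False
    for item in items:
        is_cp = isinstance(item, dict) and item.get("marker") == "cp"
        if is_cp:
            # Each cp marker opens a new published chapter and closes the previous one.
            inside = item.get("value") == label
            continue
        if inside:
            selected.append(item)
    return selected


# Hypothetical flat token stream in which cp divisions cut across a c chapter.
items = [
    {"marker": "c", "value": "3"},
    {"marker": "cp", "value": "3a"},
    "text A",
    {"marker": "cp", "value": "3b"},
    "text B",
    {"marker": "v", "value": "5"},
    "text C",
    {"marker": "cp", "value": "4"},
    "text D",
]
section = within_cp(items, "3b")
```

Note that under this reading a \c marker does not end the \cp range, which is exactly the kind of decision that would need to be agreed.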

kavitharaju commented 1 year ago

As I understand it, the USX way of handling \ca and \cp was to add them as altnumber and pubnumber attributes on the chapter element. That limits them to the \c-based versification structure and does not allow a different chapter division, for example in the middle of a chapter. I hope that is the concern you are raising, right?

In the new test cases in the USFM/X committee's repo I see a different way of handling this. There they are treated as separate elements, not as attributes of the chapter element. I hope this is a conscious change they are making (and not the inconsistency you were talking about). @joelthe1, please correct me if I am wrong.

Since I collected our samples from that repo, our JSON output also follows that new structure. You can view it in this commit
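
To make the two shapes concrete, here is a minimal sketch that converts one USX chapter element both ways. The sample USX values, the function names and the exact JSON keys (`type`, `style`, `children`) are assumptions for illustration only; the committee's test cases and the linked commit may name and nest things differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical USX chapter milestone carrying both alternate and published numbers.
usx = '<chapter number="3" style="c" altnumber="4" pubnumber="3b" sid="PSA 3"/>'
elem = ET.fromstring(usx)


def chapter_with_attributes(elem):
    """Attribute style: ca/cp stay as properties of the chapter object."""
    return {"type": "chapter", **elem.attrib}


def chapter_with_elements(elem):
    """Element style: ca/cp become sibling objects after the chapter object."""
    attrs = dict(elem.attrib)
    alt = attrs.pop("altnumber", None)
    pub = attrs.pop("pubnumber", None)
    items = [{"type": "chapter", **attrs}]
    if alt is not None:
        items.append({"type": "char", "style": "ca", "children": [alt]})
    if pub is not None:
        items.append({"type": "para", "style": "cp", "children": [pub]})
    return items


as_attributes = chapter_with_attributes(elem)
as_elements = chapter_with_elements(elem)
```

In the element style, a cp object could in principle also appear later in the content, away from the chapter object, which the attribute style cannot express.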

kavitharaju commented 1 year ago

Have made a few tentative changes to the structure. They are up for discussion and can be reverted/changed if needed.

kavitharaju commented 1 year ago

One issue I noticed in our script is that whether an object has children or not is determined by the number of items it has in the input USX. That is, if a \p contains just one text object, it is output as an object without nesting/children, while a \p with several items gets a children list. This makes the output inconsistent. I am planning to keep a list of object types that can be expected to have nested content and always give those a children attribute. Any thoughts on this?
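
A minimal sketch of that plan, assuming the current output puts a lone text item under a `text` key and multiple items under `children`. The key names, the `CONTAINER_TYPES` set and the `normalize` function are hypothetical and not taken from the actual script.

```python
# Hypothetical list of object types that should always carry a "children" list.
CONTAINER_TYPES = {"para", "char", "note"}


def normalize(node):
    """Return a copy of node in which container types always have 'children'."""
    if not isinstance(node, dict):
        return node  # plain text items are left untouched
    out = dict(node)
    if "children" in out:
        # Recurse so nested containers are normalized as well.
        out["children"] = [normalize(child) for child in out["children"]]
    if out.get("type") in CONTAINER_TYPES and "children" not in out:
        # Wrap the lone text item (if any) in a one-element children list.
        text = out.pop("text", None)
        out["children"] = [] if text is None else [text]
    return out


single = {"type": "para", "style": "p", "text": "In the beginning"}
multi = {
    "type": "para",
    "style": "p",
    "children": ["the ", {"type": "char", "style": "nd", "text": "Lord"}],
}
normalized_single = normalize(single)
normalized_multi = normalize(multi)
```

With a fixed list like this, a consumer can iterate over `children` for every paragraph without first checking which shape it received; types outside the list, such as verse markers, pass through unchanged.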