ubsicap / usfm

Unified Standard Format Markers

Formal grammar #85

Closed jaakristioja closed 5 years ago

jaakristioja commented 5 years ago

I'm having trouble understanding the exact grammar of even very basic USFM, and I think the specification is very vague on this. For example, I don't understand whether a USFM file must start immediately with a \ or not; what the overall structure of the file is (whether some part of the file is considered a header, a body, etc.); where exactly certain markers can occur (e.g. can one use \ide to switch the character encoding mid-document?); whether certain identification markers are compulsory or optional; and so on.

Would it be possible to amend the specification with more formal grammar rules, e.g. written in BNF, EBNF or similar? This would make the specification far less ambiguous, and make it easier for developers like me to write correct parsers.

Thanks!

cmahte commented 5 years ago

If you look in the usfm.sty files that are usually available wherever the spec is, the style sheet contains a bit more information about where each tag is valid. Each entry has an \OccursUnder field. This should help guide you.

These \OccursUnder fields aren't part of the spec because they are customizable, but if you design for the default, anyone using a custom.sty file typically already knows that customizing the stylesheet puts them outside the formal expectation of full support.

cmahte commented 5 years ago

But a USFM file does always start with the \id tag, and the \id tag must always have the 3-character book code immediately following \id . This should be (and at one time was) in the specification.

However, you CAN have multiple \id lines in a single usfm file, and the usfm remains valid. This isn't specified as required or not, but my testing and queries on the subject suggest there is nothing invalid with a 2nd or 66th \id field in a single file.
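To make the \id rule concrete, here is a minimal Python sketch that checks whether a line is a well-formed \id line and whether a file starts with one. The function names are hypothetical, and the pattern for the book code (exactly three uppercase letters or digits, such as GEN or 1CO) is an assumption rather than something the spec states in this form.

```python
import re

# Assumption: a book code is exactly three uppercase letters or digits
# (e.g. GEN, MAT, 1CO). Anything after the code is treated as free text.
ID_LINE_RE = re.compile(r"^\\id ([A-Z0-9]{3})(?:\s+(.*))?$")


def parse_id_line(line):
    """Return (book_code, trailing_text) for an \\id line, or None if it does not match."""
    match = ID_LINE_RE.match(line.rstrip())
    if match is None:
        return None
    return match.group(1), match.group(2) or ""


def starts_with_id(text):
    """True if the first non-blank line of the text is a well-formed \\id line."""
    # Strip a leading byte order mark, if any, before looking at the lines.
    for line in text.lstrip("\ufeff").splitlines():
        if line.strip():
            return parse_id_line(line) is not None
    return False
```

This only validates the opening line; it says nothing about what may follow, which is exactly the part a formal grammar would have to pin down.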

cmahte commented 5 years ago

I agree with Jaak that at least the introductory tags \id \periph \usfm \ide \h \toc1,2,3 should have a more formal order defined in the USFM specification. There's no reason not to do so, and not having a defined order makes parsing files much more complicated.

jaakristioja commented 5 years ago

I'm not sure I understand the \OccursUnder logic, nor the exact relation between these style sheets and USFM. The analogy which comes to my mind is HTML and CSS, where CSS only specifies some additional presentational properties for the document. But \Marker, \Name, \Description, \OccursUnder, \Rank etc in these style sheets seem to indicate that these style sheets are more to USFM than CSS is to HTML.

Is there a specification for the style sheets as well? I was unable to locate a reference to it in the USFM spec.
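For what it's worth, here is how I currently picture the relation, as a rough Python sketch. It assumes that a stylesheet is a sequence of `\Field value` lines, that each entry begins with a \Marker line, and that \OccursUnder holds a space-separated list of parent markers. The function names and the sample entry in the usage note are hypothetical and not copied from the real usfm.sty.

```python
def parse_stylesheet(text):
    """Group `\\Field value` lines into one dict per \\Marker entry."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines, comments, and anything that is not a field
        field, _, value = line[1:].partition(" ")
        if field == "Marker":
            # A \Marker line opens a new entry keyed by the marker name.
            current = {}
            entries[value.strip()] = current
        elif current is not None:
            current[field] = value.strip()
    return entries


def may_occur_under(entries, marker, parent):
    """True if `parent` is listed in the \\OccursUnder field of `marker`."""
    allowed = entries.get(marker, {}).get("OccursUnder", "").split()
    return parent in allowed
```

With an illustrative entry such as `\Marker h`, `\Name h - File - Header`, `\OccursUnder id`, calling `may_occur_under(entries, "h", "id")` would return True, while `may_occur_under(entries, "h", "c")` would return False. If that reading is right, the stylesheet really is acting as part of the grammar.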

Would something like the following be valid USFM?

\id MAT Doesn't matter what I write here, because
\id GEN the specification doesn't seem to specify
\id GEN a strict format for these strings after the <CODE>.

A good formal grammar for USFM could rectify most such ambiguities (but not all).

cmahte commented 5 years ago

As far as I know, that is valid USFM, unless both GEN sections contain the same chapter number. I think any duplicate pre-chapter-1 material (any tag except a \c coming after the \id ) would make this a duplicate book as well:

\id GEN
\h Genesis
\mt1 Genesis
\id GEN
\c 1
\id GEN
\c 2

is valid, but

\id GEN
\mt1 Genesis
\c 1
\id GEN
\mt1 Genesis 
\c 2 

is not valid. Nor is

\id GEN
\c 1
\v 1
\id GEN
\c 1 
\v 2
etc.

nor is

\id GEN
\c 1
\c 1

Any repeated combination of \id book code and \c chapter number invalidates the file. This includes "chapter zero", i.e. the introductory material between an \id and the first \c that follows it.
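The rule as I understand it can be sketched in Python as follows. This is a simplification under stated assumptions: every marker of interest starts its own line, "chapter zero" is recorded as soon as any marker other than \c follows an \id, and the function name is hypothetical. It reflects my reading of the rule, not an official check.

```python
import re

# Assumption: each relevant marker starts a line, followed by an optional argument.
LINE_RE = re.compile(r"^\\(\w+)(?:\s+(\S+))?")


def find_duplicate_sections(text):
    """Return the (book, chapter) keys that occur more than once, in order found."""
    seen = set()
    duplicates = []
    book = None
    in_intro = False        # True between an \id and the next \c
    intro_recorded = False  # True once chapter zero was recorded for this \id

    def record(key):
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)

    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        marker, arg = match.group(1), match.group(2)
        if marker == "id":
            book = arg
            in_intro = True
            intro_recorded = False
        elif marker == "c":
            in_intro = False
            if arg is not None and arg.isdigit():
                record((book, int(arg)))
        elif in_intro and not intro_recorded:
            # Any other marker before the first \c counts as chapter zero.
            intro_recorded = True
            record((book, 0))
    return duplicates
```

An empty result means no repeated section was found. A repeated \id with nothing but a new \c after it passes, which matches the first example above.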

However, I don't represent any official USFM body. Any comments that disagree with this likely carry more weight than my understanding.

klassenjm commented 5 years ago

@cmahte Thank you for your helpful responses for Jaak.

I agree that the current documentation is not sufficient as a grammar. As Michael mentions, the usfm.sty stylesheet contains some additional definition and is, as suggested, more to USFM than CSS is to HTML.

I have added a basic description of stylesheet properties to README.md in the sty folder. Take a look there.

Also, in case it assists, let me refer you to a more formal grammar for use in checking USFM 3 content which is being developed here.