ubsicap / usx

Unified Scripture XML

Example of non-decimal verse and chapter IDs #46

Closed: Rolf-Smit closed this 3 years ago

Rolf-Smit commented 3 years ago

According to the documentation, the regex used for chapter and verse IDs, and also for the loc attribute found in references, uses the following character class to parse the part after the Paratext book ID: [a-z0-9\-,:]

I assume the dash - and colon : are there to separate verses from chapters, etc., because the documentation and samples clearly use them. But how is the comma , used? I don't see any example of a verse or chapter ID that uses a comma. I also can't find any example of a verse ID that is non-decimal and uses a-z.
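To make the question concrete, here is a minimal sketch of what that character class accepts. It assumes the full pattern is the book ID part `[A-Z1-4]{3} ?` followed by the class above, and that matching is anchored at both ends (as XML Schema patterns are). The sample values and the `accepted` helper are made up for illustration and are not taken from the USX samples.

```python
import re

# Assumption: book ID part + optional space + the character class quoted above.
ID_PATTERN = re.compile(r"[A-Z1-4]{3} ?[a-z0-9\-,:]*")


def accepted(value: str) -> bool:
    """Return True if the whole value matches the pattern."""
    return ID_PATTERN.fullmatch(value) is not None


# Hypothetical values, chosen only to exercise each part of the class.
for value in ["MAT 3:4", "MAT 3:4-3:5", "MAT 3:4,6", "MAT 3:4a", "MAT 3.4"]:
    print(f"{value!r}: {accepted(value)}")
```

Under these assumptions, a comma and a trailing letter both pass, while a dot does not, since `.` is not in the class.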

Edit: It even seems that, at least for the reference loc attribute, commas should be avoided or removed, according to the documentation: https://ubsicap.github.io/usx/elements.html#ref

Comma-delimited verses and chapters are split up as much as possible:

Example: Mt 3.4-5,6

becomes: <ref loc="MAT 3:4-3:5">Mt 3.4-5</ref>,<ref loc="MAT 3:6">6</ref>
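The splitting in the quoted example can be sketched roughly as follows. This is only an illustration of that one example, not how Paratext or any USX tool does it: the function name `split_reference` is hypothetical, and it assumes that a bare number after a comma is a verse in the most recently seen chapter and that the input has the simple shape "label chapter.verses,verses".

```python
def split_reference(book_id: str, text: str) -> str:
    """Split a comma-delimited reference like 'Mt 3.4-5,6' into <ref> elements."""
    label, rest = text.split(" ", 1)  # 'Mt', '3.4-5,6'
    chapter = None
    refs = []
    for index, part in enumerate(rest.split(",")):
        if "." in part:
            chapter, verses = part.split(".", 1)
        else:
            verses = part  # assumption: bare number reuses the last chapter
        if chapter is None:
            raise ValueError(f"no chapter known for {part!r}")
        if "-" in verses:
            start, end = verses.split("-", 1)
            loc = f"{book_id} {chapter}:{start}-{chapter}:{end}"
        else:
            loc = f"{book_id} {chapter}:{verses}"
        # Only the first piece keeps the book label in its display text.
        shown = f"{label} {part}" if index == 0 else part
        refs.append(f'<ref loc="{loc}">{shown}</ref>')
    return ",".join(refs)


print(split_reference("MAT", "Mt 3.4-5,6"))
```

Note that no loc value produced this way contains a comma; the comma only survives as plain text between the elements.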

I assume this applies only to references, since multiple references can easily be added? So commas are allowed in chapter and verse IDs?

Would be really nice if some samples could be added!

klassenjm commented 3 years ago

Hello @Rolf-Smit ,

Thank you for posting your observation. I believe you are correct: the comma should not be in this regex, and would not occur in an sid, eid, or loc. The example you quoted about comma-delimited verses is good support, since the separate components are broken apart. I cannot recall a reason for the comma being included, and unfortunately the history in this repository does not go back prior to the USX 2.6 schema.

A correction has been posted to the schema and the docs.

Jeff

Rolf-Smit commented 3 years ago

@klassenjm thanks for fixing this! But I think I found some more inconsistencies/issues.

One sample shown here: https://ubsicap.github.io/usx/master/elements.html#ref is this one:

<ref loc="MAT-LUK">Mt—Lk</ref>

However the Regex does not seem to allow for book ranges: [A-Z1-4]{3} ?[a-z0-9\-:]*
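A quick way to check this, as a sketch: the pattern below is the one quoted above, anchored at both ends with `fullmatch` on the assumption that the schema pattern must match the whole attribute value. The helper name `is_valid_loc` is made up for this example.

```python
import re

# The loc pattern as quoted above (after the comma was removed).
LOC_PATTERN = re.compile(r"[A-Z1-4]{3} ?[a-z0-9\-:]*")


def is_valid_loc(loc: str) -> bool:
    """Return True if the whole loc value matches the pattern."""
    return LOC_PATTERN.fullmatch(loc) is not None


print(is_valid_loc("MAT 3:4-3:5"))  # verse range within one book
print(is_valid_loc("MAT-LUK"))      # book range from the documentation sample

# Show how far an unanchored match gets before it stops.
print(LOC_PATTERN.match("MAT-LUK").group())
```

The book range fails because the second book ID is uppercase, and the part after the first three characters only allows a-z, digits, dash, and colon.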

The linking attributes in USFM, as described here: https://ubsicap.github.io/usfm/linking/index.html#general-syntax, also do not seem to allow for book ranges, and the regex there still contains the comma: [A-Z1-4]{3} ?[a-z0-9\-,:]*. The same is true for the \xt marker in USFM: https://ubsicap.github.io/usfm/notes_basic/xrefs.html#xt. The link-href default attribute that can be used with \xt also uses the regex that includes a comma: [A-Z1-4]{3} ?[a-z0-9\-,:]*.

TLDR:

klassenjm commented 3 years ago

@Rolf-Smit Thank you very much for sending these notes. It is appreciated. I will review and make changes as needed, as soon as possible.

klassenjm commented 3 years ago

Moving USFM-specific items.