schierlm / BibleMultiConverter

Converter written in Java to convert between different Bible program formats

USX 3.0 support #38

Closed Rolf-Smit closed 3 years ago

Rolf-Smit commented 3 years ago

First of all, thanks a lot for this extremely useful library.

Today I tried to convert a USX 3.0 file to USFM, but it seems BibleMultiConverter only supports USX 2.6. I tried updating the schema after converting it from Relax NG Compact to XSD using Trang, but my knowledge of XML and schemas is lacking, and the converted schema contains a lot of errors, such as Unique Particle Attribution violations.

Anyhow, before I dive in deeper, do you have any plans to support USX 3.0? Java is no problem for me, however I usually don't work with XML, as a mobile developer XML is just not really part of the skill set I guess.

Is updating the schema even enough? Or does this also require a complete rewrite of the USX class and Usx class? I can imagine it does?

Update: I'm now working on adding USX 3.0 Import support and export for USFM 3.0. Once this works I might also start working on Export for USX 3.0 and Import for USFM 3.0.

Branch: https://github.com/Rolf-Smit/BibleMultiConverter/tree/feature/usx-3.0

Progress:

Rolf-Smit commented 3 years ago

Update:

I did some more digging and found that converting the Relax NG file from https://github.com/ubsicap/usx to an XSD file is not really possible in an automated way. Even with the most recent version of Trang, the result still requires many manual fixes.

The main difference between USX 3.0 and 2.5/6 is mostly the addition of the "Peripheral" feature set. For my requirements I only need to convert files that solely use the "Scripture" feature set.

I would like to make an open-source contribution to this project, but I'm not sure how to proceed. I can do a few things:

  1. Manually convert the Relax NG file to XSD including the "Peripheral" features. However I'm not even sure if the internal format of the BibleMultiConverter supports Peripheral parts. Then create a new AbstractParatextFormat implementation (USX3) for the conversion from and to the internal BibleMultiConverter format.
  2. Manually convert the Relax NG file to XSD but only with support for the "Scripture" feature set, then create a new AbstractParatextFormat implementation (USX3) for the conversion from and to the internal BibleMultiConverter format, but only with support for the "Scripture" feature set.

Note: Instead of converting the most recent Relax NG file for USX 3.0, I could also copy and alter the current usx.xsd file.

What do you think?

schierlm commented 3 years ago

Hello Rolf,

welcome and thanks for your willingness to contribute to this project.

First, about plans to support USX 3 (or in other words any new Bible format) and implement it myself: it mainly depends on how many Bibles are available/circulating in that format. As a ballpark figure, there should be at least a few dozen Bibles (in more than one language) available which are published in the new format primarily (or exclusively), or alternatively a few hundred Bibles which are available in the new format in addition to being available in other formats implemented by BibleMultiConverter. As ebible.org (the largest repository I know that has USFM/USFX/USX Bibles) still uses 2.6, there would have to be other sources (which may exist without my being aware of them) to make the format interesting for me.

For contributing an implementation of a new format, the threshold is a lot lower - if it compiles and if it is able to convert the test case Bibles to the new format without crashing, I'm fine with incorporating it (even more so if you are willing to follow this repo and don't mind getting issues about your format assigned to you). If you try to obtain software that can convert these formats (so probably Paratext) and compare the Paratext dumps before/after converting Bibles both with the official software and your contribution, even better.

Now in particular about USX 3:

As I see it (from a quick glance at the spec and the list of changes), USX3 is paired with a new USFM3 format, which introduces new meta tags (like \usfm), new character content and paragraph content formatting tags, and even some fundamental changes in how some concepts are handled (Strong numbers and morphology information are now handled via USFM attributes, for example, instead of the crude, not really standardized syntax used before). Therefore, I think the correct route for implementing this format would be to have an AbstractParatext3Format class (copied or derived from AbstractParatextFormat), which uses modified versions of the enumerations in the ParatextBook and ParatextCharacterContent classes (or even its own versions of them), and have this new class handle the conversion between the USFM3/USX3 concepts and BibleMultiConverter's. Then have USFM3, USX3 and ParatextDump3 formats (I think USFX has no Paratext 3 counterpart) and adapt ParatextConverter to detect whether the source/destination formats are ParatextFormats or Paratext3Formats and convert accordingly: go the fast route if they are from the same family, and convert via BibleMultiConverter's concepts (or even have a custom conversion, depending on how fancy you want it) if they are from different families. This may also depend on whether you want to convert your USX3 to USFM2 or USFM3.
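
The routing decision described above (fast route within one family, detour via the internal model across families) could be sketched roughly like this. All names here (`RoutePlanner`, `FormatFamily`, `Route`) are hypothetical and do not exist in BibleMultiConverter; the real ParatextConverter would need to inspect the actual format classes instead.

```java
// Hypothetical sketch: none of these types exist in BibleMultiConverter.
class RoutePlanner {

    /** Picks the direct route only when both formats belong to the same family. */
    static Route plan(FormatFamily source, FormatFamily target) {
        return source == target ? Route.DIRECT : Route.VIA_INTERNAL_MODEL;
    }

    public static void main(String[] args) {
        // USX3 -> USFM3: same family, so no detour through the internal model.
        System.out.println(plan(FormatFamily.PARATEXT_3, FormatFamily.PARATEXT_3));
        // USX3 -> USFM2: different families, so convert via the internal model.
        System.out.println(plan(FormatFamily.PARATEXT_3, FormatFamily.PARATEXT_2));
    }
}

/** Assumed grouping of formats: USFM2/USX2 in one family, USFM3/USX3 in the other. */
enum FormatFamily { PARATEXT_2, PARATEXT_3 }

enum Route { DIRECT, VIA_INTERNAL_MODEL }
```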

As I said, this was only a quick glance at the spec. Maybe it is possible to use the same AbstractParatextFormat and use some tagging to distinguish which of the paragraph or character content tags are Version 3 and how to best "downconvert" them to Version 2.6. So the old formats will have to downconvert all new features before outputting them. What I don't want is an exporter that exports USFM 2.6 files that contain USFM 3 tags or vice versa.

The USX2.6 Schema also needed a bunch of edits, mainly because the Relax-NG conversion got rid of all the enumeration values, which are very useful when writing a converter, as you can use code completion and other features to test if you implemented support for every allowed value. So I'm not surprised you have some difficulties converting the USX3 schema. If you are hitting a dead end at some point, tell me - I cannot promise I'll find time soon, but eventually I will find some time to look at it. You should not have to rewrite or edit any of the generated classes (if you need, please ask as there should be a way to avoid that).

I'd suggest not to "hijack" the old schema but make a new one. I don't care if you create it by manually updating the old schema, editing the converted new schema, or a mixture of both. You should update the known schema names in ValidateXML class and validate some real-world USX3 bibles to make sure your final schema is correct.

About Peripheral content - it again boils down to how common this feature is in "real-world" bibles. My USFM2.6 implementation does not support Extended Study Content or 3 alternate verse numbers for each verse supported by the spec, as these features require quite some effort to implement, yet are not used in most available bibles. BibleMultiConverter has support for a Bible Introduction, two Testament Introductions, Book/Chapter prologs (at the beginning of books/chapters), and an Appendix, which are each streams of the "FormattedText" elements (so just Rich Text without any semantic tagging). So you can at least map some of the peripheral content to those sections, and put all the rest either into the Bible Introduction or the Appendix. For your use case of converting USX3 to USFM3, it does not matter, as it should take the short cut route and will keep those tags as is.

Last but not least, sorry if I missed any questions. Kindly ask again :)

Regards,

Michael

Rolf-Smit commented 3 years ago

Hi Michael,

Thanks for answering my questions so quickly and in such detail.

I'm currently looking to at least reading USX3 into Java objects. The format I need is a simplified and custom version of USFM3 that is used by my mobile app. This means I don't necessarily need a direct conversion to official USFM3 or 2 for that matter, as soon as I have access to Java objects it is easy enough to write my own custom format. The reason for this custom format is that USFM when simplified and normalised is really fast and efficient to parse, faster than any XML parser even. When I first wrote this app speed was quite important. Ok enough background.

Currently I'm getting most of the Bibles from the Digital Bible Library (DBL), which uses USX3 as its official format. Especially now that Paratext 9 has been out for quite some time, the number of Bibles available in USX3 keeps growing. I was using this library as a step in between my own simplifier and normaliser tool, which only accepts USFM as input. By using BibleMultiConverter I was able to also process USX. As you can imagine, with the growing number of Bibles available in USX3 I really need to be able to convert from that format as well. I have considered moving to USX3 completely as the internal format used by the app, but XML parsing performance is just not fast enough to satisfy my needs ;)

I would love to add support for USX3 to this library, but the schema alone gives me headaches:

The schema provided by UBS is in the Relax NG format, which has no tooling whatsoever available to generate POJOs from (at least not in the Java world). Creating an XSD from this schema can be done, but it will never be as strict and descriptive as its Relax NG counterpart. For example, the Relax NG schema for USX3 defines Footnote and CrossReference as two completely different types, but in XML they both use the note element. There is no good way to represent that in an XSD file; in XSD those two would share the same type. Or take the chapter element, which can be either an EndChapter or a StartChapter. Due to the limitations of XSD (or maybe my lack of knowledge) these two would also end up sharing the same type, which means you need to look at certain attributes in code to check whether it is an end or a start chapter.

Anyhow...

I think I'm going to make an attempt at this, but I will start with an MVP implementation that supports only Scripture content and not Peripheral content and will only import USX3 and export USFM3. By looking at the differences between the USX3 and USX2 specification it indeed seems to make more sense to have an AbstractParatext3Format.

Btw: It seems more recent versions of Trang do generate the enumerations, but the Relax NG schema is so full of features that it basically generates an unusable SAX compliant XSD (ambiguity between different types that share the same element): usx3-con.xsd.txt

schierlm commented 3 years ago

OK, I have stumbled upon the DBL website a few times, but never could find out how mere mortals could register there to download their content...

About the Relax NG. It is nice that Relax NG can do such things as different elements with same name, on the other hand, this will make validation and parsing a lot slower. Perhaps even to the point where it gets Turing complete? Parsing C++ templates, unlike Java, is Turing complete and there is an infamous C++ program of 4 lines (of 80 chars each) that takes several hours to even validate it. When Paratext defines their own XML format, they could have used different tags for different elements and avoided this altogether.

Using JAXB binding mappings, you could assign different classes to the same element depending on which choice inside it is taken, but that does not help you here as the difference is in attribute presence (or even attribute value) and attributes may not be inside a choice, only subelements. You could still map the tag to two different possible classes, but when parsing from XML only one would ever get generated. So you'd probably have to live with unifying the elements into one and checking attributes in code to find out which one is the right one. Alternatively, run the XML through some transformations (XSL or otherwise) to disambiguate the tag names based on the presence of attributes before converting them into Java objects, which you will then have to reverse when exporting USX files.
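
The "unify the elements and check attributes in code" option might look roughly like the sketch below. It uses plain JDK DOM parsing rather than the generated JAXB classes, the `ChapterMilestones` class is hypothetical, and it assumes that in USX 3 a start chapter carries a `sid` attribute while an end chapter carries an `eid` attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of BibleMultiConverter.
class ChapterMilestones {

    enum Kind { START, END }

    /** Classifies a unified chapter element by the id attribute it carries. */
    static Kind classify(Element chapter) {
        // Assumption: only end milestones carry an "eid" attribute.
        return chapter.hasAttribute("eid") ? Kind.END : Kind.START;
    }

    /** Parses a USX snippet and classifies every chapter element in document order. */
    static List<Kind> classifyAll(String usx) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(usx.getBytes(StandardCharsets.UTF_8)));
            NodeList chapters = doc.getElementsByTagName("chapter");
            List<Kind> kinds = new ArrayList<>();
            for (int i = 0; i < chapters.getLength(); i++) {
                kinds.add(classify((Element) chapters.item(i)));
            }
            return kinds;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse USX snippet", e);
        }
    }

    public static void main(String[] args) {
        String usx = "<usx version=\"3.0\">"
                + "<chapter number=\"1\" style=\"c\" sid=\"GEN 1\"/>"
                + "<chapter eid=\"GEN 1\"/>"
                + "</usx>";
        System.out.println(classifyAll(usx)); // prints [START, END]
    }
}
```

With generated JAXB classes the same check would move to the getters for those attributes on the single unified chapter class.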

And about the enumerations, I think I was a bit unclear here. When having an xs:attribute that has an anonymous simple type restriction with enumeration values, JAXB will still not create enumerations from them. You'd have to change them to have a named simple type, which is defined as an enumeration. So even the new trang output will need some manual improvement.

Rolf-Smit commented 3 years ago

> OK, I have stumbled upon the DBL website a few times, but never could find out how mere mortals could register there to download their content...

Yeah, this is a bit of an issue. I got to join them, but it is quite a hassle; I'd prefer to talk about that in private.

> When Paratext defines their own XML format, they could have used different tags for different elements and avoided this altogether.

I don't see why they did not go this route; it makes a lot of sense from a usability perspective. Based on your experience with XSD and its possibilities, I think the easiest way to do this is to have unified models.

I'm almost done with the XSD file, as soon as I finish I will point you to a branch so you can keep an eye on the work and give me some tips along the way.

Rolf-Smit commented 3 years ago

Hi Michael,

First of all, thanks for helping me out here and answering so quickly and in such detail. Really appreciated!

I have a few thoughts/questions to share:

Reuse existing Paratext classes

It seems that, since the structure of USX2 and USX3 is really comparable, we can reuse the internal Paratext classes if we like. The main changes are basically additional Char and Para styles and the end chapter/verse milestones. But there is a small catch...

How do we make sure during an export that only supported attributes/elements/milestones are exported?

For example, ParagraphKind.PERIPHERALS (periph) is not supported in USFM2/USX2, yet it seems this tag can be imported by certain formats (USFX). How is it then filtered out during export?

It seems you have already answered this question here:

> As I said, this was only a quick glance at the spec. Maybe it is possible to use the same AbstractParatextFormat and use some tagging to distinguish which of the paragraph or character content tags are Version 3 and how to best "downconvert" them to Version 2.6. So the old formats will have to downconvert all new features before outputting them. What I don't want is an exporter that exports USFM 2.6 files that contain USFM 3 tags or vice versa.

However ParagraphKind.PERIPHERALS is not a new thing, it already exists, but I would not expect this tag to end up in a USFM 2 file (Looking at the specification I don't see any mention of the periph tag)

I'm obviously asking because I need to add some new ParagraphKinds such as po and qd which are new to USFM3/USX3. Exactly like you mentioned we need some way to filter those out, or downgrade those.

I was thinking about a few ways to solve this:

  1. Add versions to the *Kind enum values, so that we know exactly in which version a tag was added or removed. This allows for easy filtering and, with some hand-written code, would also allow certain tags/elements to be downgraded or upgraded. But this is a lot of manual work and could turn the code into spaghetti.

  2. Simply disallow direct conversion (skipping the internal BibleMultiConverter format) between different USFM/USX versions, this means you can convert from USX3 to USFM3 but not from USX3 to USFM2. We could still allow converting from USX3 to USFM2 using the internal BibleMultiConverter format, because that strips quite some information, but we can only be sure we end up with a compliant USFM2 or USFM3 file if we only use certain *Kinds that are supported between the two versions. So this does not feel like the perfect solution.

  3. Don't reuse the same Paratext models for USFM3/USX3. This means direct conversions are never possible between two different versions of USFM/USX. However, using the internal BibleMultiConverter format it would still be possible, with some loss of features. This also avoids having to add versions to certain *Kind enums, and is probably far more manageable if USFM/USX 4 arrives in the future.
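
To make option 1 a bit more concrete, here is a minimal sketch of version-tagged enum values. The types `UsfmVersion`, `SketchParagraphKind` and `VersionFilter` are made up for illustration and are not the real ParagraphKind enum; the only facts taken from the discussion are that po and qd are new in USFM3/USX3.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the real ParagraphKind enum in BibleMultiConverter looks different.
class VersionFilter {

    /** Returns the kinds from {@code used} that the target version does not know. */
    static List<SketchParagraphKind> unsupported(List<SketchParagraphKind> used, UsfmVersion target) {
        List<SketchParagraphKind> result = new ArrayList<>();
        for (SketchParagraphKind kind : used) {
            if (!kind.isSupportedIn(target)) {
                result.add(kind);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<SketchParagraphKind> used = Arrays.asList(
                SketchParagraphKind.PARAGRAPH_P,
                SketchParagraphKind.PARAGRAPH_PO,
                SketchParagraphKind.PARAGRAPH_QD);
        // Kinds that would have to be filtered or downgraded before a USFM 2 export.
        System.out.println(unsupported(used, UsfmVersion.V2)); // prints [PARAGRAPH_PO, PARAGRAPH_QD]
    }
}

/** Declared in ascending order so that compareTo reflects "older than". */
enum UsfmVersion { V2, V3 }

enum SketchParagraphKind {
    PARAGRAPH_P("p", UsfmVersion.V2),
    PARAGRAPH_PO("po", UsfmVersion.V3),
    PARAGRAPH_QD("qd", UsfmVersion.V3);

    final String tag;
    final UsfmVersion since;

    SketchParagraphKind(String tag, UsfmVersion since) {
        this.tag = tag;
        this.since = since;
    }

    /** A kind is supported if it was introduced in the target version or earlier. */
    boolean isSupportedIn(UsfmVersion target) {
        return since.compareTo(target) <= 0;
    }
}
```

This only covers tags that were added; tags that were removed or changed between versions would need more than a single `since` field.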

Any advice?

schierlm commented 3 years ago

> (Looking at the specification I don't see any mention of the periph tag)

The \periph tag existed in USFM 2.5 (https://ubsicap.github.io/usfm/usfm2.5/peripherals/index.html) as well as the USX2 schema (https://github.com/schierlm/BibleMultiConverter/blob/91949b9eda71b17a34925e7f3a54ffb2c7397523/biblemulticonverter-schemas/src/main/resources/usx.xsd#L305)

As I understand it, the main difference between USFM 2 and USFM 3 here is that USFM 2 allows arbitrary content (titles) in the \periph tag, while USFM 3 provides an enumeration of allowed values for it. That would mean the conversion can go 3->2 without problems, while 2->3 would have to check the values...

About how to handle the path, I'm fine with either way. If you disallow the direct conversion between version 2 and version 3, it's fine. If you have some list of "problematic tags" and disallow it only if one of them is included, even better. If you convert the problematic tags, I don't mind either. You can probably have convert functions at different depths of the type hierarchy to avoid a big ball of spaghetti mud converter, so you can call ParatextBook.convertTo(USFMVersion.V2) and it will delegate down or throw if not possible. Or have a map of character elements that are USFM3 only, with the value being the replacement USFM2 value (or e.g. map them to NORMAL if they should get removed). Mainly depends on how the types changed (if they only got new attributes or did not exist at all, or restricted their values).
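
A rough sketch of the "map of USFM3-only tags with their USFM2 replacement" idea, combined with the option to refuse the conversion instead. `TagDowngrader` is a hypothetical class, and the replacement values in the map are purely illustrative assumptions, not taken from the USFM specification.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not part of BibleMultiConverter.
class TagDowngrader {

    // USFM3-only tag -> USFM2 replacement. The replacements are illustrative guesses.
    private static final Map<String, String> V3_TO_V2 = new HashMap<>();
    static {
        V3_TO_V2.put("po", "p");
        V3_TO_V2.put("qd", "q");
    }

    /**
     * Returns the tag to use in a USFM 2 export.
     * In strict mode a USFM3-only tag aborts the conversion instead of being replaced.
     */
    static String downgrade(String tag, boolean strict) {
        String replacement = V3_TO_V2.get(tag);
        if (replacement == null) {
            return tag; // not a known USFM3-only tag, keep as is
        }
        if (strict) {
            throw new UnsupportedOperationException("Tag \\" + tag + " cannot be exported as USFM 2");
        }
        return replacement;
    }

    public static void main(String[] args) {
        System.out.println(downgrade("p", false));  // prints p
        System.out.println(downgrade("po", false)); // prints p
        try {
            downgrade("qd", true);
        } catch (UnsupportedOperationException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```

The strict flag corresponds to disallowing the direct conversion only when a problematic tag is actually present; the lenient path corresponds to converting the problematic tags.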

As the new chapter/verse end tags seem to exist only in USX3 and not in USFM3, probably you don't need to add them to the internal representation, but just synthesize them during export and strip them during import
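
Synthesizing end milestones on export and stripping them on import could be sketched as below. The string tokens (`v:` for a verse start, `v-end:` for a verse end, `t:` for text) are an invented stand-in for the internal representation, and `VerseMilestones` is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch using made-up string tokens instead of the real Paratext model.
class VerseMilestones {

    /** Export side: closes each verse before the next verse start and at the end of the stream. */
    static List<String> addEndMilestones(List<String> tokens) {
        List<String> out = new ArrayList<>();
        String openVerse = null;
        for (String token : tokens) {
            if (token.startsWith("v:")) {
                if (openVerse != null) {
                    out.add("v-end:" + openVerse);
                }
                openVerse = token.substring(2);
            }
            out.add(token);
        }
        if (openVerse != null) {
            out.add("v-end:" + openVerse);
        }
        return out;
    }

    /** Import side: drops all end milestones so the internal model never sees them. */
    static List<String> stripEndMilestones(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!token.startsWith("v-end:")) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> internal = Arrays.asList("v:1", "t:first", "v:2", "t:second");
        List<String> exported = addEndMilestones(internal);
        System.out.println(exported);
        // prints [v:1, t:first, v-end:1, v:2, t:second, v-end:2]
        System.out.println(stripEndMilestones(exported).equals(internal)); // prints true
    }
}
```

A real exporter would also have to place the end milestone correctly relative to paragraph boundaries, which this sketch ignores.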

Rolf-Smit commented 3 years ago

> The \periph tag existed in USFM 2.5 (https://ubsicap.github.io/usfm/usfm2.5/peripherals/index.html) as well as the USX2 schema...

I must have missed that. It is indeed there!

Rolf-Smit commented 3 years ago

@schierlm this took a bit longer than expected, but I think I'm there. Only thing I left out is:

Create USFM3 implementation of AbstractParatextFormat that exports to USFM 3.0 from the internal Paratext models.

Since for now I don't need import or export support for USFM 3, but I may add it in the future.

PR is here: https://github.com/schierlm/BibleMultiConverter/pull/39

I can imagine it takes some time to look at this, so take your time.

Rolf-Smit commented 3 years ago

@schierlm I'm closing this issue, as the PR has been merged.

schierlm commented 3 years ago

Ok.

Just a note, I may have to reduce the size of the USX Genesis book test cases, as I have to agree that for a 2MB total source code size, having >25% of it for a single test case may be a bit overkill. Do you have any preference which chapters to keep in the test? From a quick glance I would keep chapters 1 and 3, which is about 5% of the original size and about 2% of the total source code size.