schierlm / BibleMultiConverter

Converter written in Java to convert between different Bible program formats

Error converting USFX file #56

Closed by bmaupin 2 years ago

bmaupin commented 2 years ago
  1. I downloaded fraLSG_usfx.zip from here: https://ebible.org/details.php?id=fraLSG&all=1 and extracted all the files
  2. I downloaded this tool and ran the following command:

    java -jar BibleMultiConverter.jar USFX fraLSG_usfx.xml USX3

This was the error I got:

$ java -jar BibleMultiConverter.jar USFX fraLSG_usfx.xml USX3
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.sun.xml.bind.v2.runtime.reflect.opt.Injector (file:/home/user/Desktop/BibleMultiConverter/lib/jaxb-impl-2.2.11.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int)
WARNING: Please consider reporting this to the maintainers of com.sun.xml.bind.v2.runtime.reflect.opt.Injector
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
WARNING: Skipping unsupported tag outside of book: languageCode
WARNING: Unexpected tag: ide
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 0
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
    at java.base/java.util.Objects.checkIndex(Objects.java:372)
    at java.base/java.util.ArrayList.get(ArrayList.java:459)
    at biblemulticonverter.format.paratext.USFX.parseElement(USFX.java:273)
    at biblemulticonverter.format.paratext.USFX.parseElements(USFX.java:154)
    at biblemulticonverter.format.paratext.USFX.parseElement(USFX.java:178)
    at biblemulticonverter.format.paratext.USFX.parseElements(USFX.java:154)
    at biblemulticonverter.format.paratext.USFX.parseBook(USFX.java:122)
    at biblemulticonverter.format.paratext.USFX.doImportAllBooks(USFX.java:87)
    at biblemulticonverter.format.paratext.AbstractParatextFormat.doImportBooks(AbstractParatextFormat.java:282)
    at biblemulticonverter.format.paratext.AbstractParatextFormat.doImport(AbstractParatextFormat.java:82)
    at biblemulticonverter.Main.main(Main.java:66)
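
The `Illegal reflective access` warnings in this log are printed by the JDK (here Java 11) when JAXB (`jaxb-impl-2.2.11.jar`) reflects into `java.lang`; they are separate from the `IndexOutOfBoundsException` that aborts the run. As a sketch only, assuming the jar sits in the current directory as in the command above, a small shell wrapper (the name `bmc` is made up for illustration) can open that package with the standard `--add-opens` launcher option, which should keep the warnings from appearing:

```shell
# Hypothetical wrapper, not part of BibleMultiConverter.
# --add-opens is a standard java launcher option; it opens java.lang
# to code on the classpath so JAXB's reflective access is permitted.
bmc() {
  java --add-opens java.base/java.lang=ALL-UNNAMED \
       -jar BibleMultiConverter.jar "$@"
}
```

It would then be called as `bmc USFX fraLSG_usfx.xml USX3`. This only quiets the log; it does not affect the import error itself.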

This is the version of java I'm using:

$ java -version
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)

I'll attach the source files to save you some time.

Thanks!

bmaupin commented 2 years ago

fraLSG_usfx.zip

schierlm commented 2 years ago

Thank you for your bug report. Your Java version output is wrong, but probably you copy&pasted the wrong block.

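Some notes:

  • USX3 export format requires two additional arguments (directory name and filename pattern), as documented in help USX3
  • When converting between Paratext formats, you can usually get better results (i.e. less content dropped, at the expense of a more complex file) when using the ParatextConverter module, which will avoid the conversion to and from BibleMultiConverter's internal format
  • USFX is quite an exotic beast (outside of ebible.org I don't know any users). USFM import is better tested, especially for files that do not conform 100% to the USFM/USFX specification.
  • Last but not least, when converting from USFM to USX3, you may also want to try the "official" tool (i.e. Paratext). The "Basic Tier" which is available for free (as in beer, not as in speech) is sufficient for this.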

That being said, you found a genuine bug in the USFX import, and while importing the file I found another one (ClassCastException for blank paragraphs that are encoded in USFX as <b> tags). I will push a fix for them shortly.

If you want to try the fixed version, I would suggest that you use

    java -jar BibleMultiConverter.jar ParatextConverter USFX fraLSG_usfx.xml USX3 fraLSG_USX #-*.usx

bmaupin commented 2 years ago

> Thank you for your bug report. Your Java version output is wrong, but probably you copy&pasted the wrong block.

Indeed :laughing:

> Some notes:
>
>   • USX3 export format requires two additional arguments (directory name and filename pattern), as documented in help USX3
>   • When converting between Paratext formats, you can usually get better results (i.e. less content dropped, at the expense of a more complex file) when using the ParatextConverter module, which will avoid the conversion to and from BibleMultiConverter's internal format
>   • USFX is quite an exotic beast (outside of ebible.org I don't know any users). USFM import is better tested, especially for files that do not conform 100% to the USFM/USFX specification.
>   • Last but not least, when converting from USFM to USX3, you may also want to try the "official" tool (i.e. Paratext). The "Basic Tier" which is available for free (as in beer, not as in speech) is sufficient for this.

I knew USFX was to be avoided, but there are a lot of Bible sources on ebible.org that are harder to find elsewhere. I found a few at thedigitalbiblelibrary.org as USX files, but they only have a small subset. The USFX writeup on ebible.org made it sound like it holds more information than USFM, so I was hoping to retain as much information as possible during the conversion to USX. At any rate, the desired output was USX. I guess it would be worth comparing a USFM-to-USX conversion with a USFX-to-USX conversion to see if there really is any reason to use the USFX files as source files at all, even for converting.

> That being said, you found a genuine bug in the USFX import, and while importing the file I found another one (ClassCastException for blank paragraphs that are encoded in USFX as <b> tags). I will push a fix for them shortly.
>
> If you want to try the fixed version, I would suggest that you use
>
>     java -jar BibleMultiConverter.jar ParatextConverter USFX fraLSG_usfx.xml USX3 fraLSG_USX #-*.usx

Thanks for the tips and the quick fix!
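
One caveat when pasting the suggested command into a POSIX-style shell such as bash: an unquoted word that starts with `#` begins a comment, and `*` is subject to filename expansion, so the pattern `#-*.usx` may never reach the program. A minimal sketch, using a hypothetical `count_args` helper in place of the real `java -jar` call, shows the difference:

```shell
# Hypothetical stand-in for the converter: prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell treats "#-*.usx" as a comment, so 5 arguments arrive.
count_args ParatextConverter USFX fraLSG_usfx.xml USX3 fraLSG_USX #-*.usx

# Quoted: the pattern is passed through literally as the 6th argument.
count_args ParatextConverter USFX fraLSG_usfx.xml USX3 fraLSG_USX '#-*.usx'
```

Under that assumption, a safer form of the command above would end in `fraLSG_USX '#-*.usx'`, with the pattern in single quotes.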