schierlm / BibleMultiConverter

Converter written in Java to convert between different Bible program formats

single line OSIS file output #69

Closed ariasrodolfo closed 1 year ago

ariasrodolfo commented 1 year ago

Greetings, and first of all thank you for this tool. When converting from a MyBible file (and others) to OSIS format, the result is a file that contains the OSIS text on a single line, which is very difficult to edit. I run the program on a Debian Linux system.

To explain, this is what I expected:

<?xml version="1.0" encoding="UTF-8"?>
<osis xmlns=""
 xmlns:xsi=""
 xsi:schemaLocation="">
<osisText osisIDWork="" osisRefWork="bible" xml:lang="en" canonical="true">
<header>
 <work osisWork="">
  <title></title>
  <type type="OSIS">Bible</type>
  <identifier type="OSIS"></identifier>
  <rights type="x-copyright"> </rights>
  <scope> </scope>...........

however, the result is like:

<?xml version="1.0" encoding="UTF-8"?> <osis xmlns="" xmlns:xsi="" xsi:schemaLocation=""> <osisText osisIDWork="" osisRefWork="bible" xml:lang="en" canonical="true"> <header> <work osisWork=""> <title></title> <type type="OSIS">Bible</type> <identifier type="OSIS"></identifier>  <rights type="x-copyright"> </rights> <scope></scope>...........

That is, a single (almost infinite) line. I am looking for an OSIS file to which improvements can be made. Conversion to other XML-based formats like Zefania seems to be fine. Thanks.

schierlm commented 1 year ago

OSIS is a mess. Running it through a reformatter can in some cases introduce whitespace at locations where it is significant (resulting in extra whitespace in a Bible text that carries a lot of grammar information etc.). It is definitely not intended to be edited manually anyway; I'd rather use a text-based format like Diffable for that. Simpler formats like Zefania XML don't have that many options and therefore don't suffer from that problem, so I left the reformatter enabled when exporting to Zefania XML.

If you want reformatted OSIS files on Debian, I'd suggest installing libxml2-utils and piping your OSIS file through xmllint --format. Note that this reformatting may have the same effect as Java's built-in XML reformatter, but if you care about whitespace around tag boundaries (e.g. verses), you can convert both files to e.g. Diffable format and diff the output to check whether any whitespace changes were actually introduced.

If that is not an option for you, I can add an export option to BibleMultiConverter to produce reformatted OSIS files. I would not want to make it the default, though.

ariasrodolfo commented 1 year ago

What I wanted to achieve was to convert to OSIS, then modify the file and introduce some improvements, and finally export to The SWORD Project with the tools from that website. I am working with files that already contain Strong's numbers, lemmas and other tags; if reformatting can cause the tags to be lost or wrong, then obviously it is not suitable for me. However, I would suggest that you implement reformatting as a non-default option, or as a separate tool. I'll try libxml2-utils. Thanks.

schierlm commented 1 year ago

It is not the tags that can get lost, but the whitespace in between them, if an element contains only tags and whitespace (and no other text, i.e. no mixed content). When you have

<p><w lemma="strong:G1812">six hundred and</w> <w lemma="strong:G1835">sixty-</w><w lemma="strong:G1803">six</w></p>

it gets reformatted as

<p>
    <w lemma="strong:G1812">six hundred and</w>
    <w lemma="strong:G1835">sixty-</w>
    <w lemma="strong:G1803">six</w>
</p>

which (when rendered as a text) is

six hundred and sixty- six

instead of

six hundred and sixty-six

But maybe you do not care about such small detail.
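To make the difference concrete, here is a minimal Python sketch (not part of BibleMultiConverter) that renders both variants of the <p> element above as text. It assumes a renderer that collapses every run of whitespace into a single space; the function name `rendered_text` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The one-line original and the reformatted variant, copied from the example above.
ORIGINAL = (
    '<p><w lemma="strong:G1812">six hundred and</w> '
    '<w lemma="strong:G1835">sixty-</w><w lemma="strong:G1803">six</w></p>'
)
REFORMATTED = """<p>
    <w lemma="strong:G1812">six hundred and</w>
    <w lemma="strong:G1835">sixty-</w>
    <w lemma="strong:G1803">six</w>
</p>"""


def rendered_text(xml_source):
    """Concatenate all text in the element and collapse whitespace runs.

    Assumption: the renderer treats any run of spaces/newlines as one space.
    """
    root = ET.fromstring(xml_source)
    return " ".join("".join(root.itertext()).split())


print(rendered_text(ORIGINAL))     # six hundred and sixty-six
print(rendered_text(REFORMATTED))  # six hundred and sixty- six
```

The newline that the formatter puts between the last two <w> elements is what turns into the extra space.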

If you have the word "and" outside the lemma,

<p><w lemma="strong:G1812">six hundred</w> and <w lemma="strong:G1835">sixty-</w><w lemma="strong:G1803">six</w></p>

it does not get reformatted anyway (the whole <p> stays on one line) and the problem does not happen. Therefore it is pretty rare, yet still I do not want to make it the default.
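The rule described here (an element is only re-indented when it contains nothing but child tags and whitespace) can be sketched as a small check. This is an illustration of the rule as stated in this comment, not the actual logic of xmllint or Java's formatter; `has_mixed_content` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Both <p> variants from the examples above.
ALL_WORDS_TAGGED = (
    '<p><w lemma="strong:G1812">six hundred and</w> '
    '<w lemma="strong:G1835">sixty-</w><w lemma="strong:G1803">six</w></p>'
)
AND_OUTSIDE_LEMMA = (
    '<p><w lemma="strong:G1812">six hundred</w> and '
    '<w lemma="strong:G1835">sixty-</w><w lemma="strong:G1803">six</w></p>'
)


def has_mixed_content(elem):
    """Return True if elem has child elements AND non-whitespace text of its own.

    Text belonging to elem itself is stored in elem.text (before the first
    child) and in each child's .tail (after that child).
    """
    if len(elem) == 0:
        return False
    pieces = [elem.text] + [child.tail for child in elem]
    return any(piece and piece.strip() for piece in pieces)


print(has_mixed_content(ET.fromstring(ALL_WORDS_TAGGED)))   # False
print(has_mixed_content(ET.fromstring(AND_OUTSIDE_LEMMA)))  # True
```

Under the rule above, only the first variant (no mixed content) is a candidate for re-indenting, which is where the stray whitespace can appear.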

ariasrodolfo commented 1 year ago

Thanks for the explanation, very graphic and illustrative; it helped me understand a lot. I used the xmllint tool as follows:

$ xmllint --format -o formatted-bible.xml 1line-osisbible.xml

The result was an OSIS document indented down to one chapter per line, as follows:

 .......
    <header>
      <work osisWork="Exported">
        <title>  </title>
      </work>
    </header>

    <div canonical="true" osisID="Gen" type="book">
      <title type="main">Genesis</title>
      <chapter osisID="Gen.1"><title type="chapter">Gn 1</title><verse osisID="Gen.1.1" sID="Gen.1.1"/>text text <w lemma="strong:H7225">text</w><w lemma="strong:H1254">text</w><w lemma="strong:H430">text</w> text <w lemma="strong:H8064">text</w> text text <w lemma="strong:H776">text.</w><verse eID="Gen.1.1"/><verse osisID="Gen.1.2" sID="Gen.1.2"/>text<w lemma="strong:H776">text</w> <w lemma="strong:H1961">text</w> <w lemma="strong:H8414">text</w> ......
      <chapter osisID="Gen.2"><title type="chapter">Gn 2</title><verse osisID="Gen.2.1" sID="Gen.2.1"/><w lemma="strong:H3615">text,</w> text, text text<w lemma="strong:H8064">text</w> text text <w lemma="strong:H776">text,</w> y <w lemma="strong:H3605">text</w> <w lemma="strong:H6635">text</w> text text text.<verse eID="Gen.2.1"/><verse osisID="Gen.2.2" sID="Gen.2.2"/>text <w lemma="strong:H3615">text</w>.......... 
      ....
    </div>
    <div canonical="true" osisID="Exod" type="book">
      <title type="main">Exodo</title>
      <chapter osisID="Exod.1"><title type="chapter">Ex 1</title><verse osisID="Exod.1.1" sID="Exod.1.1"/><w lemma="strong:H428">text</w> text text <w lemma="strong:H8034">text</w> text text <w lemma="strong:H1121">text</w> de <w lemma="strong:H3478">text</w> <w lemma="strong:H935"/> text text text <w lemma="strong:H4714">text</w> <w lemma="strong:H854">text</w> <w lemma="strong:H3290"/> text; «text <w lemma="strong:H376">text</w> text» ....
      <chapter osisID="Exod.2">........<
      ..................
      and so on

Here each line represents one complete chapter of a book; let's call it chapter-level indentation. This was satisfactory for me because at least it is easier to edit: many editors have restrictions on the amount of text per line, so editing a single huge line is very slow and unpleasant. Among graphical tools, Geany has given good results; others have not.

As for the problem of spaces between words, it seems to me that it can only appear when the file is indented word by word, and that seems excessive. I think a better way would be to indent only at the verse level, so that each line represents one verse. Given that each verse ends with a complete word, the problem of a space inside a single word, as described in your previous answer, could not arise. This could be done by identifying the unambiguous tag that starts each verse (<verse osisID=") and inserting a line break with indentation at that place, giving something like the following structure:

.......
    <header>
      <work osisWork="Exported">
        <title>  </title>
      </work>
    </header>

    <div canonical="true" osisID="Gen" type="book">
      <title type="main">Genesis</title>
      <chapter osisID="Gen.1"><title type="chapter">Gn 1</title>
                                <verse osisID="Gen.1.1" sID="Gen.1.1"/>text text <w lemma="strong:H7225">text</w><w lemma="strong:H1254">text</w><w lemma="strong:H430">text</w> text <w lemma="strong:H8064">text</w> text text <w lemma="strong:H776">text.</w><verse eID="Gen.1.1"/>
                                <verse osisID="Gen.1.2" sID="Gen.1.2"/>text<w lemma="strong:H776">text</w> <w lemma="strong:H1961">text</w> <w lemma="strong:H8414">text</w> ......
      <chapter osisID="Gen.2"><title type="chapter">Gn 2</title>
                                <verse osisID="Gen.2.1" sID="Gen.2.1"/><w lemma="strong:H3615">text,</w> text, text text<w lemma="strong:H8064">text</w> text text <w lemma="strong:H776">text,</w> y <w lemma="strong:H3605">text</w> <w lemma="strong:H6635">text</w> text text text.<verse eID="Gen.2.1"/>
                                <verse osisID="Gen.2.2" sID="Gen.2.2"/>text <w lemma="strong:H3615">text</w>.......... 
      ....
    </div>
    <div canonical="true" osisID="Exod" type="book">
      <title type="main">Exodo</title>
      <chapter osisID="Exod.1"><title type="chapter">Ex 1</title>
                                <verse osisID="Exod.1.1" sID="Exod.1.1"/><w lemma="strong:H428">text</w> text text <w lemma="strong:H8034">text</w> text text <w lemma="strong:H1121">text</w> de <w lemma="strong:H3478">text</w> <w lemma="strong:H935"/> text text text <w lemma="strong:H4714">text</w> <w lemma="strong:H854">text</w> <w lemma="strong:H3290"/> text; «text <w lemma="strong:H376">text</w> text» ....
      <chapter osisID="Exod.2">........<
      ..................
      and so on

Certainly, with chapter-level indentation the problem does not appear, because each chapter is self-contained. Even so, I think indenting at one verse per line should be satisfactory for anyone. If you have any idea how to build a tool that can do this, I suggest you implement it as an option; it could be helpful. Thanks.
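The verse-per-line idea could be sketched roughly like this in Python. It is plain text substitution, not an XML-aware formatter, and it assumes every verse start tag begins literally with <verse osisID=" as in the output shown above; `break_before_verses` and the sample chapter are made up for illustration.

```python
import xml.etree.ElementTree as ET

VERSE_START = '<verse osisID="'  # assumption: osisID is always the first attribute


def break_before_verses(osis_text, indent="        "):
    """Insert a newline plus indentation before every verse start milestone.

    Running it twice would add a second line break, so apply it once to the
    single-line output.
    """
    return osis_text.replace(VERSE_START, "\n" + indent + VERSE_START)


SAMPLE = (
    '<chapter osisID="Gen.1"><title type="chapter">Gn 1</title>'
    '<verse osisID="Gen.1.1" sID="Gen.1.1"/>text <w lemma="strong:H7225">text</w>'
    '<verse eID="Gen.1.1"/>'
    '<verse osisID="Gen.1.2" sID="Gen.1.2"/>text<verse eID="Gen.1.2"/>'
    '</chapter>'
)

formatted = break_before_verses(SAMPLE)
print(formatted)

# The result is still well-formed XML.
ET.fromstring(formatted)
```

The added newlines sit between the end milestone of one verse and the start milestone of the next, so no whitespace is introduced inside a word. Whether whitespace at that position matters for a given OSIS consumer would still need to be checked.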

schierlm commented 1 year ago

XML formatters usually do not know about content, so they do not care if they wrap each chapter, verse, lemma or book.

When there is any structuring content (paragraphs, headlines, tables, linegroups) or quotations present, they should also get wrapped. In case a chapter/verse has all words wrapped in lemma (as the previous example), they would also get wrapped. Single words (i.e. text content) will never get wrapped.

If your bible does not contain any structuring content, you may want to try OSIS with unmilestoned verses instead (see the output options of the OSIS format on how to do this, by passing a - as second parameter).

Then, instead of

<verse osisID="Gen.1.1" sID="Gen.1.1"/>Text<verse eID="Gen.1.1"/>

the structure will be

<verse osisID="Gen.1.1">Text</verse>

And structuring/wrapping, as well as manipulating the XML with e.g. XSL, will get a lot easier. The only drawback is that some OSIS tools require milestoned verse tags and won't be able to read this file afterwards.
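As a rough illustration of why the unmilestoned form is easier to process, here is a Python sketch that collects the text of each verse from a tiny fragment. The fragment is namespace-free and uses placeholder text, which is a simplification (a real OSIS file declares an XML namespace on the root element); `verses_by_id` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Unmilestoned verses: each <verse> element contains its own text.
UNMILESTONED = (
    '<chapter osisID="Gen.1">'
    '<verse osisID="Gen.1.1">text <w lemma="strong:H7225">text</w></verse>'
    '<verse osisID="Gen.1.2">text</verse>'
    '</chapter>'
)


def verses_by_id(xml_source):
    """Map each verse's osisID to the text it contains (tags stripped)."""
    root = ET.fromstring(xml_source)
    return {
        verse.get("osisID"): "".join(verse.itertext())
        for verse in root.iter("verse")
    }


for osis_id, text in verses_by_id(UNMILESTONED).items():
    print(osis_id, "->", text)
```

With milestoned verses, the <verse .../> elements are empty and the verse text sits between the start and end markers as siblings, so the same lookup would need to walk the document from each sID to the matching eID.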