Closed ariasrodolfo closed 1 year ago
OSIS is a mess. Running it through a reformatter can in some cases introduce whitespace at locations where it is significant (resulting in extra whitespace in a Bible text that carries a lot of grammar information, etc.). It is definitely not intended to be edited manually anyway; I'd rather use a text-based format like Diffable for that. Simpler formats like Zefania XML don't have that many options and therefore don't suffer from that problem, so I left the reformatter enabled when exporting to Zefania XML.
If you want reformatted OSIS files on Debian, I'd suggest installing libxml2-utils and piping your OSIS file through xmllint --format. Note that this reformatting may have the same effect as Java's built-in XML reformatter, but if you care about whitespace around tag boundaries (e.g. verses), you can convert both files to e.g. Diffable format and diff the output to check whether any whitespace changes were actually introduced.
If that is not an option for you, I can add an export option to BibleMultiConverter to produce reformatted OSIS files. I would not want to make it the default, though.
What I wanted to achieve was to convert to OSIS, then modify the file and introduce some improvements, and export it to The SWORD Project with the tools from that website. I am working with files that already contain Strong's numbering, lemma and other tags; if reformatting can cause the tags to be lost or wrong, then obviously it is not convenient for me. However, I might suggest that you implement reformatting as an option that is not the default, or as a separate tool. I'll try libxml2-utils. Thanks.
It is not the tags that can get lost, but the whitespace between them, if an element contains only tags and whitespace (and no other text, i.e. no mixed content). When you have
<p><w lemma="strong:G1812">six hundred and</w> <w lemma="strong:G1835">sixty-</w><w lemma="strong:G1803">six</w></p>
it gets reformatted as
<p>
  <w lemma="strong:G1812">six hundred and</w>
  <w lemma="strong:G1835">sixty-</w>
  <w lemma="strong:G1803">six</w>
</p>
which (when rendered as a text) is
six hundred and sixty- six
instead of
six hundred and sixty-six
But maybe you do not care about such a small detail.
If you have the "and" outside the lemma,
<p><w lemma="strong:G1812">six hundred</w> and <w lemma="strong:G1835">sixty-</w><w lemma="strong:G1803">six</w></p>
it does not get reformatted anyway (the whole <p> stays on one line) and the problem does not happen. Therefore it is pretty rare, yet I still do not want to make it the default.
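To make the effect concrete, here is a minimal Python sketch (not part of BibleMultiConverter) that renders both variants of the example as plain text. The helper name rendered_text is made up for illustration, and collapsing runs of whitespace into a single space is an assumption about how a typical renderer treats the text.

```python
import xml.etree.ElementTree as ET

# The one-line form and the reformatted form from the example above.
ORIGINAL = (
    '<p><w lemma="strong:G1812">six hundred and</w> '
    '<w lemma="strong:G1835">sixty-</w><w lemma="strong:G1803">six</w></p>'
)
REFORMATTED = """<p>
  <w lemma="strong:G1812">six hundred and</w>
  <w lemma="strong:G1835">sixty-</w>
  <w lemma="strong:G1803">six</w>
</p>"""


def rendered_text(xml_source):
    """Concatenate all text nodes and collapse whitespace runs to one space.

    Hypothetical helper: it only approximates what a Bible reader displays.
    """
    root = ET.fromstring(xml_source)
    raw = "".join(root.itertext())
    return " ".join(raw.split())


print(rendered_text(ORIGINAL))     # six hundred and sixty-six
print(rendered_text(REFORMATTED))  # six hundred and sixty- six
```

The line break that the formatter puts between the last two w elements becomes a space in the rendered text, which is exactly the unwanted gap in "sixty- six".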
Thanks for the explanation, very graphic and illustrative; it helped me understand a lot. I used the xmllint tool as follows:
$ xmllint --format -o formatted-bible.xml 1line-osisbible.xml
The result was an OSIS document indented down to one chapter per line, as follows:
.......
<header>
<work osisWork="Exported">
<title> </title>
</work>
</header>
<div canonical="true" osisID="Gen" type="book">
<title type="main">Genesis</title>
<chapter osisID="Gen.1"><title type="chapter">Gn 1</title><verse osisID="Gen.1.1" sID="Gen.1.1"/>text text <w lemma="strong:H7225">text</w><w lemma="strong:H1254">text</w><w lemma="strong:H430">text</w> text <w lemma="strong:H8064">text</w> text text <w lemma="strong:H776">text.</w><verse eID="Gen.1.1"/><verse osisID="Gen.1.2" sID="Gen.1.2"/>text<w lemma="strong:H776">text</w> <w lemma="strong:H1961">text</w> <w lemma="strong:H8414">text</w> ......
<chapter osisID="Gen.2"><title type="chapter">Gn 2</title><verse osisID="Gen.2.1" sID="Gen.2.1"/><w lemma="strong:H3615">text,</w> text, text text<w lemma="strong:H8064">text</w> text text <w lemma="strong:H776">text,</w> y <w lemma="strong:H3605">text</w> <w lemma="strong:H6635">text</w> text text text.<verse eID="Gen.2.1"/><verse osisID="Gen.2.2" sID="Gen.2.2"/>text <w lemma="strong:H3615">text</w>..........
....
</div>
<div canonical="true" osisID="Exod" type="book">
<title type="main">Exodo</title>
<chapter osisID="Exod.1"><title type="chapter">Ex 1</title><verse osisID="Exod.1.1" sID="Exod.1.1"/><w lemma="strong:H428">text</w> text text <w lemma="strong:H8034">text</w> text text <w lemma="strong:H1121">text</w> de <w lemma="strong:H3478">text</w> <w lemma="strong:H935"/> text text text <w lemma="strong:H4714">text</w> <w lemma="strong:H854">text</w> <w lemma="strong:H3290"/> text; «text <w lemma="strong:H376">text</w> text» ....
<chapter osisID="Exod.2">........<
..................
and so on
Here each line represents one complete chapter of a book; let's call it chapter-level indentation. This was satisfactory for me because it is at least easier to edit: many editors have restrictions on the amount of text per line, so editing a single-line file is very slow and unpleasant. Among graphical tools, Geany IDE has given good results; others have not.
As for the problem with the spaces between words, it seems to me that it can only appear when the file is indented word by word, and that seems excessive. I think a better way would be to indent only at the verse level, where each line represents one verse. Since each verse ends in a complete word, the problem of a space appearing inside a word, as you described in your previous answer, could not arise. This could be done by identifying the start tag of each verse (<verse osisID=") and inserting a line break at that place, giving something like the following structure:
.......
<header>
<work osisWork="Exported">
<title> </title>
</work>
</header>
<div canonical="true" osisID="Gen" type="book">
<title type="main">Genesis</title>
<chapter osisID="Gen.1"><title type="chapter">Gn 1</title>
<verse osisID="Gen.1.1" sID="Gen.1.1"/>text text <w lemma="strong:H7225">text</w><w lemma="strong:H1254">text</w><w lemma="strong:H430">text</w> text <w lemma="strong:H8064">text</w> text text <w lemma="strong:H776">text.</w><verse eID="Gen.1.1"/>
<verse osisID="Gen.1.2" sID="Gen.1.2"/>text<w lemma="strong:H776">text</w> <w lemma="strong:H1961">text</w> <w lemma="strong:H8414">text</w> ......
<chapter osisID="Gen.2"><title type="chapter">Gn 2</title>
<verse osisID="Gen.2.1" sID="Gen.2.1"/><w lemma="strong:H3615">text,</w> text, text text<w lemma="strong:H8064">text</w> text text <w lemma="strong:H776">text,</w> y <w lemma="strong:H3605">text</w> <w lemma="strong:H6635">text</w> text text text.<verse eID="Gen.2.1"/>
<verse osisID="Gen.2.2" sID="Gen.2.2"/>text <w lemma="strong:H3615">text</w>..........
....
</div>
<div canonical="true" osisID="Exod" type="book">
<title type="main">Exodo</title>
<chapter osisID="Exod.1"><title type="chapter">Ex 1</title>
<verse osisID="Exod.1.1" sID="Exod.1.1"/><w lemma="strong:H428">text</w> text text <w lemma="strong:H8034">text</w> text text <w lemma="strong:H1121">text</w> de <w lemma="strong:H3478">text</w> <w lemma="strong:H935"/> text text text <w lemma="strong:H4714">text</w> <w lemma="strong:H854">text</w> <w lemma="strong:H3290"/> text; «text <w lemma="strong:H376">text</w> text» ....
<chapter osisID="Exod.2">........<
..................
and so on
Certainly, with indentation by chapter the problem does not appear, because each chapter is self-contained. Even so, indenting at the level of one verse per line should, I think, be satisfactory for anyone. If you have any idea for a tool that can do this, I suggest you implement it as an option; it could be helpful. Thanks.
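The verse-per-line idea could be sketched in Python roughly as follows. This is only an illustration, not an existing BibleMultiConverter option: the function name break_before_verses is made up, and the pattern assumes that every verse start milestone is written with osisID as its first attribute, as in the output shown above.

```python
import re

# Matches a verse start milestone that is not already at the start of a line.
# Assumption: osisID is always the first attribute of the start tag.
VERSE_START = re.compile(r'(?<!\n)(<verse\s+osisID=")')


def break_before_verses(osis_text):
    """Insert a line break before every verse start milestone."""
    return VERSE_START.sub(r"\n\1", osis_text)


sample = (
    '<chapter osisID="Gen.1"><title type="chapter">Gn 1</title>'
    '<verse osisID="Gen.1.1" sID="Gen.1.1"/>text<verse eID="Gen.1.1"/>'
    '<verse osisID="Gen.1.2" sID="Gen.1.2"/>text<verse eID="Gen.1.2"/>'
    "</chapter>"
)
print(break_before_verses(sample))
```

Like any reformatting, this still adds whitespace between verses, so it only stays harmless as long as no verse boundary falls inside a word.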
XML formatters usually do not know about content, so they do not care whether they wrap each chapter, verse, lemma or book.
When there is any structuring content (paragraphs, headlines, tables, line groups) or quotations present, they will also get wrapped. If a chapter/verse has all its words wrapped in lemma tags (as in the previous example), those would also get wrapped. Single words (i.e. text content) will never get wrapped.
If your Bible does not contain any structuring content, you may want to try OSIS with unmilestoned verses instead (see the output options of the OSIS format on how to do this, by passing a - as second parameter).
Then, instead of
<verse osisID="Gen.1.1" sID="Gen.1.1"/>Text<verse eID="Gen.1.1"/>
the structure will be
<verse osisID="Gen.1.1">Text</verse>
Structuring/wrapping, as well as manipulating the XML with e.g. XSL, will then get a lot easier. The only drawback is that some OSIS tools require milestoned verse tags and won't be able to read this file afterwards.
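As a rough illustration of why container-style verses are easier to process, the Python sketch below collects the text of each verse with the standard library alone. The sample fragment and the helper name verse_texts are made up for this example, and the fragment omits the XML namespace that a real OSIS file declares, so the element lookup would need adjusting for real input.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with unmilestoned (container) verses and no namespace.
SAMPLE = """<chapter osisID="Gen.1">
<verse osisID="Gen.1.1">text <w lemma="strong:H7225">one</w></verse>
<verse osisID="Gen.1.2">text <w lemma="strong:H776">two</w></verse>
</chapter>"""


def verse_texts(xml_source):
    """Map each verse's osisID to the plain text inside that verse element."""
    root = ET.fromstring(xml_source)
    return {
        verse.get("osisID"): "".join(verse.itertext())
        for verse in root.iter("verse")
    }


for osis_id, text in verse_texts(SAMPLE).items():
    print(osis_id, text)
```

With milestoned verses the same lookup does not work, because the verse text is a sibling of the empty start and end tags rather than a child of a verse element.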
Greetings, and first of all thank you for this tool. When converting from a MyBible file (and others) to OSIS format, the result is a file that contains the text in OSIS but on a single line, which is very difficult to edit. I use the program on a Debian Linux system.
To explain, this is what is expected:
however, the result is like:
a single (almost infinite) line. I am looking for an OSIS file to which improvements can be made; conversion to other XML-based formats like Zefania seems to be fine. Thanks.