Closed (proycon closed this issue 3 years ago)
If everything in the text content (<t>) is put on a single line (without spaces or newlines), then everything validates fine.
This also shows that the cause of this issue is the spaces introduced by joining lines, which is behaviour we usually want:
<t-str>foo</t-str>
<t-str>bar</t-str>
The above should serialize as foo bar, with a space. The libraries do this correctly.
But... if we have an explicit linebreak:
<t-str>foo</t-str>
<br/>
<t-str>bar</t-str>
then this no longer makes sense and we want foo\nbar and not foo\s\n\sbar. I think this is the core of this issue.
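To make the desired behaviour concrete, here is a minimal sketch in plain Python. It does not use the FoLiA libraries and is not how they are implemented; `join_pieces` is a hypothetical helper, and an explicit linebreak is assumed to arrive as the piece "\n". Pieces are joined with a space, except directly before or after a linebreak.

```python
def join_pieces(pieces):
    """Join text pieces with single spaces, but never next to a linebreak.

    Hypothetical helper for illustration only; a "\n" piece stands in
    for an explicit <br/>.
    """
    out = []
    for piece in pieces:
        if piece == "\n":
            # Explicit linebreak: emit it without a separating space.
            out.append("\n")
        else:
            # Separate from the previous piece, unless that was a linebreak.
            if out and out[-1] != "\n":
                out.append(" ")
            out.append(piece)
    return "".join(out)


naive = " ".join(["foo", "\n", "bar"])       # what plain joining gives
joined = join_pieces(["foo", "bar"])         # no linebreak involved
broken = join_pieces(["foo", "\n", "bar"])   # explicit linebreak
```

With this sketch, `naive` is "foo \n bar" (the unwanted result), `joined` is "foo bar", and `broken` is "foo\nbar".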
I'm afraid we may have to add another chapter to our whitespace problems; this is the sequel to issue #88 ...
I have a paragraph with the following text:
This is produced by my latest additions to FoLiA-page (PageXML to FoLiA conversion, pagexml-br branch of foliautils). In addition, PageXML generates string annotations, which in turn relate back to the original PageXML:
The problem is that the offsets don't match up because of leading/trailing spaces. foliavalidator and folialint report the same:
This is the full text the library sees, which is also what both folia2txt and FoLiA-2text produce. I marked leading/trailing whitespace with an underscore for visibility:
Note the initial whitespace for all but the first line. So where I'd expect S\nJ we get S\s\n\sJ instead. I think this is unexpected behaviour and qualifies as a bug we'd want to fix. The offsets as reported in the FoLiA-page output seem correct to me.
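The offset mismatch can be reproduced in isolation with a small Python sketch. The strings are just the S/J example from above, and `matches_at` and `strip_around_newlines` are hypothetical helpers, not functions from foliautils or the FoLiA libraries. An offset computed against the expected text no longer points at the right character once spaces surround the newline, and it matches again after those spaces are removed.

```python
import re


def matches_at(text, offset, expected):
    """Check whether `expected` occurs in `text` exactly at `offset`."""
    return text[offset:offset + len(expected)] == expected


def strip_around_newlines(text):
    """Remove spaces and tabs directly before or after a newline."""
    return re.sub(r"[ \t]*\n[ \t]*", "\n", text)


expected_text = "S\nJ"    # what the offsets were computed against
actual_text = "S \n J"    # what the serialisation currently yields

ok_expected = matches_at(expected_text, 2, "J")                    # True
ok_actual = matches_at(actual_text, 2, "J")                        # False
ok_fixed = matches_at(strip_around_newlines(actual_text), 2, "J")  # True
```

This only illustrates the symptom; whether the fix belongs in the text serialisation or elsewhere is a separate question.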