Open rysavyjan opened 5 years ago
@rysavyjan thanks for the issue, but ATM I am not sure I understand what the problem is... please explain...
It is my understanding that the attr xml:space="preserve" only preserves the spaces in the text, like say Hello World, and it seems tidy will do that...
But in the Pretty Print output, maybe for XHTML human readability reasons, the open and close w:t tags will be on new lines... converted, as you show... Is the output valid? Need W3C XHTML references...
I have not used MS Word for ages, but can see it may render it, converting newlines to spaces... as your image shows...
So, what is wrong here?
I obviously think it is the latter... but open to reasoning...
Have you tried the --vertical-space auto option? This will remove most of the newlines in the output... how is that displayed? Suitable?
Otherwise, I can not think of any other configuration option, so this would have to be a new xhtml option... what name? expanded specs? doc suggestions, etc, etc... see OPTIONS.md...
Is there a sufficient use case for such an option?
Look forward to further feedback, comments, patches, PR, for testing... thanks...
Thank you for your ideas.
What I find interesting is that <w:t>Hello World!</w:t> or, for example, <w:t xml:space="preserve_DUMMY">Hello World!</w:t> is not split across multiple lines.
So we are using the following "dirty" hack for now:
sed -i 's/xml:space="preserve"/xml:space="preserve_HACK"/g' "${file}"
tidy -config "${CFG}" -output "${file}" "${file}"
sed -i 's/xml:space="preserve_HACK"/xml:space="preserve"/g' "${file}"
Maybe it could help others who might have the same problem...
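For anyone who prefers not to edit files in place, here is a minimal Python sketch of the same rename, format, restore idea. It is only an illustration: `format_with_hidden_preserve` and `toy_formatter` are hypothetical names, and `toy_formatter` is a stand-in that merely mimics the line breaking reported in this thread; it is not tidy.

```python
import re
from typing import Callable

PRESERVE = 'xml:space="preserve"'
# Same placeholder value as in the sed commands above.
PLACEHOLDER = 'xml:space="preserve_HACK"'


def format_with_hidden_preserve(xml_text: str, formatter: Callable[[str], str]) -> str:
    """Hide xml:space="preserve" from the formatter, then put it back."""
    if PLACEHOLDER in xml_text:
        # Restoring would otherwise rewrite attributes we did not create.
        raise ValueError("placeholder already present in the input")
    hidden = xml_text.replace(PRESERVE, PLACEHOLDER)
    formatted = formatter(hidden)
    return formatted.replace(PLACEHOLDER, PRESERVE)


def toy_formatter(xml_text: str) -> str:
    """Hypothetical stand-in for the formatter (NOT tidy).

    It puts the content of w:t elements on its own line, but only when the
    element carries a literal xml:space="preserve" attribute.
    """
    pattern = r'(<w:t xml:space="preserve">)(.*?)(</w:t>)'
    return re.sub(pattern, r"\1\n\2\n\3", xml_text, flags=re.DOTALL)


sample = '<w:t xml:space="preserve">Hello World!</w:t>'
print(repr(toy_formatter(sample)))
print(repr(format_with_hidden_preserve(sample, toy_formatter)))
```

With the stand-in formatter, the direct call inserts newlines around the text, while the wrapped call returns the sample unchanged. A real setup would pass a function that runs the actual formatter instead.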
Regarding --vertical-space auto, it doesn't help: the whole XML output ends up on one line. From the Tidy documentation:
If set to auto Tidy will eliminate nearly all newline characters.
> not sure I understand what is the problem... please explain...
The problem is that xml:space="preserve" tells the parser to preserve all whitespace, not just \x20.
Tidy adds 2 extra whitespace characters (assuming eol=LF) to each space (<w:t xml:space="preserve"> </w:t>) produced by MS Word, for example.
When an XML parser then reads the formatted document, each whitespace character is converted to a space and displayed as such. The end result is that the document is damaged and the end user has to use tricks to restore it to the desired form.
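To make the difference concrete, here is a small Python sketch using the standard library's ElementTree parser. The `xmlns:w` declaration is added only so that each fragment parses on its own, and `text_of` is a hypothetical helper name. The second fragment is an assumed rendering of the reformatted output described above (LF line endings, no indentation).

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, declared so the fragments are well-formed alone.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def text_of(xml_fragment: str) -> str:
    """Return the character data directly inside the root element."""
    return ET.fromstring(xml_fragment).text or ""


# A single space as written by the word processor.
original = f'<w:t xmlns:w="{W_NS}" xml:space="preserve"> </w:t>'
# Assumed shape after pretty printing: tags moved onto their own lines.
reformatted = f'<w:t xmlns:w="{W_NS}" xml:space="preserve">\n \n</w:t>'

print(repr(text_of(original)))     # ' '
print(repr(text_of(reformatted)))  # '\n \n'
```

The parser hands the application three whitespace characters instead of one, which matches the "2 extra" characters per space described above.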
Using the tidy option --new-inline-tags w:t does not help, and --new-pre-tags w:t makes it worse.
P.S. This is for tidy -xml -utf8 --config ~/tidyrc-xml.txt
tidyrc-xml.txt
> It is my understanding that the attr xml:space="preserve" only preserves the spaces in the text, like say Hello World, and it seems tidy will do that...
Generic XML does not have specific rules for what xml:space="preserve" means. It is just an indicator for the processing application (https://www.w3.org/TR/xml/#sec-white-space). If tidy-html5 is only intended for HTML-like documents, then OP's issue is out of scope for this tool. Otherwise there should be an option/mode that leaves whitespace completely alone in xml:space="preserve" tags.
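One way to pin down what "leaves whitespace completely alone" could mean is a before/after check like the Python sketch below. It is an assumption-laden illustration, not part of tidy: `preserved_texts` and `whitespace_untouched` are hypothetical names, and the check only looks at the direct text of elements that carry xml:space="preserve" themselves (it ignores inheritance by descendants and text after child elements).

```python
import xml.etree.ElementTree as ET
from typing import List

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def preserved_texts(xml_text: str) -> List[str]:
    """Collect the direct text of every element marked xml:space="preserve"."""
    root = ET.fromstring(xml_text)
    return [el.text or "" for el in root.iter() if el.get(XML_SPACE) == "preserve"]


def whitespace_untouched(before: str, after: str) -> bool:
    """True if a formatter left the text of all preserve elements unchanged."""
    return preserved_texts(before) == preserved_texts(after)


before = f'<w:r xmlns:w="{W_NS}"><w:t xml:space="preserve">Hello World!</w:t></w:r>'
# Assumed formatter output with tags moved onto separate lines.
after = (
    f'<w:r xmlns:w="{W_NS}">\n'
    '<w:t xml:space="preserve">\nHello World!\n</w:t>\n'
    "</w:r>"
)

print(whitespace_untouched(before, before))  # True
print(whitespace_untouched(before, after))   # False
```

Whitespace added between elements (outside the preserve element) does not trip the check; only changes inside the marked element do.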
We are formatting DOCX XML files with tidy and we have found that the following tag
is converted to
Unfortunately, such a document opened in Microsoft Word has spaces around Hello World!
We didn't find any workaround in the tidy configuration (latest version built from this repository).