htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

Option to do not break tag with xml:space="preserve" attribute #812

Open rysavyjan opened 5 years ago

rysavyjan commented 5 years ago

We are formatting DOCX XML files with tidy and we have found that the following tag

<w:t xml:space="preserve">Hello World!</w:t>

is converted to

<w:t xml:space="preserve">
Hello World!
</w:t>

Unfortunately, when such a document is opened in Microsoft Word, there are spaces around "Hello World!" [screenshot: word]

We didn't find any workaround in the tidy configuration (using the latest version built from this repository).

geoffmcl commented 5 years ago

@rysavyjan thanks for the issue, but ATM not sure I understand what is the problem... please explain...

It is my understanding that the attr xml:space="preserve" only preserves the spaces in the text, like say Hello World, and it seems tidy will do that...

But in the Pretty Print output, maybe for XHTML human readability reasons, the open and close w:t tags will each be on their own line... converted, as you show... is the output valid? Need W3C XHTML references...

I have not used MS Word for ages, but can see it may render it, converting newlines to spaces... as your image shows...

So, what is wrong here?

  1. Tidy's XHTML output, or
  2. MS Word's rendering of it?

I obviously think it is the latter... but open to reasoning...

Have you tried the --vertical-space auto option? This will remove most of the newlines in the output... how is that displayed? Suitable?

Otherwise, I cannot think of any other configuration option, so this would have to be a new XHTML option... what name? expanded specs? doc suggestions, etc, etc... see OPTIONS.md...

Is there a sufficient use case for such an option?

Look forward to further feedback, comments, patches, PR, for testing... thanks...

rysavyjan commented 5 years ago

Thank you for your ideas.

What I find interesting is that <w:t>Hello World!</w:t> or for example <w:t xml:space="preserve_DUMMY">Hello World!</w:t> is not divided into multiple rows.

So we are using the following "dirty" hack for now:

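# Temporarily rename the attribute value so tidy does not split the tag
sed -i 's/xml:space="preserve"/xml:space="preserve_HACK"/g' "${file}"
tidy -config "${CFG}" -output "${file}" "${file}"
# Restore the original attribute value
sed -i 's/xml:space="preserve_HACK"/xml:space="preserve"/g' "${file}"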

Maybe it could help others who might have the same problem...
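For anyone who would rather not edit the file in place three times, here is a minimal Python sketch of the same masking idea. The names `format_with_masking` and `run_tidy` are made up for illustration and are not part of tidy. `run_tidy` assumes a `tidy` binary on PATH that reads stdin and writes stdout, and, like the sed commands, the replacement only matches the double-quoted attribute form.

```python
import subprocess
from typing import Callable

REAL = 'xml:space="preserve"'
MASKED = 'xml:space="preserve_HACK"'  # same placeholder as in the sed commands


def format_with_masking(xml_text: str, formatter: Callable[[str], str]) -> str:
    """Hide xml:space="preserve" from the formatter, then restore it afterwards."""
    if MASKED in xml_text:
        # Refuse to continue: unmasking could not tell our placeholder from real data.
        raise ValueError("placeholder already present in input")
    masked = xml_text.replace(REAL, MASKED)
    return formatter(masked).replace(MASKED, REAL)


def run_tidy(xml_text: str, config_path: str) -> str:
    """Hypothetical helper: pipe text through tidy using the given config file.

    Assumption: `tidy` is installed and on PATH. The exit code is not checked
    here, since tidy also returns non-zero when it only emits warnings.
    """
    result = subprocess.run(
        ["tidy", "-config", config_path],
        input=xml_text,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

It could be used as `format_with_masking(text, lambda s: run_tidy(s, cfg))`, where `cfg` is the path to your tidy config. Passing the formatter as a callable keeps the masking logic testable without tidy installed.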

Regarding --vertical-space auto, it doesn't help: the whole XML output ends up on one line. From the Tidy documentation:

If set to auto Tidy will eliminate nearly all newline characters.

AnrDaemon commented 3 years ago

> not sure I understand what is the problem... please explain...

The problem is that xml:space=preserve tells the parser to preserve all whitespace, not just \x20. Tidy adds 2 extra whitespace characters (assuming eol=LF) to each space (<w:t xml:space="preserve"> </w:t>) produced by MS Word, for example. When an XML parser then reads the formatted document, each whitespace character is converted to a space and displayed as such. The end result is that the document is damaged and the end user has to use tricks to restore it to the desired form. Using the tidy option --new-inline-tags w:t does not help, and --new-pre-tags w:t makes it worse.

P.S. This is for tidy -xml -utf8 --config ~/tidyrc-xml.txt tidyrc-xml.txt
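To make the effect concrete, here is a small Python sketch using the standard library's ElementTree. It compares the text content of a single-space w:t element before and after line breaks are placed around its content. The namespace URI is a placeholder (not the real WordprocessingML one), and the "tidied" string is hand-written from the description above rather than actual tidy output.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace so the w: prefix parses; NOT the real WordprocessingML URI.
NS = 'xmlns:w="urn:example:w"'

original = f'<w:t {NS} xml:space="preserve"> </w:t>'
# Assumed shape of the reformatted element: content moved onto its own line (eol=LF).
tidied = f'<w:t {NS} xml:space="preserve">\n \n</w:t>'

orig_text = ET.fromstring(original).text
tidied_text = ET.fromstring(tidied).text

print(repr(orig_text))                    # ' '
print(repr(tidied_text))                  # '\n \n'
print(len(tidied_text) - len(orig_text))  # 2
```

The parser hands all three characters to the application as element content, so a consumer that honours xml:space="preserve" has no way to tell the added line breaks from whitespace the author intended.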

sfriesel commented 4 days ago

> It is my understanding that the attr xml:space="preserve" only preserves the spaces in the text, like say Hello World, and it seems tidy will do that...

Generic XML does not have specific rules for what xml:space="preserve" means. It is just an indicator for the processing application (https://www.w3.org/TR/xml/#sec-white-space). If tidy-html5 is only intended for HTML-like documents, then OP's issue is out of scope for this tool. Otherwise there should be an option/mode that leaves whitespace completely alone in xml:space="preserve" tags.
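As a rough illustration of what such a mode could look like, here is a Python sketch of an indenter that skips any subtree marked xml:space="preserve". This is not how tidy works internally and not a proposed patch. `indent_respecting_preserve` is a made-up name, and the sketch ignores the case where a descendant resets the attribute with xml:space="default".

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the predefined XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def indent_respecting_preserve(elem, level=0, step="  "):
    """Indent child elements in place, leaving xml:space="preserve" subtrees alone."""
    if elem.get(XML_SPACE) == "preserve":
        return  # do not touch text or descendants of a preserved element
    children = list(elem)
    if not children:
        return  # leaf element: never add whitespace around its text
    inner_pad = "\n" + step * (level + 1)
    closing_pad = "\n" + step * level
    # Only insert indentation where there is no meaningful text already.
    if not (elem.text or "").strip():
        elem.text = inner_pad
    for i, child in enumerate(children):
        indent_respecting_preserve(child, level + 1, step)
        if not (child.tail or "").strip():
            is_last = i == len(children) - 1
            child.tail = closing_pad if is_last else inner_pad
```

Calling it on a parsed w:p element indents the w:r children as usual, while the text inside each preserved w:t stays byte-for-byte as it was.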