htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

indent and vertical-space issues (was With indent:yes and wrap:0, I don't want line breaks before inline elements) #486

Open LWillms opened 7 years ago

LWillms commented 7 years ago

This is the output of tidy 5.2 with the config options "indent: yes" and "wrap: 0": (I have cut out only the beginning of the lines via Column Mode of Ultraedit, in order not to force a line wrap...)

`       <li>                                                       
         <h2>                                                     
           Erster Abschnitt: Ware und Geld                        
         </h2>                                                    
         <ol>                                                     
           <li>                                                   
             <a href="me23_049.htm">Erstes Kapitel. Die Ware</a>  
             <ol>                                                 
               <li>                                               
                 <a href="me23_049.htm#Kap_1_1">Die zwei Faktoren 
               </li>                                              
               <li>                                               
                 <a href="me23_049.htm#Kap_1_2">Doppelcharakter de
               </li>                                              
               <li>                                               
                 <a href="me23_049.htm#Kap_1_3">Die Wertform oder 
                 <ol type="A">                                    
                   <li>                                           
                     <a href="me23_049.htm#Kap_1_3_A">Einfache, ei
                     <ol>                                         
                       <li>                                       
                         <a href="me23_049.htm#Kap_1_3_A_1">Die be
                       </li>                                      
                       <li>                                       
                         <a href="me23_049.htm#Kap_1_3_A_2">Die re
                         <ol type="a">                            
                           <li>                                   
                             <a href="me23_049.htm#Kap_1_3_A_2_a">
                           </li>                                  
                           <li>                                   
                             <a href="me23_049.htm#Kap_1_3_A_2_b">
                           </li>                                  
                         </ol>                                    
                       </li>                                      
                       <li>                                       
                         <a href="me23_049.htm#Kap_1_3_A_3">Die äq
                       </li>                                      

and this is how I would like it to look:

  <li><h2>Erster Abschnitt: Ware und Geld</h2>                 
    <ol>                                                       
      <li><A href="me23_049.htm">Erstes Kapitel. Die Ware</A>  
        <ol>                                                   
          <li><A href="me23_049.htm#Kap_1_1">Die zwei Faktoren 
          <li><A href="me23_049.htm#Kap_1_2">Doppelcharakter de
          <li><A href="me23_049.htm#Kap_1_3">Die Wertform oder 
            <ol type=A>                                        
              <li><A href="me23_049.htm#Kap_1_3_A">Einfache, ei
                <ol>                                           
                  <li><A href="me23_049.htm#Kap_1_3_A_1">Die be
                  <li><A href="me23_049.htm#Kap_1_3_A_2">Die re
                    <ol type=a>                                
                      <li><A href="me23_049.htm#Kap_1_3_A_2_a">
                      <li><A href="me23_049.htm#Kap_1_3_A_2_b">
                    </ol>                                      
                  </li>                                        
                  <li><A href="me23_049.htm#Kap_1_3_A_3">Die äq
                  <li><A href="me23_049.htm#Kap_1_3_A_4">Das Ga
                </ol>                                          

This is from a table of contents to a document which spans 38 files (including this TOC file).

The same applies to the actual text pages where a typical paragraph as produced by tidy would look like this:

<p>
Looooooooooooooooooong text line - whole paragraph
</p>

whereas in my view it would look better and would be more maintainable like this:

<p>Looooooooooooooooooong text line - whole paragraph ..... </p>

To edit the text of such a paragraph, I can switch the editor to wrap mode, so that I can see the whole text in the editor window.

Is there a hidden (hidden for my eyes, that is) config option which lets me get this? Or is this a case for a feature request? If yes, should the way I want to have it be the default way? Should it be alterable, with a new config option, to the way it is now, or the other way round?

I could cope with any way, but to me it seems to be more logical to have it my way...

Cheers, L.W.

geoffmcl commented 7 years ago

LWillms thank you for your issue...

Yes, my preference too would be to sometimes, in some cases, have less vertical newline space, especially in lists and tables. As you suggest, keeping certain tags in a single line, especially if wrap:0, which in a way even implies this...

Some small steps were taken in this regard, and an option, --vertical-space auto/no/yes, was added. Take care: auto will eliminate nearly all vertical space, and yes will add some additional vertical space in some cases, so this does not help in this case...

And it is further complicated by the fact that adding the indent option -i also adds vertical space. I do not understand your second <p>Looong text...</p> example. Tidy does seem to keep that inline?

This has been a hot topic, mentioned in many issues. See #158, #163, #179, #189, #227, #228, and probably others...

So yes, this is a Feature Request, for Pretty Printing. If you, or others, want to present some ideas, as a PR, in a forked branch, I would certainly review, and consider it, but obviously as a new option, which should default no, at least initially... or as a level, like vertical-level:0-5, or whatever... thanks...

LWillms commented 7 years ago

First, on my initial post: <h1> through <h6> are, of course, not real inline elements, but they are mostly one-liners and typically do not include other elements. So I think they should be treated like <p> or <li> or similar. I did not mention those semantic structuring tags introduced by HTML 5, like <header>, <footer>, <article>, <aside>, <section>, and also the untyped <div>. With those, I think it would be sensible to have each on a line by itself.

As to Geoffmcl's question about the <p>, here is what the output of Tidy 5.2 with target HTML5 and vertical-space=no looks like (lines cut at column 59):

    <h3 name="Kap_FÜNFTES" id="Kap_FÜNFTES">               
      FÜNFTES KAPITEL: Ökonomie in der Anwendung des konsta
    </h3>                                                  
    <h4 class="c1" name="Kap_5_I" id="Kap_5_I">            
      I. Im allgemeinen                                    
    </h4>                                                  
    <p>                                                    
      <a class="SeiteZurueck" href="me25_080.htm#S86">&lt;<
    </p>                                                   
    <p>                                                    
      <a class="SeiteZurueck" href="#S87">&lt;</a><a class=
    </p>                                                   
    <p>                                                    
      Eine ganze Reihe laufender Unkosten bleibt sich beina
    </p>                                                   
    <p>                                                    
      "Die Betriebskosten einer Fabrik bei zehnstündiger Ar
    </p>                                                   
    <p>                                                    
      Staats- und Gemeindesteuern, Feuerversichrung, Lohn v
    </p>                                                   

I would like to see it this way:

    <h3 name="Kap_FÜNFTES" id="Kap_FÜNFTES">FÜNFTES KAPITEL: 
    <h4 class="c1" name="Kap_5_I" id="Kap_5_I">I. Im allgemei
    <p><a class="SeiteZurueck" href="me25_080.htm#S86">&lt;< 
    <p><a class="SeiteZurueck" href="#S87">&lt;</a><a class= 
    <p>Eine ganze Reihe laufender Unkosten bleibt sich beina 
    <p>"Die Betriebskosten einer Fabrik bei zehnstündiger Ar 
    <p>Staats- und Gemeindesteuern, Feuerversichrung, Lohn v 

The Tidy configuration is this:

clean: yes
indent: yes
wrap: 0
break-before-br: yes
output-html: yes
doctype: html5
input-encoding: latin1
output-encoding: utf8
output-bom: yes
new-inline-tags: math, mroot, mrow, mi, mn, mo, msqrt, mfrac, 
 msubsup, munderover, munder, mover, mmultiscripts, msup, msub, 
 mtext, mprescripts, mtable, mtr, mtd, mth

The effect of adding vertical-space: yes to the example from the initial post looks like this:

        <li>                                                        
          <h2>                                                      
            Erster Abschnitt: Ware und Geld                         
          </h2>                                                     

          <ol>                                                      
            <li>                                                    
              <a href="me23_049.htm">Erstes Kapitel. Die Ware</a>   
              <ol>                                                  
                <li>                                                
                  <a href="me23_049.htm#Kap_1_1">Die zwei Faktoren d
                </li>                                               

                <li>                                                
                  <a href="me23_049.htm#Kap_1_2">Doppelcharakter der
                </li>                                               

                <li>                                                
                  <a href="me23_049.htm#Kap_1_3">Die Wertform oder d
                  <ol type="A">                                     
                    <li>                                            
                      <a href="me23_049.htm#Kap_1_3_A">Einfache, ein
                      <ol>                                          
                        <li>                                        
                          <a href="me23_049.htm#Kap_1_3_A_1">Die bei
                        </li>                                       

                        <li>                                        
                          <a href="me23_049.htm#Kap_1_3_A_2">Die rel
                          <ol type="a">                             
                            <li>                                    
                              <a href="me23_049.htm#Kap_1_3_A_2_a">G
                            </li>                                   

                            <li>                                    
                              <a href="me23_049.htm#Kap_1_3_A_2_b">Q
                            </li>                                   
                          </ol>                                     
                        </li>                                       

                        <li>                                        
                          <a href="me23_049.htm#Kap_1_3_A_3">Die äqu
                        </li>                                       

As one can see, there is a sense of the grouping I would like to see.

geoffmcl commented 7 years ago

@LWillms thanks for the additional comment...

Could you please avoid truncating the lines... they become useless as sample html to test...

You do know you can drag and drop files here, and github will upload the data, and present a link pointer for downloading, so we can get the complete files for testing...

Then do the same with the current tidy output, manually re-lining it to what you would like to see... and we have full target files to match against...

Yes, I do think I see the grouping you would like... and this can be done by modification of the pprint.c module...

All we need is some C programmer, with the time, to take on the task, and present patches or a PR... I will help where I can... thanks...

balthisar commented 7 years ago

I'll take this up, but only if we start a 5.5 release (5.4 should be released soon, and there are some smaller issues that can still be addressed in order to release a 5.4). I would propose breaking past behavior and redefining what TidyIndent and TidyVertSpace do, and probably rolling TidyBreakBeforeBR into a new set of configuration options that give better granularity over vertical space.

Feedback is appreciated, especially from people who depend on current behavior, because this is a current behavior breaker.

LWillms commented 7 years ago

Balthisar's proposals seem sensible to me, although it is not 100% clear what is meant by the "fine grained control". I guess it means that, e.g., VERTICAL-SPACE-HX = YES would put a line feed after both the <Hx> and the </Hx>, as it is now in 5.2, whereas with VERTICAL-SPACE-HX = NO the <Hx> and the </Hx>, with the actual header in between, would appear on one line, with a line feed only after the </Hx>. Right? Similarly with VERTICAL-SPACE-P: YES would mean that Tidy works as it does today, whereas NO would put the <P>, the </P> and the actual paragraph text in between on a single line, i.e. with a line feed only after the </P>. Right? Whether the default is YES or NO does not matter much to me as long as I can override it with a configuration option.

BTW, with the TidyBreakBeforeBR I had an issue -- I tried it, and got the <BR> in front, but a line break after it -- I guess that Balthisar's proposed replacement VERTICAL-SPACE-BR would enable me to get the <BR> in front (left aligned with the text it breaks) but then followed by the rest of the text instead of a line break.

As to geoffmcl's request to upload some actual files instead of just showing the beginning of the lines: "me23_000.htm" is meant to replace http://www.mlwerke.de/me/me23/me23_000.htm but is not yet ready to do so; I need to work on the CSS in order to differentiate the nesting levels not only by their indentation, but also by font, font size, weight, decoration etc. me23_033.htm is a relatively short text (foreword to the 2nd German edition). I will try to upload both the intermediary version which I had produced with VB Script and regular expressions and what Tidy 5.2 made of it.


Tried, but it did not work.

LWillms commented 7 years ago

I'll try again the uploads...

It won't let me: "We don’t support that file type" -- TXT and ZIP file...

balthisar commented 7 years ago

@LWillms, your understanding is correct.

LWillms commented 7 years ago

Since the upload of files did not work, here some examples from the live Web site:

  1. a table of contents with definition list (DL/DT/DD) 7 kB
  2. a text with paragraphs and MathML 24 kB

Both formatted manually

I will still try to upload the larger TOC file with nested UL and OL.

LWillms commented 7 years ago

Here is a file with a nested list of <OL> and <UL>

produced by Tidy 5.2, i.e. not really in the form I like.

geoffmcl commented 7 years ago

@LWillms while it seems generally agreed by some Tidy could do with some TLC in the Pretty Printing vertical spacing department, it still requires someone to actually do the coding... Have not seen that yet...

The biggest problem seems to be that each have their personal preference... I personally generally have no problem with the current output, so that coder is unlikely to be me...

LWillms commented 7 years ago

I thought that @balthisar would be the one to do it, after all the valuable proposals he had made, especially since it was marked as scheduled for the 5.5 milestone.

Well, I could do with some postprocessing using RegExp, like this one for the <p>: search for <p>\r\n +(.*)\r\n\s+</p>\r\n and replace with <p>\1</p> (the \1 might be special Ultraedit syntax).

With classes and the like in the p tag, it gets more hairy.

But getting it done by Tidy right away would be better.
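The <p> search-and-replace mentioned above could be sketched in Python roughly as follows. This is only an illustration, not anything Tidy provides: the names P_BLOCK and join_paragraphs are made up here, and the pattern assumes Tidy's indented output where the opening tag, one line of content, and the closing tag each sit on their own line. Attributes such as class are carried along by matching everything up to the closing ">" of the opening tag.

```python
import re

# Hypothetical helper names; the pattern assumes the three-line layout that
# Tidy produces with "indent: yes": <p ...>, one content line, then </p>.
P_BLOCK = re.compile(
    r"(<p\b[^>]*>)[ \t]*\r?\n"   # opening tag, with any attributes, then EOL
    r"[ \t]*(.*?)[ \t]*\r?\n"    # exactly one line of content, trimmed
    r"[ \t]*(</p>)",             # closing tag on the following line
    re.IGNORECASE,
)


def join_paragraphs(html: str) -> str:
    """Put <p>, its single content line, and </p> on one line."""
    return P_BLOCK.sub(r"\1\2\3", html)
```

Paragraphs whose content spans more than one line (for example after a <br> with break-before-br) do not match and are left untouched, which seems the safer default for a regex-based pass.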

geoffmcl commented 7 years ago

@LWillms I guess I really should try to stay out of this... since my strong view on backward compatibility is well known... early tidy loved vertical space...

First, @balthisar's proposal starts with "propose breaking past behavior", and ends with "a current behavior breaker"...

My reply to that is simply: why does it have to be that way? Why can't we keep the current behavior, for the perhaps many people who, over some 17 years, have put their config files in place... why break it for them?

I see nothing wrong with enhancing the behavior, with new opt-in options, that create a different output... that seems good... go for it... would help...

Then, although I have read this many times, I am still unclear of exactly what you want, prefer!

And as you point out it is not 100% clear what is meant by the "fine grained control"... are we going to have a new option for just about every element? What is included, excluded? What does a yes mean, what does a no mean...

Unfortunately, the large sample files you gave seem to be confusing... it seems for sure you want <li>....</li>, on the same line, but not if it contains other <ol> elements... similarly for many other elements, like <p>, <hN>, etc... but you do seem to want <div> separated by even an extra line...

Now that is your preference, no problem, and maybe having new options to do this would be good... and if @balthisar wants to take this on, then that too is good...

But on the other hand, I have no doubt others feel differently, including me...

So while there is a glimmer of an idea here to offer more fine-grained control over vertical space, it seems far from clear what exactly are we talking about...

Assigning a milestone says nothing about when it will be done... and it will be moved out if another release approaches and there is no agreed, coded, tested solution...

Anyway, as stated, I will try to stay out of this preferences battle, so long as the current behavior is not broken... thanks...

LWillms commented 7 years ago

"it seems for sure you want <li>....</li>, on the same line, but not if it contains other <ol> elements... similarly for many other elements, like <p>, <hN>, etc... but you do seem to want <div> separated by even an extra line..."

On <li> ... </li> ... well, I want a line break before and after an <ol> or <ul>, and the same before and after the closing </ol> or </ul>, because that wraps a number of 1 to n <li> ... </li> elements, which could again contain <ol> or <ul> groups ... the case of nested lists, like Russian dolls, one inside the other.

Lists are by their nature recursive objects. I don't see that for paragraphs enclosed in <p> ... </p> pairs, or for headers h1 to h6. So I would like to see the <li> ... </li>, <p>...</p> and also the <hn>...</hn> tags on one and the same line with the text which they enclose, and a line break before and after.

OTOH, <div> ...</div> are block elements and typically recursive, as are many of the semantic tags introduced with HTML5.

Or, what one might call text structuring tags.

BTW, for my self-written postprocessing as mentioned above, I found this among other "famous Python one-liners": tidy.exe | python -c "import sys,re;[sys.stdout.write(re.sub('PATTERN', 'SUBSTITUTION', line)) for line in sys.stdin]" for which I still have to work out the actual PATTERN and SUBSTITUTION formulas. I am not really at home with Python and not so fluent in "Regular Expressions".

PS: the "tidy.exe" in the above is already my substitution for [another command].
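One caveat with the one-liner quoted above is that it applies re.sub to each line separately, so a pattern that has to span the line breaks between an opening tag, its content and its closing tag can never match. A minimal sketch that reads the whole input at once might look like this; the names ONE_LINER, compact and filter_stream are hypothetical, and the choice of p, li and h1 to h6 as the elements to collapse is only an assumption based on the preferences described in this thread.

```python
import re
import sys

# Hypothetical pattern: an opening <p>, <li> or <h1>..<h6> tag on its own
# line, one content line, and the matching closing tag on the next line.
ONE_LINER = re.compile(
    r"<(p|li|h[1-6])\b([^>]*)>[ \t]*\r?\n"  # opening tag plus attributes
    r"[ \t]*(.*?)[ \t]*\r?\n"               # a single line of content
    r"[ \t]*</\1>",                         # closing tag of the same element
    re.IGNORECASE,
)


def compact(html: str) -> str:
    """Collapse three-line p/li/hN elements onto a single line."""
    return ONE_LINER.sub(r"<\1\2>\3</\1>", html)


def filter_stream(src, dst) -> None:
    """Read all of src, so that patterns can span lines, and write to dst."""
    dst.write(compact(src.read()))
```

Saved as a script that ends with filter_stream(sys.stdin, sys.stdout), this could sit at the end of a pipe after Tidy. A list item that contains a nested <ol> or <ul> is not matched, because the line after its first content line is not </li>, so only its inner items are collapsed.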

geoffmcl commented 7 years ago

@LWillms thanks for the further feedback... this is helping, me at least, to more understand your preference...

As stated it might really help if you construct some small samples to be used in testing... like I tried -

Input:

  <ul>
    <li>
      <a href="#one">List 1</a>
    </li>
    <ol>
      <li>
        <a href="#ord1">Ord 1</a>
      </li>
      <li>
        <a href="#ord2">Ord 2</a>
      </li>
    </ol>
    <li>
      <a href="#two">List 2</a>
    </li>
    <ol>
      <li>
        <a href="#ord3">ord 1</a>
      </li>
      <li>
        <a href="#ord4">ord 2</a>
      </li>
    </ol>
  </ul>

And using tidy default, with --show-body-only yes, and maybe with -w 0, if needed, due to length, I get the <li> output in one line -

Output:

<ul>
<li><a href="#one">List 1</a></li>
<ol>
<li><a href="#ord1">Ord 1</a></li>
<li><a href="#ord2">Ord 2</a></li>
</ol>
<li><a href="#two">List 2</a></li>
<ol>
<li><a href="#ord3">ord 1</a></li>
<li><a href="#ord4">ord 2</a></li>
</ol>
</ul>

And I do find it strange that just adding the -i, indent option will re-produce the multi-lined Input: version... hmmm...

But what this shows is that it would be relatively easy to add another option, of some name or another, to stop that newline being added to the <a ...> tags, if indent on... But as stated this would have to be an opt-in option...

That is allow others to have their current preference, and not break current behavior...

After all it makes no difference to browser rendering... just in an editor view of the source... so it is not a bug... just a preference... which I am sure others might be interested in...

So I am sure it would help to set up small samples, showing input, current output with options used, and expected output... for each situation... it will take longer if you leave it to the code developer to do this...

Yes, you may try to describe it using semantics like block, inline, etc, but it is difficult to agree on what these words actually mean... especially when for example <p>, <h1-6>, ... are often described as block elements, yet you seem to want them to be sort of output as a single line...

No particular idea about text structuring tags... except what I can read in a google search...

Concerning self-written post-processing, I would probably use Perl, but that is just because I am more familiar with it than Python... but I do not think it would be a one-liner ;=))

And in what you showed, I think the other command would be more like tidy -q [options] file.html | python ... the -q, quiet option, stops tidy outputting to errout, which should not be piped to python...
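For anyone scripting this instead of using a shell pipe, here is a small sketch of calling Tidy from Python with the options discussed in this thread. It assumes a tidy executable is on the PATH; the function names tidy_command and run_tidy are made up for illustration. Only options already mentioned here are used: -q, -i, -w and --show-body-only.

```python
import subprocess


def tidy_command(path, indent=True, wrap=0, body_only=False):
    """Build the argument list for a quiet Tidy run (hypothetical helper)."""
    cmd = ["tidy", "-q"]              # -q: suppress non-essential messages
    if indent:
        cmd.append("-i")              # -i: indent element content
    cmd += ["-w", str(wrap)]          # -w 0: do not wrap long lines
    if body_only:
        cmd += ["--show-body-only", "yes"]
    cmd.append(path)
    return cmd


def run_tidy(path, **options):
    """Run Tidy and return the tidied markup from its standard output."""
    # Tidy exits with 1 when it has warnings and 2 on errors, so a non-zero
    # status is not treated as a failure here (no check=True).
    result = subprocess.run(
        tidy_command(path, **options),
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return result.stdout
```

Capturing standard output and standard error separately keeps diagnostics out of whatever post-processing follows, whether or not -q is given.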

Good luck with this... will help where I can... thanks...

LWillms commented 7 years ago

"And I do find it strange that just adding the -i, indent option will re-produce the multi-lined Input: version... hmmm..."

Yeah, there are lots of unexpected side-effects.

The options have grown over the years, and it is quite a tangled shrubbery.

BTW, I don't know if HTML allows a <ul> or <ol> directly under a <ul> or <ol>; I think that this should rather be flagged as wrong. But I know that for a nested list, it would have to be rather like in my contribution 486#issue-205264448

Within the list items, i.e. inside of the <li> ... </li> one can have other lists declared by <ol> ... </ol>.

BTW, @geoffmcl , in which time zone do you reside? I'm on CET for Central European Time, currently with DST

LWillms commented 7 years ago

What I called "text structuring tags" may be better called "content structuring tags", tags like header, footer, article, section, aside, nav, etc, which were introduced with HTML5.

As a sample, I had proposed in an earlier contribution this TOC page at http://www.mlwerke.de/beb/beaa/beaa_000.htm, which I have edited somewhat to make it look more like I want; I also corrected the error in the reference to the CSS stylesheet. In its present state it has 237 lines of text in 27,989 bytes.

It has one <ul> and one top level <ol>, nested 3 levels deep in 30 chapters with several sub-chapters each, grouped in 5 sections in the second level. You may, of course, delete 3 of the 5 sections, and in the sections all chapters beyond 2.

As to Python vs. Perl ... "intuitive" is what one is accustomed to, and as in the song "I say tomato, you say tomahto" etc.

geoffmcl commented 7 years ago

@LWillms I live a little south of Paris, in Antony, France, which I just added to my profile, which is presently on CEST, or CEDT or ECST... which is probably the same as you, with summertime, DST...

OT: I understand, during the war, France took on German time, and after it was finished, never reverted. Longitude-wise, France, and probably Spain, should probably be on UTC/GMT time, but then there is the interesting relationship between France and England, which probably does not help this case... and even more now with Brexit ;=))

I think you are correct that say a <ul> can not be the direct descendant of say an <ol>, and vice versa, and etc... these should be an error, at least according to the W3C validator... but they can be descendants of a <li>... ie nested lists...

Have not yet found the direct W3C docs/recs on this, but any pointers would be appreciated...

At present tidy does not bark at this! If you want to file a separate issue on this, with a small sample, that would be great... it is probably a bug, once the recs are found...

As far as this Pretty Print preference issue is concerned, you have not added much, except another link to a large file, which, as I have tried to indicate, is nearly useless as a testing sample...

But we are still waiting for @balthisar, or other coders, interested enough to step into the tangled shrubbery of options, LOL, and suggest a way forward... maybe with new opt-in options, but no breakage... thanks...

LWillms commented 7 years ago

So, on the RER B line, and near Orly... I thought you were somewhere between California and Hawaii because of the little overlap in our hours of activity. I agree that France and even more Spain should be in Western European Time, but such a change would be highly politicised and impossible today. I wonder why the so nationalist General de Gaulle did not order a return to the normal time for France right after the previous war. But maybe he wanted to prove his independence from England that time.

Coming back to our business. You wrote:

I think you are correct that say a <ul> can not be the direct descendant of say an <ol>, and vice versa, and etc... these should be an error, at least according to the W3C validator... but they can be descendants of a <li>... ie nested lists...

I also could not find any clear answer at w3schools.com or w3.com to the question of whether a list grouping tag (<menu>, <ol>, <ul>) may have any child element other than <li>, but anything else doesn't make sense to me.

And a German book on HTML5 says so in its list of HTML tags: it gives as child elements to the three list grouping tags (<menu>, <ol>, <ul>) only <li>, and for <li> as the only possible parents those three list grouping tags. [Thomas Kobert: HTML 5. bhv, 2013. ISBN 978-3-8266-8187-5]
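To produce a small test case for this, a rough checker along the following lines could list every element that sits directly under a list grouping tag without being an <li>. It is only a sketch built on Python's standard html.parser, and the names ListChildChecker and find_bad_list_children are invented here. As far as I know, the HTML standard additionally permits script-supporting elements (script and template) as children of <ol> and <ul>, so those are allowed below.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol", "menu"}
ALLOWED_CHILDREN = {"li", "script", "template"}
# Void elements never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ListChildChecker(HTMLParser):
    """Record elements that are direct children of a list but are not <li>."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements, innermost last
        self.problems = []   # (line, parent tag, offending child tag)

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        if parent in LIST_TAGS and tag not in ALLOWED_CHILDREN:
            line, _ = self.getpos()
            self.problems.append((line, parent, tag))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating omitted end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def find_bad_list_children(html):
    checker = ListChildChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

This only looks at element children; stray text directly inside a list is not reported, and the stack handling is far simpler than a real HTML parser's tree construction.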

Concerning the sample file, I once tried to put something like it here in this thread on Github, but the system refused. And the one which I pointed to is publicly accessible on the Web, and does work. Anybody can reduce it by deleting lines and leaving at most two items in each hierarchical level.

LWillms commented 7 years ago

OK, I have reduced the file in question, and try to upload it here. Doesn't work at all. Github refuses anything which only remotely smells like HTML, even as TXT and compressed in a ZIP. I'll try to paste it here:

<!DOCTYPE html>
<html>
  <head>
    <meta name="generator" content="HTML Tidy for HTML5 for Windows version 5.2.0">
    <title>Test file for nested lists - derived from http://www.mlwerke.de/beb/beaa/beaa_000.htm</title>
    <!-- 
    <link rel="stylesheet" type="text/css" href="../../css/inhalt.css">
    -->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>

  <body>
    <h1 class="c3">Testing nested lists</h1>
    <ul>
      <li><a href="test_001.htm">Vorrede zur fünfundzwanzigsten Auflage</a></li>
      <li><a href="test_019.htm">Vorrede zur vierunddreißigsten Auflage</a></li>
    </ul>
    <ol class="c5" start="0">
      <li><a href="test_025.htm">Einleitung</a></li>
      <li><h3>Erster Abschnitt
              <br>FIRST SECTION</h3>
        <ol class="c5" start="1">
          <li><a href="test_035.htm">Erstes Kapitel. Die Stellung der Frau in der Urgesellschaft</a>
            <ol>
              <li><a href="test_035.htm">Hauptepochen der Urgeschichte</a></li>
              <li><a href="test_035.htm#Kap_1_2">Formen der Familie</a></li>
            </ol>
          </li>
          <li><a href="test_054.htm">Zweites Kapitel. Kampf zwischen Mutterrecht und Vaterrecht</a>
            <ol>
              <li><a href="test_054.htm">Das Aufkommen des Vaterrechts</a></li>
              <li><a href="test_054.htm#Kap_2_2">Anklänge an das Mutterrecht in griechischen Mythen und Dramen</a></li>
            </ol>
          </li>
          <li><a href="test_082.htm">Drittes Kapitel. Das Christentum</a></li>
        </ol>
      </li>
      <li><h3>Zweiter Abschnitt<br>
          SECOND SECTION</h3>
        <ol class="c5" start="7">
          <li><a href="test_125.htm">Siebentes Kapitel. Die Frau als Geschlechtswesen</a>
            <ol>
              <li><a href="test_125.htm">Der Geschlechtstrieb</a></li>
              <li><a href="test_125.htm#Kap_7_2">Ehelosigkeit und Selbstmordhäufigkeit</a></li>
            </ol>
          </li>
          <li><a href="test_134.htm">Achtes Kapitel. Die moderne Ehe</a>
            <ol>
              <li><a href="test_134.htm">Die Ehe als Beruf</a></li>
              <li><a href="test_134.htm#Kap_8_2">Der Rückgang der Geburten</a></li>
            </ol>
          </li>
        </ol>
      </li>
    </ol>
    <hr>
    <p>Final remarks, out of list</p>
  </body>
</html>

`

OK, here it goes.

LWillms commented 7 years ago

Pretty printing is of course only for human consumption and does not (or should not...) modify the syntactical correctness of an HTML document. The best for the Web browser is actually having the HTML document as a single byte stream, not interrupted by line breaks, as currently is produced by Tidy with the option vertical-space: auto.

Those who edit HTML files not "by hand" but by some WYSIWYG HTML editor which works more like a compiler actually hiding what they consider "machine code", don't need prettyprinting of HTML files. Example: MS Word for Windows saving a document in HTML instead of Microsoft's current file format for office documents.

Or think they don't -- but sometime one has to inspect the resulting code, if the suspicion arises that the HTML editor might have produced not what the author intended.

I for my part like to work directly on the HTML code, helped, if possible, by HTML sensitive editors providing syntax highlighting and bracket grouping, possibly automatic syntax completion and drop down menus of possible or allowed next items.

I want my HTML code very compact, with <p>, <li> and <Hn> elements each on one line, and indented so that I can see the structure of the document. At the same time, I use editors like Ultraedit or (temporarily) also PSPad, which fold those long lines on the CTRL-W command, so that I have all their text in the window and can easily change words even at the very end of such a long <P> or <li> or any other such element (<td>, <dt>, <dl> come to my mind).

The above is not exactly how Tidy did produce it. I have edited the file to reflect my style of an HTML file.

LWillms commented 7 years ago

This is what Tidy makes out of the above in using this config file:

output-html: yes
doctype: html5
input-encoding: latin1
output-encoding: utf8
output-bom: yes
clean: yes
indent: yes
wrap: 0
vertical-space: no
new-inline-tags: math, mroot, mrow, mi, mn, mo, msqrt, mfrac, 
 msubsup, munderover, munder, mover, mmultiscripts, msup, msub, 
 mtext, mprescripts, mtable, mtr, mtd, mth
<!DOCTYPE html>
<html>
  <head>
    <meta name="generator" content="HTML Tidy for HTML5 for Windows version 5.5.31.w32-vc10">
    <title>
      Test file for nested lists - derived from http://www.mlwerke.de/beb/beaa/beaa_000.htm
    </title><!-- 
    <link rel="stylesheet" type="text/css" href="../../css/inhalt.css">
    -->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <h1 class="c3">
      Testing nested lists
    </h1>
    <ul>
      <li>
        <a href="test_001.htm">Vorrede zur fünfundzwanzigsten Auflage</a>
      </li>
      <li>
        <a href="test_019.htm">Vorrede zur vierunddreißigsten Auflage</a>
      </li>
    </ul>
    <ol class="c5" start="0">
      <li>
        <a href="test_025.htm">Einleitung</a>
      </li>
      <li>
        <h3>
          Erster Abschnitt<br>
          FIRST SECTION
        </h3>
        <ol class="c5" start="1">
          <li>
            <a href="test_035.htm">Erstes Kapitel. Die Stellung der Frau in der Urgesellschaft</a>
            <ol>
              <li>
                <a href="test_035.htm">Hauptepochen der Urgeschichte</a>
              </li>
              <li>
                <a href="test_035.htm#Kap_1_2">Formen der Familie</a>
              </li>
            </ol>
          </li>
          <li>
            <a href="test_054.htm">Zweites Kapitel. Kampf zwischen Mutterrecht und Vaterrecht</a>
            <ol>
              <li>
                <a href="test_054.htm">Das Aufkommen des Vaterrechts</a>
              </li>
              <li>
                <a href="test_054.htm#Kap_2_2">Anklänge an das Mutterrecht in griechischen Mythen und Dramen</a>
              </li>
            </ol>
          </li>
          <li>
            <a href="test_082.htm">Drittes Kapitel. Das Christentum</a>
          </li>
        </ol>
      </li>
      <li>
        <h3>
          Zweiter Abschnitt<br>
          SECOND SECTION
        </h3>
        <ol class="c5" start="7">
          <li>
            <a href="test_125.htm">Siebentes Kapitel. Die Frau als Geschlechtswesen</a>
            <ol>
              <li>
                <a href="test_125.htm">Der Geschlechtstrieb</a>
              </li>
              <li>
                <a href="test_125.htm#Kap_7_2">Ehelosigkeit und Selbstmordhäufigkeit</a>
              </li>
            </ol>
          </li>
          <li>
            <a href="test_134.htm">Achtes Kapitel. Die moderne Ehe</a>
            <ol>
              <li>
                <a href="test_134.htm">Die Ehe als Beruf</a>
              </li>
              <li>
                <a href="test_134.htm#Kap_8_2">Der Rückgang der Geburten</a>
              </li>
            </ol>
          </li>
        </ol>
      </li>
    </ol>
    <hr>
    <p>
      Final remarks, out of list
    </p>
  </body>
</html>

The indentation is the same, but with each <li>...</li> occupying three lines instead of one, it is not so easy to read.
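Until Tidy offers such an option, the requested layout can be approximated by post-processing Tidy's output. The following is only a minimal sketch of that idea in Python, not a Tidy feature: the function name `compact` and the tag list are assumptions chosen for illustration, and a regular expression is no substitute for a real HTML parser. It joins an opening tag, exactly one content line and the closing tag into a single line, keeping the indentation.

```python
import re

# Block-level tags to pull onto one line. This list is an assumption
# made for illustration; it is not a Tidy setting.
COMPACT_TAGS = "li|p|h[1-6]|title"

# Matches an indented opening tag alone on its line, exactly one content
# line, and the matching closing tag alone on its line.
PATTERN = re.compile(
    r"^([ \t]*)<(" + COMPACT_TAGS + r")\b([^>\n]*)>[ \t]*\n"
    r"[ \t]*(\S[^\n]*?)[ \t]*\n"
    r"[ \t]*</\2>[ \t]*$",
    re.MULTILINE,
)


def compact(html: str) -> str:
    """Join '<li>', one content line and '</li>' into a single line."""
    return PATTERN.sub(r"\1<\2\3>\4</\2>", html)


if __name__ == "__main__":
    sample = (
        "      <li>\n"
        '        <a href="test_025.htm">Einleitung</a>\n'
        "      </li>\n"
    )
    print(compact(sample), end="")
```

Elements whose content spans more than one line, such as the <h3> with the <br> above or an <li> that also contains a nested <ol>, are left as Tidy wrote them.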

geoffmcl commented 7 years ago

@LWillms wow, my inbox is flooded with emails - 23:21, 23:23, 23:24, 23:26, 23:26, 23:28 - What are you doing?

All seemingly a repeat... Please try to avoid that...

Ok, we get that your preference is for an <li>...</li> to be output on one line, even if -i is used... we get that, we understand! And a very simple sample of a few lines would show that... which, as you have now learned, can be pasted inside three back ticks in the github markdown system...

Read "Styling with Markdown is supported"... or other references...

So, I am sorry, nothing new added here, except more posts... more noise... no problem...

Also, please understand, your first post is what reaches our inbox, to alert us there is a post, but subsequent edits to that post are not dispatched... the only way we see the final version is by going to the issue directly...

It is preferred you prepare your posts in an editor, like me, then copy-paste it to issues, and use preview to read, see how it will look... sometimes several cycles until I get it right...

Of course you do now reveal your config used, and I have several comments on that -

output-html: yes # is the default, not needed
doctype: html5  # is the default, not needed
input-encoding: latin1  # ok, interesting...
output-encoding: utf8   # is the default, not needed
output-bom: yes # wow, really! With output utf-8 no BOM needed, but ok!
clean: yes  # ok
indent: yes # ok
wrap: 0     # ok
vertical-space: no  # is the default, not needed
new-inline-tags: math, mroot, mrow, mi, mn, mo, msqrt, mfrac, 
 msubsup, munderover, munder, mover, mmultiscripts, msup, msub, 
 mtext, mprescripts, mtable, mtr, mtd, mth  # the sample has none of these, but I think most now supported by default, but ok...

Moving forward... thanks...

LWillms commented 7 years ago

Sorry about your inbox... but this Github system is quite strange. I hit the green "Comment" button, and nothing happened. My comment did not appear. So I tried again, and again, until giving up. Later I saw on my tablet that several instances of my comment were actually there, so I went back to my desktop, found this time all those copies, and deleted all of them except the last one.

Wordpress does detect such a problem and warns the user that he/she apparently is sending a duplicate.

On the config file: the BOM is needed for the old version of Ultraedit and maybe some other editors so that they recognize the file as being UTF-8 encoded.

For the rest, I like to have the options spelled out. It's in the file which is invoked on every run of Tidy, so it does not get in the way of calling Tidy. I had the new options in there which Balthisar proposed, but Tidy flagged them as unknown options.

Can I mark lines in the config file as comments? I tried the semicolon ";", but that was flagged as wrong too.

It is a pity that the option "input-encoding" is hidden deep in the documentation of the config options, and not shown on the first level of help. Converting non-UTF8 texts to UTF8 does not work correctly if I simply tell Tidy that we are dealing with UTF-8: Tidy then expects the input to be in UTF8, too, and produces wrong output.
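The mechanism behind this can be shown without Tidy. The snippet below is only an illustration in Python, not Tidy's own code: it takes a phrase from the sample file, stores it as Latin-1 bytes, and then reads those bytes once with the wrong and once with the right input encoding.

```python
text = "Der Rückgang der Geburten"

# How the old file is stored on disk: "ü" is the single byte 0xFC.
latin1_bytes = text.encode("latin-1")

# Wrong: treating the Latin-1 bytes as UTF-8. The byte 0xFC is not valid
# UTF-8, so the decoder can only substitute a replacement character.
misread = latin1_bytes.decode("utf-8", errors="replace")

# Right: decode with the real input encoding, then encode as UTF-8.
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")

print(misread)
print(utf8_bytes.decode("utf-8"))
```

This is why the input encoding has to be stated separately from the output encoding when old Latin-1 documents are upgraded.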

geoffmcl commented 7 years ago

@LWillms yes, I too, on very rare occasions, can get a 10-30 second delay, before my comment appears - almost as if github.com server is asleep! - so one must learn to be patient... but no problem... you did clean up the list... thanks...

For some strange historic reason the comment lines in config commence with //... you can see this in tidy_config... scroll down to the Sample... which looks like the sample you started with...

In reading the source config.c it also skips lines commencing with #, which seems undocumented as far as I can find...
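As a rough model of what this means for a config file, here is a small Python sketch. It is not Tidy's parser: the function `parse_config` is hypothetical, and the rules it applies (skip lines starting with // or #, treat a line starting with whitespace as a continuation of the previous option) are taken only from what is described and shown in this thread.

```python
def parse_config(text: str) -> dict[str, str]:
    """Parse 'name: value' lines, skipping comment lines (rough model only)."""
    options: dict[str, str] = {}
    last_key = None
    for raw in text.splitlines():
        stripped = raw.strip()
        # Blank lines and comment lines are ignored.
        if not stripped or stripped.startswith(("//", "#")):
            continue
        # A line starting with whitespace continues the previous value.
        if raw[0] in " \t" and last_key is not None:
            options[last_key] += " " + stripped
            continue
        key, _, value = stripped.partition(":")
        last_key = key.strip()
        options[last_key] = value.strip()
    return options


SAMPLE = """\
// comment lines start with a double slash
# a leading hash is skipped too (seen in config.c, undocumented)
indent: yes
wrap: 0
new-inline-tags: math, mroot,
 mrow, mi
"""

if __name__ == "__main__":
    print(parse_config(SAMPLE))
```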

Yes, you may add as many config items as you want... I even had one person recently suggest having a sample with them all, as a sort of reminder... but rejected that, as more upkeep... and you can get a current alphabetic list with -show-config...

I was merely trying to indicate that default items are not needed, nothing more... just trying to be helpful...

And glad you now understand tidy can have different input and output encoding... the default being utf-8...

The choice of what is in the so-called first level of -h is always difficult, and I generally have a copy of the -help-config output handy, and use the quickref frequently... it is always a learning process, even for me...

HTH

LWillms commented 7 years ago

Thanks for the comment line marker with the double slash. // is also valid in some programming languages.

BTW, having config options in the config file with their default values is also useful for testing the result of different settings. One has to change only the value of the option, not type it in completely. And I keep permanently in the config file those options which I might possibly change.

Being able to specify the input-encoding is very important when using Tidy to upgrade old HTML documents, as I currently do. In the Tidy versions up to 2009, I could never find that and because of that was hampered in doing the necessary upgrades.

geoffmcl commented 7 years ago

@LWillms as you may know, with current Tidy, the option -show-config will output the default option list, as did the Tidy version 2009...

And it was interesting to compare the input, output encoding change -

2009:
input-encoding              Encoding   latin1
output-encoding             Encoding   ascii
2017:
input-encoding              Encoding   utf8                                    
output-encoding             Encoding   utf8

With a little bit of perl scripting, I was able to massage this -show-config list easily into a full default config file for tidy... could post that tidy-conf.pl somewhere, if interested... in fact just added it to http://geoffair.org/tmp/tidy-conf.pl.zip ...

Usage:

tidy -show-config > temptidy.conf
perl -f tidy-conf.pl temptidy.conf

And it should write a full tidy-def.conf file locally...
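The perl script itself is not shown in this thread, so the following is not that script, only a sketch of the same idea in Python. It assumes each option row of the -show-config listing has three whitespace-separated columns (name, type, value), as in the lines quoted above, and that headings and rulers do not start with a lowercase letter; the function name `show_config_to_conf` is made up for this example.

```python
def show_config_to_conf(listing: str) -> str:
    """Turn 'name  type  value' rows into 'name: value' config lines."""
    lines = []
    for row in listing.splitlines():
        # name, type, value; the value itself may contain spaces
        parts = row.split(None, 2)
        # Skip headings, rulers and options with an empty value.
        if len(parts) < 3 or not parts[0][0].islower():
            continue
        name, _type, value = parts
        lines.append(f"{name}: {value.strip()}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    listing = (
        "input-encoding              Encoding   utf8\n"
        "output-encoding             Encoding   utf8\n"
    )
    print(show_config_to_conf(listing), end="")
```

Values in a real listing may need cleaning up before Tidy accepts them in a config file, so the result should be checked before it is saved as tidy-def.conf.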

Perhaps you already know, you can use the environment variable HTML_TIDY to always load such a default config... before the command line is processed... so it can be over-ridden on the command line, or in another -config new.conf option...

set HTML_TIDY=/path/to/tidy-def.conf

There are so many ways of using granddaddy tidy ;=)) but we are getting way off the topic of this issue...

Who is interested in looking at this Pretty Print preference issue? Thanks

geoffmcl commented 6 years ago

Although there have been no further comments in many months, and no one has stepped up to do the coding, maybe this is still open, so moving out the milestone...