htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org
2.71k stars 418 forks source link

Wrap doesn't align attributes #508

Open allsey87 opened 7 years ago

allsey87 commented 7 years ago

I'm not sure if this is a bug or a feature request but currently if I have the following tag:

<tag attributea="valuea" attributeb="valueb" attributec="valuec"></tag>

It will wrap as follows if I add the option wrap: 50

<tag attributea="valuea" attributeb="valueb" 
attributec="valuec"></tag>

Alternatively, if I instead add the option indent-attributes: on I get

<tag attributea="valuea"
     attributeb="valueb"
     attributec="valuec"></tag>

What I want and what I think should be the default behavior for the wrap: 50 option, is this:

<tag attributea="valuea" attributeb="valueb"
     attributec="valuec"></tag>

I checked out, built and inspected the code with GDB to see if I could quickly implement this myself. I can see that the indenting is done in a loop by inserting either a tab or a space character after all call to GetSpaces which returns the value pprint->ident[0]. I think I just need to make the smallest of changes to either PPrintAttrValue or PPrintAttribute, but I really have no idea how to proceed or whether this should be done in a different part of the code entirely.

If this isn't a bug and the strange default alignment of attributes is desirable, perhaps indent-attributes could be changed to an enum option with: on, off, and afterwrap possibilities?

geoffmcl commented 7 years ago

@allsey87 thank you for your issue, which I think at this time would be a Pretty Print Feature Request

For your examples it would be good to show the full config passed to tidy. I assume -xml, but still get warnings about the comma, ,? How do I get rid of this?

You seem to be searching in the correct code, pprint.c, in PPrintAttrValue or PPrintAttribute, but I am not sure I exactly understand what you want to happen... except it seems you want the attribute indented, even without the --indent-attributes on, or something... You need to explain a little more here...

Maybe it would be possible change indent-attributes to an AutoBool, which then has 3 states - auto, yes',no`, and have different actions for each state...

Anyway, at this time seek more information... and of course patches or a PR to show what you want more exactly... thanks...

allsey87 commented 7 years ago

Hi @geoffmcl,

Sorry about the commas, I accidentally added them when typing up the issue. Here is a complete example alongside my configuration files:

Unformatted input

<?xml version="1.0" ?>
<my-example>
<anode attributea="valuea"
attributeb="valueb"
attributec="valuec"
attributed="valued"/>
</my-example>

Configuration 1

indent: yes
indent-attributes: no
indent-spaces: 4
wrap: 72
input-xml: yes
output-xml: yes

Output 1

<?xml version="1.0"?>
<my-example>
    <anode attributea="valuea" attributeb="valueb" attributec="valuec"
    attributed="valued" />
</my-example>

Configuration 2

indent: yes
indent-attributes: yes
indent-spaces: 4
wrap: 72
input-xml: yes
output-xml: yes

Output 2

<?xml version="1.0"?>
<my-example>
    <anode attributea="valuea"
           attributeb="valueb"
           attributec="valuec"
           attributed="valued" />
</my-example>

What I am after is similar to Output 1, however with the strings attributea and attributed aligned in the output. That is, I would like the output of Configuration 1 to produce this:

Pseudo Output 1

<?xml version="1.0"?>
<my-example>
    <anode attributea="valuea" attributeb="valueb" attributec="valuec"
           attributed="valued" />
</my-example>

I have the repo checked out and compiling, so if you can give me some hints on how to change this behavior, I can implement and test them as well as submitting patch.

geoffmcl commented 7 years ago

@allsey87 ah, thank you for clarifying... I now think I understand...

I too have long considered this missing an indent bump after a wrap as very close to a bug!

And this does not only apply to xml, but there is the same problem with html, like the following with default config plus -i --show-body-only yes...

<div class="div_class">
<p class="paragraph_class" id="unique_id" title="About this paragraph" tabindex="1">
paragraph 1</p>
</div>

Will output:

<div class="div_class">
  <p class="paragraph_class" id="unique_id" title=
  "About this paragraph" tabindex="1">paragraph 1</p>
</div>

I too think that "About... would look better if indented, 2 spacees in this case...

Now doing this may be quite difficult, in that, of course if it wraps to a 3rd or more lines, then we can not add more indents. And in this case, when the paragraph is finished, we have to drop the indent, and any extra secondary indent added... On the other hand, maybe this will be quick and easy...

Maybe what is applied to xml output could also be applied to html output... they share some pprint services...

This must also work if the user chooses to use tabs in place of spaces, and we have a long outstanding bug, #335, concerning too many tabs being added... And may be related to #403, preserving the tab space in a <pre>...

As you will learn, tidy first of all accumulates the output in an internal buffer, the pprint->linebuf, in AddC, AddAsciiString, ... and only when this buffer is flushed is wrap and indent applied, and the actual output happens, WriteChar...

We could even extend, add to, the TidyPrintImpl and/or the TidyIndent structures, to carry, preserve more information. This is carried in the doc structure passed to just about every pprint function...

If you have the htacg repo checked out, you might also consider cloning the repo into your own github space, and check that out. Then you can test and experiment, in say an issue-508 branch, and push things to your own repo. I would then be able to checkout your repo, and test, suggests, etc... and when all done you can present a PR, which preserves your name on the commits... but patches are also fine...

It would be very much appreciated if you could look into this, and I will help where ever I can... thanks...

allsey87 commented 7 years ago

Hi @geoffmcl,

I will check out the bugs you mentioned and clone / create a branch for this bug. One question though, it seems that the doc structure already has enough information to do the indentation as demonstrated by output 2 in my last message, is this not the case?

If I remember correctly, pprint->ident is an array of size 2. I suspected that pprint->ident[1] or pprint->ident[0] would have contained the indentation information that I needed and I could simply use this value in a loop to create the correct amount of indention. However, if I remember correctly, pprint->ident[1] was just set (initialized) to -1.

It seems that enabling the option indent-attributes causes the tidy to take different routes through the application (based on my back traces with gdb), so I will need to have a look closer at this.

geoffmcl commented 7 years ago

@allsey87 yes, the doc structure contains indent information, and maybe this is sufficient right now... but just saying should you need more...

And yes, you need to use information from pprint->ident[2], to re-generate the current required indent, which it does now fine, but after there has been a wrap, need to add one more... the trick will be to know a wrap has happened, and output the indent amount one more time...

Yes, enabling indent-attributes causes a different code path...

In PPrintAttribute, you can see indAttrs = cfgBool( doc, TidyIndentAttributes );. It gets some extra xtra = AttrIndent( doc, node, attr );, added to indent if not first attribute...

Now we want this extra also applied after a wrap, and indAttrs == no...

So I have corrected your sample - https://github.com/geoffmcl/tidy-test/blob/master/test/input5/in_508.html - removed the commas, and using config -xml -w 50 --show-info no F:\Projects\tidy-test\test\input5\in_508.html started testing...

The problem in this case is the wrapping that takes place after ...="valueb" has been added to the print buffer, this is flushed, and that leaves just attributec="valuec"></tag> tail in the internal print buffer...

So when the print buffer is finally flushed, the current code does not look at what happened, and thus does not add any indent to this...

The wrapping happens in CheckWrapIndent, when the next space/indent count, GetSpaces, plus the print buffer contents, pprint->linelen, is greater than or equal to the configured wrap length, cfg(doc, TidyWrapLen)... it seems we should store this yes, that we wrapped, somewhere...

Now we have effectively, in this case, finished printing the tree of elements, and then we have the final TY_(PFlushLine)( doc, 0 );... Note it is called with indent == 0... but that does nothing here...

Then in PFlushLineImpl the WantIndent will return no... but even if we could somehow enter the indenting loop, because we do want to indent this last content, it would still fail because GetSpaces would also return 0, from pprint->indent[ 0 ].spaces...

That is why I suggest we may need to keep the fact that the current print buffer contents are after a wrap, and thus require an indent... if you see what I mean...

Maybe I need to construct a bigger sample, or use -w 30, that requires wrapping to 3 or more lines... code added to ensure the 2nd, 3rd, etc lines are correctly indented, could also be used to cause this final flush to add an indent... or something... that is an output -

<tag attributea="valuea"
attributeb="valueb"
attributec="valuec"></tag>

can be what I think we would like to see -

<tag attributea="valuea"
  attributeb="valueb"
  attributec="valuec"></tag>

Will work on this now and then... but do not see an easy answer yet... maybe you will find something... thanks...

geoffmcl commented 7 years ago

@allsey87 Just pushed some tests to the issue-333 branch, trying to solve #335 and #333 ...

The interesting thing added is a service to output the indent_char... It is this service that would add the indent to subsequent content lines, just before they are flushed to output...

It is now a question of figuring out the signal, the sequence of events, flags, markers, values, that would allow this to be called, when internal print buffer is flushed... maybe a step forward?

allsey87 commented 7 years ago

I spent a couple hours looking at this issue. I should point out that what I am looking for is slightly different to what @geoffmcl wants. That is, I'm looking to indent the second line of attributes so that it is vertically aligned with the first attribute on the first line. In contrast, I think @geoffmcl is looking to indent the second, third line etc constantly by a fixed number of spaces or tabs.

The actual amount of identation that I need is easily obtainable by calling AttrIndent( doc, node, attr ) inside PPrintAttribute. This is value is stored in the xtra variable, however, there is some hack at pprint.c on lines 1239-1240 which is setting this value to zero when the ident-attributes is disabled in the config. By commenting out this hack and adding xtra to ident on line 1263, I am able to approximately obtain what I want, however, it fails at times depending seemingly on how long the line is and where the wrapping occurs.

geoffmcl commented 7 years ago

@allsey87 did not know we had this slight difference, but it seems just in the amount of extra indent... I am ok with either amount... aligning all subsequent wrapped lines to the first attribute is also fine...

You are certainly reading the right area of code in pprint.c... at least for the attribute output...

Whether the person that added /* fix for odd attribute indentation bug triggered by long values */ thought of it as a hack or a fix does not really matter... we could probably dig back in git/cvs history and find who, and maybe more on why... I hope it was not me... I am usually quite verbose in my comments...

So what exactly is the problem with how long the line is and where the wrapping occurs? Can this too be fixed or hacked, which ever description you prefer?

And what about the final eventual flush, where both indent and xtra are ignored... now outside printing attributes...

Will try to get around to some more testing... but thanks for looking deeper into this... maybe you'll find something... thanks...

jan-matejka commented 7 years ago

I'd like this feature as well.

geoffmcl commented 6 years ago

Although there have been no recent comments it does seem like a reasonable Pretty Print Feature Request... look forward to further feedback... but meantime moving out the milestone... thanks...