htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org
2.71k stars 418 forks source link

Prevent tidy from quoting attributes #439

Open marcoscaceres opened 8 years ago

marcoscaceres commented 8 years ago

There are cases in HTML5 where one wants to simply put the name of an attribute in a tag, and for it not to be set to anything (specially for proprietary attributes). It would be useful if there was an option to honor this behavior so that, for instance:

<pre>
  <a idl>x</a>
</pre>

Does not become:

<pre>
  <a idl="">x</a>
</pre>

If there is already some way to do that, apologies. I was not able to find an option for it.

geoffmcl commented 8 years ago

@marcoscaceres, thanks for the comment... I searched, but did not find any option for this...

If seems if the attribute is not marked as a boolean type, ie has CH_BOOL bit, like say nowrap, checked, ... etc, that ="" will unconditionally be added, if no value given...

So it seems a new option would be the only way to go to inhibit this behavior...

There is a README OPTIONS.md, which enumerates the basic four steps of a new option, the last just being its use in the code...

But would ask you, or others, to comment on the use case for this... thanks

marcoscaceres commented 8 years ago

But would ask you, or others, to comment on the use case for this...

I'm trying to use tidy with a HTML-like markup language that we use to produce specs at the W3C (using a tool called BikeShed).

BikeShed allows unquoted proprietary attributes to appear in markup (e.g., <pre idl>, where idl means the contents of the pre is to be treated as "WebIDL" - which is a W3C thing for describing JavaScript bindings). There are several of these little attributes, and more could be added to BikeShed over time.

Anyway, our primary use case is for tidy to leave those attributes alone and not quote them - so that when we git diff after Tidy, those attributes won't have changed. We are primarily interested in tidying the content of elements.

Use case/problem is described in a bit more detail here: https://github.com/tabatkins/bikeshed/issues/662

geoffmcl commented 7 years ago

Since we are about to release 5.4, and nothing has been done about this Feature Request, can only move the milestone to next 5.5.

As always, further comments, patches, PR very welcome... thanks...

fyiman commented 7 years ago

@marcoscaceres - It sounds like you want a way to classify your attribute such as "idl" as a boolean type. I don't know if that's already possible, but it seems to me to be a way to do what you want.

marcoscaceres commented 7 years ago

yeah, essentially. If there was a way of saying, "these attributes are boolean attributes" in the config, that would be great. Don't consider this high priority, however.

geoffmcl commented 7 years ago

@fyiman yes, tidy does have a way to classify an attribute as boolean, like the above mentioned nowrap, but that only exists for known, allowed attributes, as enumerated within tidy code... but this is a proprietary attribute...

And @marcoscaceres as you point out there is presently no way of saying this idl is boolean... although we can suppress the warning with --warn-proprietary-attributes no...

But I did think of one thing. We have an option literal-attributes, which defaults to no. I experimented with the following small patch -

diff --git a/src/pprint.c b/src/pprint.c
index 68e8d44..8f84b8d 100644
--- a/src/pprint.c
+++ b/src/pprint.c
@@ -1118,6 +1118,11 @@ static void PPrintAttrValue( TidyDocImpl* doc, uint indent,
     if ( delim == 0 )
         delim = '"';

+   if (value && (value[0] == 0) && cfgBool(doc, TidyLiteralAttribs))
+   {
+       return; /* Issue #439 - If '--literal-attributes yes', and value is null, no '=""' */
+   }
+
     AddChar( pprint, '=' );

     /* don't wrap after "=" for xml documents */

This effectively says, if the attribute value is zero in length, and the user has requested --literal-attributes, then treat this as a boolean type... does not seem to stretch the meaning too much... but would probably need to adjust the docs to mention this behaviour...

I have already run the full regression tests with this patch, and seems no problem... probably just because there are no examples in the tests, since dropping the ="" is certainly a change!

What do you guys, and others, think? thanks...

balthisar commented 7 years ago

@geoffmcl, this is a perfectly reasonable change.

balthisar commented 6 years ago

@geoffmcl, @marcoscaceres, I've added PR #654 that address this, as it was quick and dirty. Instead of hijacking literal-attributes, though, it was simple to add boolean-attributes instead, so this does this.

But, I don't like this (or @geoffmcl's original proposal). It limits you to an all-or-nothing approach for attributes without values, and there's a big difference between a boolean attribute, and an attribute without a value.

As-is (and in the heart of it, I only implemented @geoffmcl's patch), if I specify in the source document a null attribute, it's still turned into a boolean attribute, and so I regard this new option as destructive.

As we're getting close to a planned 5.6.0 release, I wonder if we can simply close the PR, and leave this issue open. I would propose doing this, instead, although additional recommendations would be welcome:

Right now, today, Tidy already is compliant with the second form, i.e., booleans are given the valid null string value. Browsers and other user agents should not break on this, because it's a correct use of boolean attributes. Thus leaving them naked is only a pretty printing preference. If other scripts or tools break when a null string is found, then that script/tool is not compliant with the standard.

Edit: And we have to be mindful of XHTML, by the way.

marcoscaceres commented 6 years ago

I'm ok with leaving this open and taking a different approach after 5.6. I'm not sure what the right approach is, but I'm liking a combination of what you are proposing above, particularly points 1 and 3 somehow being merged together.

geoffmcl commented 6 years ago

@balthisar wow, you have certainly expanded the scope of the original @marcoscaceres request - read the title...

I thought my patch did exactly what was required, and more to the letter of the literal-attribute description, namely to preserve what was there, in that if there was no value, then do not add =""... not exactly hijacking, more like making it do what it promises...

And note, I think this PPrintAttrValue, where my patch is, is only called if the attribute is not marked boolean... need to check more, verify, but that was what was expected... maybe XHTML???

I guess I made the mistake of referring to this as a boolean type attribute... It is not! It is an attribute that literally has no value, and suggested a way tidy could honor that fact, with an existing option...

I agree with "there's a big difference between a boolean attribute, and an attribute without a value."... never meant to suggest differently... sorry if otherwise was implied...

Three Other Issues

But this was not meant to mess with what I think is a completely separate issue - should we offer an option for truely marked boolean atributes, to be output as say naked, the default, null, add ="", or named? I have not read any request for this. Maybe I missed it... Is there a use case?

And has nothing to do with another completely separate issue - does tidy have all boolean attribute? Or is there a need to add another new option, like new-boolean-attributes? Maybe, but again nothing to do with this issue...

AND, this has nothing to do with what I would also consider a separate tidy bug issue, namely it will not flag a warning on say disabled="false"... it should, and it should fix it... Need to open a issue, unless someone beats me to it...

So, aside from the above three separate, generally unrelated issues, do not agree with PR #654...

Even find the comment description as false, /**< Treat attributes without values as boolean */. There was no such intention in the patch... thus also do not like the name boolean-attributes... maybe more like null-attribute-values, or something...

And also not sure of booleans are given the valid null string value. in current tidy. Only had time to quickly test my new sample in_439-1.html, but it seems checked was output as naked.

It is also noted the validator flags the disabled="false" error, which tidy overlooks, and even outputs it as the same... UGH! As stated, need an issue for this obvious bug...

Do agree we move this out of pending release 5.6! And hope the PR is removed/closed... thanks...

balthisar commented 6 years ago

@geoffmcl, you're right, a lot of those new issues, but they're raised by looking at how the patch works. With your patch:

<mytag id="" name="" gh> becomes <mytag id name gh>, which addresses the feature request, but is destructive to the id and name attribute, which are supposed to be attributes with no values, and we've turned them into boolean attributes, which are not the same thing. It's also not honoring "literal-attributes" because now they're not literal; we've dropped the empty values.

If want to engineer this in a logical manner, then we do have to consider how we handle boolean attributes overall, thus the related bug being mention, and the overall expanded scope. Like so many things with Tidy, a seemingly simple change has unintended consequences.