htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org
2.72k stars 420 forks source link

tidy complains about <a name="x"> and <div id="x"> being duplicate IDs #786

Open petdance opened 5 years ago

petdance commented 5 years ago
$ cat foo.html
<!DOCTYPE html>
<html lang="en-US">
    <head>
        <title> Test </title>
    </head>
    <body>
        <a name="content"></a>
        This &lt;a&gt; tag is for bookmarking.

        <div id="content">
            This &lt;div&gt; is for semantic markup.
        </div>
    </body>
</html>
$ tidy -q -e foo.html
line 10 column 9 - Warning: <div> anchor "content" already defined

The id in the <div> is seen as a duplicate of the name in the <a>. I don't think that that is actually a conflict.

I put this HTML into https://validator.w3.org/nu/ and it only warned that "The name attribute is obsolete. Consider putting an id attribute on the nearest container instead."

I understand that the <a name> is obsolete, but I still don't think that it should be seen as conflicting with a <div id> of the same value.

geoffmcl commented 5 years ago

@petdance thanks for the issue... quite interesting... sort of exposed conflict...

In researching this, I can see one of the actions of libTidy is to clean and repare, C&R, a document, and one of those actions is to examine each anchor node, in TY_(FixAnchors)(doc, &doc->root, wantNameAttr, yes);, which specifically checks for name and id, and if it finds just name, adds id as well...

Setting the option anchor-as-name: no will drop the name, but leave the potentially libTidy created id...

Hence the exposed created conflict... which libTidy reports... with minimal info... docs improvement?

And it has been doing this since some time before the last cvs release in 2009... but, as always, a long history does not make the action right... has it been discussed before... can't find ATM...

If this is considered an ongoing problem, it would seem the only answer would be to add an option, to specifically inhibit, or modify, this particular C&R action... is there a use case...

Looking forward to further feedback... thanks...

petdance commented 5 years ago

I don't know that there's a "use case". I just see a bug.

I think that it's incorrect to conflate <a name> and <div id>. In the example above, tidy tells me that I have a duplicated ID, but there is no duplicated ID. https://validator.w3.org/nu/ does not complain about a duplicated ID.

I don't know what else there is to say about it.

geoffmcl commented 5 years ago

@petdance sorry, what exactly is the tidy bug?

Of course nu will only mention the use of name if given the input... so...

There is no conflation of anything...

As stated, tidy chooses to fix the problems it sees in the anchor? Is that the bug you see? Tidy should not fix things, or something...

Given your input, at present the default tidy output, with -w 0 is, in part is -

<a name="content" id="content"></a> This &lt;a&gt; tag is for bookmarking.

If you add --anchor-as-name no it will be -

<a id="content"></a> This &lt;a&gt; tag is for bookmarking.

Is there some error you see in this output?

Do you want to instruct tidy to inhibit, modify its fix, of the anchor? If inhibit why? If modify, to what?

Because, if left unchanged, and the document also then contains say <div id="content">, then tidy will report Warning: <div> anchor "content" already defined...

Maybe could do better with the docs/description here, like explicit, or even created in c&r, or something... somehow making it clear somehow this is a created situation... only in existing in the output...

As will the nu validator, if run on the tidy results...

Am I misunderstanding something here? Please explain... thanks...

petdance commented 5 years ago

I am not using tidy to modify HTML. I am only using it to validate existing HTML. I am only using the errors & warnings.

The problem is that when I run

$ tidy -q -e foo.html

I get this warning:

line 10 column 9 - Warning: <div> anchor "content" already defined

Tidy is telling me that there is already an ID called "content", but that's incorrect. There is only one ID called "content".

The results are the same if I add --anchor-as-name yes or --anchor-as-name no.

geoffmcl commented 5 years ago

@petdance ah, your use of tidy... I am only using the errors & warnings.... so...

line 10 column 9 - Warning: <div> anchor "content" already defined

Here, in reading the input, and studying the output, this warning warns you of TWO potential problems... but the second only exists after a fix has been applied to the first...

The first, use of name in an anchor.

Tidy found <a name="content">, so added id="content", as a fix, it has done for ages, and could drop the name with --anchor-as-name no...

This is a so called silent fix... I am also somewhat ambivalent on these... in general, always feel tidy should spell out what it is doing...

So yes, tidy could add a warning something like - In anchor, found obsolete "name", and added "id". See option "anchor-as-name" -

Is that what you are suggesting? That would certainly help in understanding what follows in this case...

Then the second potential problem. Oops, you already have <div id="content">... so this clashes with the tidy anchor fix... Need author to choose, and fix input document...

I do not think some argument like - I want to use obsolete name="content" on an anchor so it does not clash with <div id="content"> - is a valid html request... one or the other has to be changed... IMO...

Really, I am just trying to find what tidy change you want... thanks...

petdance commented 5 years ago

Tidy should not give me warnings about anything other than the file that I've given Tidy to check. If there's not a problem in the original file, then Tidy should not give me a warning.

What I understand is that tidy takes this:

<div id="x"><a name="x">

and turns it into this

<div id="x"><a id="x">

and then gives me a warning based on the transformed HTML that I as a user did not give to tidy.

When I see an error message saying "There is something wrong with line 10", I as a user assume that it's talking about the line 10 of the file that I gave it.

Tidy should not give me a warning about the HTML that it generated itself. Tidy created that error. The error is not in the HTML that I gave to tidy to check.

Even if Tidy said "I have changed <a name> to <a id>", it's still giving me an error about something that does not exist in the file I gave to Tidy.

geoffmcl commented 5 years ago

@petdance we seem to be spinning our wheels... stuck in the verbal mud ;=))

You keep repeating, in several different ways, about an error that does not exist in the input...

Wrong! I can only respectfully suggest that it does exist implicitly, at least - that is suggested though not directly expressed - in the input...

When you wrote <a name="x">, tidy silently helps by creating id="x", that you meant to write, hopefully ;=))

But semantics aside...

Can only repeat what tidy change do you want?

As this demonstrates, tidy does not only report on the input, during the reading of the stream stage, but it also reports on what is, will be, the output nodes, during its C&R stage... that is on the final complete package...

These two stages are separate libTidy API calls - Stage 1: tidyParseXXXX and then Stage 2: tidyCleanAndRepair...

Theoretically you could build your own app, say Tidy-parse-only, that only had the first API call, stage 1, and exit. That way you would only get any errors found in the input...

And in this case <div id="x"><a name="x"> would probably yield none... but quite unsure of the usefulness of such an app... it would miss so much...

The current console tidy.c always does both, and much more - as it should - will not change, but as always output improvements welcome...

This seems to be going nowhere... hope others will chime in... thanks...

ler762 commented 5 years ago

@petdance I think what you have is an error. Starting off with <!DOCTYPE html> is html5 - correct? In which case <a name=xxx isn't allowed

@geoffmcl should tidy use html4 rules for docs that start with

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

https://www.w3.org/TR/html4/struct/links.html#anchors Destination anchors in HTML documents may be specified either by the A element (naming it with the name attribute), or by any other element (naming with the id attribute).

so should it leave <a name="content"> alone in that case? or say screw it & convert to html5 valid syntax? (which doesn't seem to be valid html4 syntax)

petdance commented 5 years ago

is html5 - correct? In which case <a name=xxx isn't allowed

Yes, that's correct, <a name> isn't allowed in HTML5 and that would be a good error to have, too.

petdance commented 5 years ago

I think the miscommunication is because @geoffmcl is talking about the internals of Tidy, and I'm talking about what the end user sees.

I'm a user who is asking for Tidy to tell me how my HTML looks. Tidy tells me Warning: <div> anchor "content" already defined, but that is not correct, because that error does not apply to the file I gave to Tidy. It applies to the file that Tidy has modified internally.

If the warnings said something like this, that would at least tell the user that the error is not in the original document.

The warnings that follow apply to Tidy's internal modification of your document.
line 10 column 9 - Warning: <div> anchor "content" already defined

Add a warning about this? ie the creation of an implicit id="x"...

More docs info that ... the actual problem may be elsewhere, even in something implicitly added by tidy?

Either of those would give me a clue, yes.

What would be most useful, I think, is if there was a way to tell Tidy "I just want to know about the errors that are in my original document." That's what my expectation is when I run tidy foo.html, but if we need to make it explicit and have a "check-only" mode, that would be fine, too.

geoffmcl commented 5 years ago

@ler762, @petdance now we seem to be getting somewhere... maybe several Feature Requests, FR, are needed...

Concerning <a name> and HTML4 vs HTML5

I noted this at the beginning of my code search. It seems the maintainers, before my time, who I understand were quite close to the W3C itself, chose for libTidy to say, in either H4 or H5 case - screw it & convert to html5 valid syntax - without a warning... so <a name="x"> is converted to <a name="x" id="x"> silently... been like this for maybe more than 10 years...

I have already lamented that this fix is silent... maybe FR 1: add a warning for this change...

Then there is sort of FR 2: like don't do this if H4 doctype... leave <a name="x"> alone...

They did added an option --anchor-as-name no to drop the name... presently defaults to yes... maybe FR 3: change default...

Note, if you add a HTML4 header, then <a name="x" id="x"> will not flag an error in the W3C validator... quite telling... and may influence the above FR decisions...

Anyone interested to pursue one or all these FR, or even bugs, please open a new issue!

Comments, patches, PR very welcome... thanks...

Improve the line 10 warning

Yes, one idea is your general statement - the actual problem may be elsewhere, even in something implicitly added by tidy. ... that needs to be added to the general tidy docs, one or more times... where?

I have already explored this a little, but could not see an easy solution...

In this created id case, there presently seems no flag to say this attribute is implicit, like there is when tidy creates a node, see node, so there is no way, when the conflict is later discovered, to do something like -

line 10 column 9 - Warning: &lt;div&gt; anchor "content" already [implicitly] defined 

So another idea would be to add a Bool implicit; flag to struct _AttVal, if it is thought important enough...

Maybe this could also be a separate new issue...

Comments, patches, PR very welcome... thanks...

User expectations of tidy

Yes, in general tidy tries to meet everyone reasonable expectations... sometimes the big problem is everyone is different...

In this case, I have tried to show, at length, as best I can, that expecting I just want to know about the errors that are in my original document. is not really possible... not reasonable even, given that tidy reports on the total package...

Yes, I suppose there could be a new option like check-input-only, and tidy.c would skip the tidyCleanAndRepair, but...

Really... that sort of breaks tidy's purpose - to fix a document... and as suggested do not think this would give you anything... but...

Comments, patches, PR very welcome... thanks...

ler762 commented 5 years ago

Tidy tells me Warning: <div> anchor "content" already defined, but that is not correct, because that error does not apply to the file I gave to Tidy.

Yes! I totally agree. The whole screw it & convert to html5 valid syntax - without a warning idea is wrong; tidy is supposed to tell you about any errors it finds. What if tidy did something like

$ tidy -q foo.html
line 7 column 9 - Error: 'name' attribute is not allowed in <a>, adding id="content"
line 10 column 9 - Warning: <div> anchor "content" already defined
<!DOCTYPE html>

@petdance would that be good enough for you?

petdance commented 5 years ago

Ideally, I'd like to be able to say "Don't do any modification and just tell me what is wrong with what I gave you". What you've proposed is a good middle ground.

geoffmcl commented 5 years ago

@ler762 thank you for the quick reply... sorry that I can not always oblige...

This seems a test implementation of FR 1 ...

YEAH! Thank you! Been long in coming... tidy must not be silent, at least for its educational aims... which are strong...

But you know me, the real test is in the exact implementation... not just the idea agreed...

Like do you really mean Error, ie return/exit 2, with all that means... maybe ok... especially if in H5 mode... help user correct a document, before output...

But is it a minor glitch? Why not just a warning? doc fixed...

Are you too accepting H4 and H5, what the hell policy? like W3C?... or should that be a later new issue...

@petdance seems this middle ground is a good step in the right direction...

Looking forward to more improvements... thanks...

ler762 commented 5 years ago

@petdance

Ideally, I'd like to be able to say "Don't do any modification and just tell me what is wrong with what I gave you".

I agree. But tidy was written to clean things up, so getting a msg when it changes something might be the best you can hope for.

@geoffmcl

Are you too accepting H4 and H5, what the hell policy?

No. But you seem to get annoyed if I criticize tidy, so I kept my opinion of tidy taking a valid html4 doc and creating an invalid doc to myself. As well as my opinion of the --anchor-as-name option.

But is it a minor glitch? Why not just a warning? doc fixed...

If I'm being pedantic, tidy doesn't actually fix an html5 doc unless anchor-as-name is off & it breaks an html4 doc no matter how anchor-as-name is set. But whatever.. for html 5, the name attribute in an <a> tag is not allowed, so it seemed reasonable to call it an error. Call it a warning if you like.. just as long as tidy spits out an error/warning message about what it did as well as giving me the ability to suppress that error/warning message. Remember - my use case is tidy cleaning up docbook generated html4; I don't want to see all the msgs for tidy adding an id=tag to <a name=tag instances.

geoffmcl commented 5 years ago

@ler762 I am sorry you think I get annoyed over criticism of tidy... I don't think I do, but... oh, well...

I certainly will call out what I think is an incorrect assertion... off in some way... or even incomplete, lacking some qualification... even typos, slips, like say forgetting to add something, etc, etc...

And really try to be helpful about why I think it is wrong... add references, etc, etc... that is not getting annoyed, in my book...

Exactly like someone saying tidy is taking a valid html4 doc and creating an invalid doc and then it breaks an html4 doc... IMO that does not seem exactly correct...

I assume you refer to <a name="x" id="x"> as invalid... how do you read say anchors-with-id... this clearly points out id can be part of an a element, in html4... also the validator seems to accept it...

Or am I misunderstand something... then what is the invalid part you suggest? Just trying to exactly understand what you think is broken... Please clarify what you mean... thanks...

And when called out, of course it can be annoying if a commenter just repeats the assertion, without any further clarification... that is, adds nothing but repeated noise... but with that I tend to go into quiet mode, and try to gently ignore...

And to be clear, you do remember that flagging an error means tidy will not attempt output unless forced. It is not a question of what I like...

More a question of what you, and/or others, want this new message to do... (a) stop output until changed, or (b) just warn of the change... at this stage, I certainly would tend towards the latter... but maybe could be convinced otherwise... and may depend on, and be different for, html4 or html5 modes...

Look forward to further feedback, pseudo-code, patches, PR to try to move this foward... thanks...

ler762 commented 5 years ago

@geoffmcl

Just trying to exactly understand what you think is broken...

Go back to the initial problem report and replace the first line with

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

Is it then a valid html 4.01 doc? As far as I can tell, yes, it's valid html. But it's possible I'm missing something, so if it's not valid, what part of the html 4.01 spec is it not complying with?

Now run the doc through tidy. Is it still valid? I think the <a> and <div> having the same id="content" means no.

And to be clear, you do remember that flagging an error means tidy will not attempt output unless forced.

No, I didn't realise that prefacing a msg with "Error:" meant tidy will not attempt output unless forced. So make it "Warning: .." or leave the Error:/Warning: part off altogether.

geoffmcl commented 5 years ago

@ler762 I do lots of tests... have now push these 5 issue 786 tests to my https://github.com/geoffmcl/tidy-test/tree/master/test/input5 repo... hundreds of other test files...

Follows a brief summary of the tests done, and brief results, validator and tidy -

So yes, that html4 test was my in_786-2.html, created, tested, thought about, a few days ago... yes, tidy warns content duplicate... that was the beginning discussion...

I divided this up into several different, separate issues - see comment

  1. Concerning <a name> and HTML4 vs HTML5
  2. Improve the line 10 warning
  3. User expectations of tidy

We were talking about FR 1, maybe more on H4 vs H5... issue 1 - what to do about <a name="x">, if anything? and had asked this be a new issue... oh, well...

Now you seem back on about <div id="content">... you seem to have fallen backwards... back up the tree...

Hey, catch up mate, as we would say in Oz ;=))

Yes, it often falls to me, on reading, seeing some pseudo-code, or comments, to remind people of the actual tidy code... as best I can... sometimes on minimal information...

Of course if they took the time to actually try out their code, before commenting, present git patches, I am sure this would be less frequent... they would learn by the implementation, what is what... hopefully...

And to be very clear. I am not about to code any of this! It is just not that important to me... many other things take priority... but patches or PR will be considered, at least on that indicated, discussed and agreed, to some degree...

Progress would be much appreciated... thanks...

ler762 commented 5 years ago

@geoffmcl to close out this issue I'd just add a message saying what tidy did.

My apologies for not looking at the code first; I didn't realize that putting "Error: " in a message meant tidy would exit with an error status. So how about

$ tidy -q foo.html
line 7 column 9 - Warning: 'name' attribute is not allowed in <a>, adding id="content"
line 10 column 9 - Warning: <div> anchor "content" already defined
<!DOCTYPE html>

I divided this up into several different, separate issues - see comment Concerning and HTML4 vs HTML5

It'd be nice if there was a way to have tidy not add an id=... attribute for every <a name= attribute in an html4 doc but it's not something I'd work on right now.

Improve the line 10 warning

given that tidy adds an id= attribute to the <a name=, I don't see anything to improve.

User expectations of tidy

That tidy doesn't silently change things.

@geoffmcl:

you seem to have fallen backwards...

Not backwards; it's two different things:

  • What I'd do about this issue (add a msg about tidy adding the id attribute)
  • My opinion (I don't much like tidy adding the id attribute in html4 docs but I don't dislike it enough to come up with a patch)
RoyBellingan commented 4 years ago

Just happened also to me

<a name="top"></a> Became <a name="top" id="top"></a> (yes it makes no sense, but here is the web)

But just a few lines later I have <div id="top">

And boom, no xml parsing...

From what I can see, is not possible to opt for "do not add id" yet . (In theory this element beeing empty should also be dropped http://api.html-tidy.org/tidy/quickref_5.6.0.html#drop-empty-elements)