Open petdance opened 5 years ago
@petdance thanks for the issue... quite interesting... sort of exposed conflict
...
In researching this, I can see one of the actions of libTidy
is to clean and repare, C&R
, a document, and one of those actions is to examine each anchor
node, in TY_(FixAnchors)(doc, &doc->root, wantNameAttr, yes);
, which specifically checks for name
and id
, and if it finds just name
, adds id
as well...
Setting the option anchor-as-name: no
will drop the name
, but leave the potentially libTidy
created id
...
Hence the exposed created conflict... which libTidy
reports... with minimal info... docs improvement?
And it has been doing this since some time before the last cvs
release in 2009... but, as always, a long history does not make the action right... has it been discussed before... can't find ATM...
If this is considered an ongoing problem, it would seem the only answer would be to add an option, to specifically inhibit, or modify, this particular C&R
action... is there a use
case...
Looking forward to further feedback... thanks...
I don't know that there's a "use case". I just see a bug.
I think that it's incorrect to conflate <a name>
and <div id>
. In the example above, tidy tells me that I have a duplicated ID, but there is no duplicated ID. https://validator.w3.org/nu/ does not complain about a duplicated ID.
I don't know what else there is to say about it.
@petdance sorry, what exactly is the tidy
bug?
Of course nu
will only mention the use of name
if given the input... so...
There is no conflation
of anything...
As stated, tidy
chooses to fix the problems it sees in the anchor? Is that the bug you see? Tidy should not fix things, or something...
Given your input, at present the default tidy output, with -w 0
is, in part is -
<a name="content" id="content"></a> This <a> tag is for bookmarking.
If you add --anchor-as-name no
it will be -
<a id="content"></a> This <a> tag is for bookmarking.
Is there some error you see in this output
?
Do you want to instruct tidy to inhibit, modify its fix, of the anchor? If inhibit
why? If modify
, to what?
Because, if left unchanged, and the document also then contains say <div id="content">
, then tidy will report Warning: <div> anchor "content" already defined
...
Maybe could do better with the docs/description here, like explicit
, or even created in c&r
, or something... somehow making it clear somehow this is a created
situation... only in existing in the output...
As will the nu
validator, if run on the tidy results...
Am I misunderstanding something here? Please explain... thanks...
I am not using tidy to modify HTML. I am only using it to validate existing HTML. I am only using the errors & warnings.
The problem is that when I run
$ tidy -q -e foo.html
I get this warning:
line 10 column 9 - Warning: <div> anchor "content" already defined
Tidy is telling me that there is already an ID called "content", but that's incorrect. There is only one ID called "content".
The results are the same if I add --anchor-as-name yes
or --anchor-as-name no
.
@petdance ah, your use of tidy
... I am only using the errors & warnings.
... so...
line 10 column 9 - Warning: <div> anchor "content" already defined
Here, in reading the input, and studying the output, this warning warns you of TWO potential problems... but the second only exists after a fix has been applied to the first...
The first, use of name
in an anchor.
Tidy found <a name="content">
, so added id="content"
, as a fix, it has done for ages, and could drop the name
with --anchor-as-name no
...
This is a so called silent fix... I am also somewhat ambivalent on these... in general, always feel tidy
should spell out what it is doing...
So yes, tidy could add a warning something like - In anchor, found obsolete "name", and added "id". See option "anchor-as-name"
-
Is that what you are suggesting? That would certainly help in understanding what follows in this case...
Then the second potential problem. Oops, you already have <div id="content">
... so this clashes with the tidy
anchor fix... Need author to choose, and fix input document...
I do not think some argument like - I want to use obsolete name="content"
on an anchor so it does not clash with <div id="content">
- is a valid html request... one or the other has to be changed... IMO...
Really, I am just trying to find what tidy
change you want... thanks...
Tidy should not give me warnings about anything other than the file that I've given Tidy to check. If there's not a problem in the original file, then Tidy should not give me a warning.
What I understand is that tidy takes this:
<div id="x"><a name="x">
and turns it into this
<div id="x"><a id="x">
and then gives me a warning based on the transformed HTML that I as a user did not give to tidy.
When I see an error message saying "There is something wrong with line 10", I as a user assume that it's talking about the line 10 of the file that I gave it.
Tidy should not give me a warning about the HTML that it generated itself. Tidy created that error. The error is not in the HTML that I gave to tidy to check.
Even if Tidy said "I have changed <a name>
to <a id>
", it's still giving me an error about something that does not exist in the file I gave to Tidy.
@petdance we seem to be spinning our wheels... stuck in the verbal mud
;=))
You keep repeating, in several different ways, about an error that does not exist
in the input...
Wrong! I can only respectfully suggest that it does exist implicitly
, at least - that is suggested though not directly expressed
- in the input...
When you wrote <a name="x">
, tidy silently helps by creating id="x"
, that you meant to write, hopefully ;=))
But semantics aside...
Can only repeat what tidy
change do you want?
<a name="x">
? ie leave an obsolete
attribute... seems a bad idea...id="x"
...As this demonstrates, tidy does not only report on the input, during the reading of the stream stage, but it also reports on what is, will be, the output nodes, during its C&R
stage... that is on the final complete package...
These two stages are separate libTidy
API calls - Stage 1: tidyParseXXXX
and then Stage 2: tidyCleanAndRepair
...
Theoretically you could build your own app, say Tidy-parse-only
, that only had the first API call, stage 1, and exit. That way you would only get any errors found in the input...
And in this case <div id="x"><a name="x">
would probably yield none... but quite unsure of the usefulness of such an app... it would miss so much...
The current console tidy.c
always does both, and much more - as it should - will not change, but as always output improvements welcome...
This seems to be going nowhere... hope others will chime in... thanks...
@petdance I think what you have is an error. Starting off with
<!DOCTYPE html>
is html5 - correct? In which case <a name=xxx
isn't allowed
@geoffmcl should tidy use html4 rules for docs that start with
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
https://www.w3.org/TR/html4/struct/links.html#anchors Destination anchors in HTML documents may be specified either by the A element (naming it with the name attribute), or by any other element (naming with the id attribute).
so should it leave <a name="content">
alone in that case? or say screw it & convert to html5 valid syntax? (which doesn't seem to be valid html4 syntax)
is html5 - correct? In which case <a name=xxx isn't allowed
Yes, that's correct, <a name>
isn't allowed in HTML5 and that would be a good error to have, too.
I think the miscommunication is because @geoffmcl is talking about the internals of Tidy, and I'm talking about what the end user sees.
I'm a user who is asking for Tidy to tell me how my HTML looks. Tidy tells me Warning: <div> anchor "content" already defined
, but that is not correct, because that error does not apply to the file I gave to Tidy. It applies to the file that Tidy has modified internally.
If the warnings said something like this, that would at least tell the user that the error is not in the original document.
The warnings that follow apply to Tidy's internal modification of your document.
line 10 column 9 - Warning: <div> anchor "content" already defined
Add a warning about this? ie the creation of an implicit
id="x"
...More docs info that ... the actual problem may be elsewhere, even in something implicitly added by tidy?
Either of those would give me a clue, yes.
What would be most useful, I think, is if there was a way to tell Tidy "I just want to know about the errors that are in my original document." That's what my expectation is when I run tidy foo.html
, but if we need to make it explicit and have a "check-only" mode, that would be fine, too.
@ler762, @petdance now we seem to be getting somewhere... maybe several Feature Requests
, FR, are needed...
<a name>
and HTML4 vs HTML5I noted this at the beginning of my code search. It seems the maintainers, before my time, who I understand were quite close to the W3C itself, chose for libTidy to say, in either H4 or H5 case - screw it & convert to html5 valid syntax
- without a warning... so <a name="x">
is converted to <a name="x" id="x">
silently... been like this for maybe more than 10 years...
I have already lamented that this fix is silent... maybe FR 1: add a warning for this change...
Then there is sort of FR 2: like don't do this if H4 doctype... leave <a name="x">
alone...
They did added an option --anchor-as-name no
to drop the name... presently defaults to yes
... maybe FR 3: change default...
Note, if you add a HTML4 header, then <a name="x" id="x">
will not flag an error in the W3C validator... quite telling... and may influence the above FR decisions...
Anyone interested to pursue one or all these FR, or even bugs, please open a new issue!
Comments, patches, PR very welcome... thanks...
Yes, one idea is your general statement - the actual problem may be elsewhere, even in something implicitly added by tidy.
... that needs to be added to the general tidy docs, one or more times... where?
I have already explored this a little, but could not see an easy solution...
In this created id
case, there presently seems no flag to say this attribute is implicit
, like there is when tidy creates a node, see node, so there is no way, when the conflict is later discovered, to do something like -
line 10 column 9 - Warning: <div> anchor "content" already [implicitly] defined
So another idea would be to add a Bool implicit;
flag to struct _AttVal
, if it is thought important enough...
Maybe this could also be a separate new issue...
Comments, patches, PR very welcome... thanks...
Yes, in general tidy
tries to meet everyone reasonable expectations... sometimes the big problem is everyone is different...
In this case, I have tried to show, at length, as best I can, that expecting I just want to know about the errors that are in my original document.
is not really possible... not reasonable even, given that tidy reports on the total package...
Yes, I suppose there could be a new option like check-input-only
, and tidy.c
would skip the tidyCleanAndRepair
, but...
Really... that sort of breaks tidy's purpose - to fix a document... and as suggested do not think this would give you anything... but...
Comments, patches, PR very welcome... thanks...
Tidy tells me
Warning: <div> anchor "content" already defined
, but that is not correct, because that error does not apply to the file I gave to Tidy.
Yes! I totally agree. The whole screw it & convert to html5 valid syntax - without a warning
idea is wrong; tidy is supposed to tell you about any errors it finds. What if tidy did something like
$ tidy -q foo.html
line 7 column 9 - Error: 'name' attribute is not allowed in <a>, adding id="content"
line 10 column 9 - Warning: <div> anchor "content" already defined
<!DOCTYPE html>
@petdance would that be good enough for you?
Ideally, I'd like to be able to say "Don't do any modification and just tell me what is wrong with what I gave you". What you've proposed is a good middle ground.
@ler762 thank you for the quick reply... sorry that I can not always oblige...
This seems a test implementation of FR 1 ...
YEAH! Thank you! Been long in coming... tidy must not be silent, at least for its educational aims... which are strong...
But you know me, the real test is in the exact implementation... not just the idea agreed...
Like do you really mean Error
, ie return/exit 2, with all that means... maybe ok... especially if in H5 mode... help user correct a document, before output...
But is it a minor glitch? Why not just a warning
? doc fixed...
Are you too accepting H4 and H5, what the hell policy? like W3C?... or should that be a later new issue...
@petdance seems this middle ground is a good step in the right direction...
Looking forward to more improvements... thanks...
@petdance
Ideally, I'd like to be able to say "Don't do any modification and just tell me what is wrong with what I gave you".
I agree. But tidy was written to clean things up, so getting a msg when it changes something might be the best you can hope for.
@geoffmcl
Are you too accepting H4 and H5, what the hell policy?
No. But you seem to get annoyed if I criticize tidy, so I kept my opinion of tidy taking a valid html4 doc and creating an invalid doc to myself. As well as my opinion of the --anchor-as-name option.
But is it a minor glitch? Why not just a warning? doc fixed...
If I'm being pedantic, tidy doesn't actually fix an html5 doc unless anchor-as-name is off & it breaks an html4 doc no matter how anchor-as-name is set. But whatever.. for html 5, the name attribute in an <a> tag is not allowed, so it seemed reasonable to call it an error. Call it a warning if you like.. just as long as tidy spits out an error/warning message about what it did as well as giving me the ability to suppress that error/warning message. Remember - my use case is tidy cleaning up docbook generated html4; I don't want to see all the msgs for tidy adding an id=tag
to <a name=tag
instances.
@ler762 I am sorry you think I get annoyed over criticism of tidy... I don't think I do, but... oh, well...
I certainly will call out what I think is an incorrect assertion... off in some way... or even incomplete, lacking some qualification... even typos, slips, like say forgetting to add something
, etc, etc...
And really try to be helpful about why I think it is wrong
... add references, etc, etc... that is not getting annoyed
, in my book...
Exactly like someone saying tidy is taking a valid html4 doc and creating an invalid doc
and then it breaks an html4 doc
... IMO that does not seem exactly correct...
I assume you refer to <a name="x" id="x">
as invalid... how do you read say anchors-with-id... this clearly points out id
can be part of an a
element, in html4... also the validator seems to accept it...
Or am I misunderstand something... then what is the invalid
part you suggest? Just trying to exactly understand what you think is broken... Please clarify what you mean... thanks...
And when called out, of course it can be annoying if a commenter just repeats the assertion, without any further clarification... that is, adds nothing but repeated noise... but with that I tend to go into quiet mode, and try to gently ignore...
And to be clear, you do remember that flagging an error
means tidy will not attempt output unless forced
. It is not a question of what I like
...
More a question of what you, and/or others, want this new message to do... (a) stop output until changed, or (b) just warn of the change... at this stage, I certainly would tend towards the latter... but maybe could be convinced otherwise... and may depend on, and be different for, html4 or html5 modes...
Look forward to further feedback, pseudo-code, patches, PR to try to move this foward... thanks...
@geoffmcl
Just trying to exactly understand what you think is broken...
Go back to the initial problem report and replace the first line with
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
Is it then a valid html 4.01 doc? As far as I can tell, yes, it's valid html. But it's possible I'm missing something, so if it's not valid, what part of the html 4.01 spec is it not complying with?
Now run the doc through tidy. Is it still valid? I think the <a> and <div> having the same id="content"
means no.
And to be clear, you do remember that flagging an error means tidy will not attempt output unless forced.
No, I didn't realise that prefacing a msg with "Error:" meant tidy will not attempt output unless forced. So make it "Warning: .." or leave the Error:/Warning: part off altogether.
@ler762 I do lots of tests... have now push these 5 issue 786 tests to my https://github.com/geoffmcl/tidy-test/tree/master/test/input5 repo... hundreds of other test files...
Follows a brief summary of the tests done, and brief results, validator and tidy -
<div id="x"><a id="x">
to learn line no. is anchor, not div - val: dup show both - tidy: dup show anchorid
added, div as 'content2' - val: ok - tidy: okSo yes, that html4 test was my in_786-2.html, created, tested, thought about, a few days ago... yes, tidy warns content
duplicate... that was the beginning discussion...
I divided this up into several different, separate issues - see comment
<a name>
and HTML4 vs HTML5We were talking about FR 1, maybe more on H4 vs H5... issue 1 - what to do about <a name="x">
, if anything? and had asked this be a new issue... oh, well...
Now you seem back on about <div id="content">
... you seem to have fallen backwards... back up the tree...
Hey, catch up mate, as we would say in Oz ;=))
Yes, it often falls to me, on reading, seeing some pseudo-code, or comments, to remind people of the actual tidy code... as best I can... sometimes on minimal information...
Of course if they took the time to actually try out their code, before commenting, present git patches, I am sure this would be less frequent... they would learn by the implementation, what is what... hopefully...
And to be very clear. I am not about to code any of this! It is just not that important to me... many other things take priority... but patches or PR will be considered, at least on that indicated, discussed and agreed, to some degree...
Progress would be much appreciated... thanks...
@geoffmcl to close out this issue I'd just add a message saying what tidy did.
My apologies for not looking at the code first; I didn't realize that putting "Error: " in a message meant tidy would exit with an error status. So how about
$ tidy -q foo.html
line 7 column 9 - Warning: 'name' attribute is not allowed in <a>, adding id="content"
line 10 column 9 - Warning: <div> anchor "content" already defined
<!DOCTYPE html>
I divided this up into several different, separate issues - see comment Concerning and HTML4 vs HTML5
It'd be nice if there was a way to have tidy not add an id=...
attribute for every <a name=
attribute in an html4 doc but it's not something I'd work on right now.
Improve the line 10 warning
given that tidy adds an id=
attribute to the <a name=
, I don't see anything to improve.
User expectations of tidy
That tidy doesn't silently change things.
@geoffmcl:
you seem to have fallen backwards...
Not backwards; it's two different things:
Just happened also to me
<a name="top"></a>
Became
<a name="top" id="top"></a>
(yes it makes no sense, but here is the web)
But just a few lines later I have
<div id="top">
And boom, no xml parsing...
From what I can see, is not possible to opt for "do not add id" yet . (In theory this element beeing empty should also be dropped http://api.html-tidy.org/tidy/quickref_5.6.0.html#drop-empty-elements)
The
id
in the<div>
is seen as a duplicate of thename
in the<a>
. I don't think that that is actually a conflict.I put this HTML into https://validator.w3.org/nu/ and it only warned that "The name attribute is obsolete. Consider putting an id attribute on the nearest container instead."
I understand that the
<a name>
is obsolete, but I still don't think that it should be seen as conflicting with a<div id>
of the same value.