validator / validator

Nu Html Checker – Helps you catch problems in your HTML/CSS/SVG
https://validator.github.io/validator/
MIT License

Achieve information parity in JSON API results with the W3 non-nu validator #105

Open AMcBain opened 9 years ago

AMcBain commented 9 years ago

I maintain the bot, Selvvir, for the #css IRC channel on Freenode. I've been using the existing (non-nu) validator interface to enable the bot to report on the results of validation for websites people explicitly ask the bot about.

That interface is now suggesting I use the nu interface to make my requests. However the nu interface is missing information present in the non-nu validator's JSON and SOAP 1.2 output.

The non-nu validator output: https://validator.w3.org/check?uri=google.com&output=json https://validator.w3.org/check?uri=google.com&output=soap12

The nu validator output: https://validator.w3.org/nu/?doc=http%3A%2F%2Fgoogle.com&out=json

Specifically, the non-nu validator returns an explicit declaration of the content type (text/html, etc.) and encoding (utf-8, etc.) of the document being validated. The SOAP response of the non-nu validator also returns the doctype. I believe this information is useful to display in my condensed output report.

It would be quite nice if the nu validator's output could be updated with a source block equivalent to the non-nu validator's JSON output including the content type, encoding, and doctype information.

sideshowbarker commented 9 years ago

This seems reasonable and I’ll look at what specific code changes I’d need to make to get it added. In the meantime it’d also be good to get an opinion from @hsivonen on whether he thinks this has enough general utility to justify the cost of adding it.

hsivonen commented 9 years ago

What's the use case for obtaining the content type, encoding and doctype? I can imagine content type and encoding having some legitimate use. I have a hard time imagining the doctype having legitimate uses.

AMcBain commented 9 years ago

In our case if someone is having issues and we feed it to the bot to get back validation results, we can easily point to things like a missing or old doctype. Things that might cause quirksmode or other odd stuff. Encoding is similarly nice because if things aren't showing up right it's an easy pointer to say "well, it's not [UTF-8]. That's your problem."

The people who join #css run the gamut from those who are pretty good and just got stuck on something complex (or had a brain fart and missed something obvious) to people who don't have the slightest idea about much of anything or even appear to have picked up web development yesterday. So to help us to help them, I think those two things would make a nicer condensed output.

As to content type, I can't defend that. :) If you know of a use, great. Otherwise I don't have one at the moment.

The old bot I replaced had the output "« [truncated version of input URL] » markup • errors: [number of errors] • warnings: [number of warnings] • doctype: [doctype] • charset: [charset] • validation result: [shortened URL to the validator page with the full results]"

Since I didn't want to parse SOAP output to get the doctype, the current bot outputs "« %s » markup • errors: %s • warnings: %s • charset: %s • validation result: %s". So basically the same output as the old bot, just missing the doctype entry.
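For illustration, here is a minimal Python sketch of how a condensed line like the one above could be assembled from the checker's JSON output. It assumes each entry in the `messages` array has a `type` field, with warnings reported as `type` "info" plus `subType` "warning"; that shape is an assumption here, not something confirmed in this thread. The function names `count_messages` and `summarize` are hypothetical, and this is not the bot's actual code.

```python
from typing import Any, Dict, List, Tuple

# Format string quoted from the comment above.
SUMMARY_FORMAT = (
    "« %s » markup • errors: %s • warnings: %s • charset: %s • validation result: %s"
)


def count_messages(messages: List[Dict[str, Any]]) -> Tuple[int, int]:
    """Return (errors, warnings) for a list of checker messages.

    ASSUMPTION: errors have type == "error"; warnings have
    type == "info" and subType == "warning". Anything else is ignored.
    """
    errors = 0
    warnings = 0
    for message in messages:
        if message.get("type") == "error":
            errors += 1
        elif message.get("type") == "info" and message.get("subType") == "warning":
            warnings += 1
    return errors, warnings


def summarize(url: str, messages: List[Dict[str, Any]], charset: str, result_url: str) -> str:
    """Build the condensed one-line report for an IRC channel."""
    errors, warnings = count_messages(messages)
    return SUMMARY_FORMAT % (url, errors, warnings, charset, result_url)
```

A caller would pass the truncated input URL, the parsed `messages` list, the charset, and a shortened link to the full results page.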

AMcBain commented 9 years ago

Oh! Hey, so I thought I should point out that two of these three are already available: charset and content-type. However they're supplied as part of either a warning (charset), which does not exist if you use UTF-8, and an info block (content-type). This requires string parsing to get at which would be fragile if future code updates ever changed the format. Which is why above I was hoping for a separate block or part of the message that contains the info directly.
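To make the fragility concrete, here is a rough Python sketch of the string parsing this would require. The two regular expressions are guesses at the message wording (a content type or encoding name in quotation marks after a fixed phrase). They are assumptions and not the checker's documented format, which is exactly why this approach breaks if the wording changes. The helper names are hypothetical.

```python
import re
from typing import Any, Dict, List, Optional, Pattern

# ASSUMED wording: these patterns are illustrative guesses, not a documented format.
CONTENT_TYPE_PATTERN = re.compile(r'Content-Type was [“"]([^”"]+)[”"]')
ENCODING_PATTERN = re.compile(r'encoding [“"]([^”"]+)[”"]', re.IGNORECASE)


def first_match(pattern: Pattern[str], messages: List[Dict[str, Any]]) -> Optional[str]:
    """Return the first captured group found in any message text, else None."""
    for message in messages:
        found = pattern.search(message.get("message", ""))
        if found:
            return found.group(1)
    return None


def extract_content_type(messages: List[Dict[str, Any]]) -> Optional[str]:
    """Pull the content type out of the info message, if one is present."""
    return first_match(CONTENT_TYPE_PATTERN, messages)


def extract_encoding(messages: List[Dict[str, Any]]) -> str:
    """Pull the encoding out of the charset warning.

    No charset warning is emitted for UTF-8 documents, so a missing
    match is treated as "utf-8".
    """
    return first_match(ENCODING_PATTERN, messages) or "utf-8"
```

If the message text changed, both extractors would quietly fall through to their defaults rather than fail loudly, which is the risk a dedicated block in the output would avoid.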

I also found this page https://github.com/validator/validator/wiki/Output:-JSON which may have been started from an older doc because it indicates that you're already returning the info I want as part of a separate "source" block.

Thank you. :)

preaction commented 9 years ago

I think it's more usefully "This is the content type, doctype, and encoding that the validator used to validate this document", which is important to know when interpreting the results. The heuristics the validator uses to determine this should not be duplicated by the other systems (that sounds like incompatibility waiting to happen). Having that info be just another log message makes it hard to usefully present that info to the user without simply giving them the entire logged output from the validator. Worse, if someone tries parsing those info lines to get at the data, they run afoul of those lines changing somehow. So either your log text is now an API, or the consumer must adapt to changing log text.

sideshowbarker commented 9 years ago

In our case if someone is having issues and we feed it to the bot to get back validation results, we can easily point to things like a missing or old doctype. Things that might cause quirksmode or other odd stuff.

OK, that use case does make some sense. But that said, the checker is already always going to emit an error or warning message about a missing or old doctype. So the only case where you're not going to see that is if you're only looking at some very (overly) condensed results.

In such a case I don't see why the information about the missing or old doctype is any more important than any of the error messages that (it seems) don't get included in the condensed results. Can you explain why it would be?

One thing that needs to be explained here about the doctype information vs other information is that the doctype information is not something that normally gets exposed to the rest of the checker backend beyond the parser. The reason for that is, the doctype information is never relevant to anything beyond the parser. Specifically, it's not relevant at all for checking whether elements and attributes in the document conform to requirements in the HTML spec.

So there is a significant cost to exposing the doctype information further down the chain. And at this point it's not clear the cost of doing that is outweighed by any benefits.

Encoding is similarly nice because if things aren't showing up right it's an easy pointer to say "well, it's not [UTF-8]. That's your problem."

True, but again, the checker is already always going to emit an error or warning message about a document that uses any encoding other than UTF-8. So as with the doctype info, it's not clear to me yet why the encoding info should be given any more importance than other errors that could be presented in condensed output. I guess just because it's an error that can be reported less verbosely?

sideshowbarker commented 9 years ago

I should point out that two of these three are already available: charset and content-type. However they're supplied as part of either a warning (charset), which does not exist if you use UTF-8, and an info block (content-type).

Right.

However they're supplied as part of either a warning (charset), which does not exist if you use UTF-8,

In that case, though, you know the encoding is UTF-8, so you can safely have your tool report it as such.

This requires string parsing to get at which would be fragile if future code updates ever changed the format.

We have no plans to change the content of those charset and content-type messages. I can't remember having ever made changes to them in the past, and I can't think of any circumstances under which we'd want to change them in the future.

I realize that doing string parsing from the text of error messages is some extra work for a consuming tool to implement. But it's also extra work for us to implement separate (redundant) reporting just for those two things upstream. And it's not clear to me yet how many other developers of consuming applications out there would use this if we added it.

sideshowbarker commented 9 years ago

This requires string parsing to get at which would be fragile if future code updates ever changed the format.

I'd also like to add that in the unlikely event we ever did change the text of those messages, I would imagine you'd notice pretty quickly, and it would be a relatively minor amount of work to update the string matching in your tool to recognize the new message.

I can imagine there are also certain messages that developers of other reporting tools might find it particularly useful to catch and highlight in some way in their reporting. We cannot guarantee that we're not going to change the text of those messages either—nor of course could we add separate additional (and again, redundant) blocks in the error output for every type of error message someone else might want to highlight in their downstream reporting output.

AMcBain commented 9 years ago

OK, that use case does make some sense. But that said, the checker is already always going to emit an error or warning message about a missing or old doctype. So the only case where you're not going to see that is if you're only looking at some very (overly) condensed results.

In such a case I don't see why the information about the missing or old doctype is any more important than any of the error messages that (it seems) don't get included in the condensed results. Can you explain why it would be?

Well, since we are a CSS channel, a lot of the problems people see visually are due to validation errors in the markup. A bad doctype is one of those things. The results from the validator are condensed because it's an IRC bot. So we don't have a lot of space to show information. This is why we also link to the results page.

However it's not enough just to do that, because the condensed info can be quite useful in that if they have a whole bunch of errors we can point out that they need to go resolve those before we can really get to helping them about their styles. So you know maybe 1 error (if it's not a doctype error) isn't a big deal, but if we're seeing 20, or 40 ... that is.

As to why a missing doctype or such is important, browsers tend to switch into quirks mode, which really screws with rendering and makes it pretty hard for us to help people in that case. They need to fix that first. In doing so, their problems might even go away. If I could easily call out a bad doctype or missing doctype in the condensed IRC bot results, it'd be very easy to tell them without having to visit the full results page. Otherwise it'd just be hidden as 1 error of however many.

True, but again, the checker is already always going to emit an error or warning message about a document that uses any encoding other than UTF-8. So as with the doctype info, it's not clear to me yet why the encoding info should be given any more importance than other errors that could be presented in condensed output. I guess just because it's an error that can be reported less verbosely?

Yes, it does. However I think it's more of a meta-warning or meta-error. Most warnings and errors from the system are for content itself, whereas encoding and content-type are header-based info or from http-equiv meta tags.

Also I can definitely present it more concisely than the message I get back as a warning, just by presenting the encoding by itself (or similar). In order to do that, I currently have to parse your warning message, the format of which may change in the future and break my stuff.

In that case, though, you know the encoding is UTF-8, so you can safely have your tool report it as such.

Yeah, but that doesn't help me figure out the other cases when it's not, without string parsing. :)

I realize that doing string parsing from the text of error messages is some extra work for a consuming tool to implement. But it's also extra work for us to implement separate (redundant) reporting just for those two things upstream. And it's not clear to me yet how many other developers of consuming applications out there would use this if we added it.

Well I think I'd like to reiterate in context of the original post I made at the top. The previous validators were already providing this information. If you choose not to provide this information, you're either reducing the information that was obtainable or making that same info harder to obtain. I'm likely not the only consumer of this API, since it was there long before I found it and decided to use it, so I would hazard a guess that there are probably quite a few people who relied on that information being called out explicitly.

I'd also like to add that in the unlikely event we ever did change the text of those messages, I would imagine you'd notice pretty quickly, and it would be a relatively minor amount of work to update the string matching in your tool to recognize the new message.

I can imagine there are also certain messages that developers of other reporting tools might find it particularly useful to catch and highlight in some way in their reporting. We cannot guarantee that we're not going to change the text of those messages either—nor of course could we add separate additional (and again, redundant) blocks in the error output for every type of error message someone else might want to highlight in their downstream reporting output.

That may be, but I think that's kind of unfair to ask of downstream developers. The reason is that you've now turned warning output into a part of the API, beyond just being reported info like anything else, and most people would likely expect an API not to change. And even though it might be quick to update downstream code, they might not be able to get around to making the changes that fast.

Finally, I'm not asking for every possible warning or error to be called out separately. Nor for them to be called out if they were removed from the standard output listing. What I was looking for was feature parity for three particular called-out pieces of document meta-information that existed in the older validator. This information was fixed, and is not going to change given development is no longer progressing on it.

sideshowbarker commented 9 years ago

What I was looking for was feature parity for three particular called-out pieces of document meta-information that existed in the older validator.

Please recognize that feature parity with the legacy validator in all things is not an explicit goal for this checker. The legacy validator actually has many misfeatures that we have no intention of replicating. I'm not saying that this is one of those misfeatures but it's not clear to me yet that we want to have feature parity on this particular one or what priority this should have relative to the roughly 2 dozen or so other issues we currently have open here.

AMcBain commented 9 years ago

Well I wasn't asking for it to be done now. Like right now. I do have patience. :)

I just submitted for consideration, believing that at least two of the three things would be somewhat easy to do given they're already provided in some form: encoding and content-type.

These are just things I feel that I, and quite possibly others, would need before we switched away from the old validator to the new one. Hopefully they'd be more easily provided as we've talked about before the old validator goes the way of the dodo.

If encoding and content-type were provided and this was left open (or this one closed and another ticket opened for the remainder) until a later date after other more important tickets are completed I think I would be elated at that, as it would at least let me continue output as I have it now from the old validator.

(By that I mean splitting it in to two parts, content-type and encoding vs doctype, to enable them to be put in the todo list order independently if that helps with getting tasks done. It also would then leave doctype up for debate separately from the others.)

Please recognize that feature parity with the legacy validator in all things is not an explicit goal for this checker. The legacy validator actually has many misfeatures that we have no intention of replicating.

Certainly. However I think as these three things were called out explicitly in the past validator people may have come to depend on them compared to the messages. I rather think that most people avoided trying to parse the info/error messages and output bits and pieces of them or to find particular messages so I'm guessing any "misfeatures" in there wouldn't really be missed.

sideshowbarker commented 9 years ago

I think it's more usefully "This is the content type, doctype, and encoding that the validator used to validate this document", which is important to know when interpreting the results.

Why, exactly? It's not axiomatic (that this data is "important to know when interpreting the results"). Especially it's not clear that it's always important to report to end users.

And regardless, the validator does already always report the content type, and it also already emits a message for any document with an encoding that's not UTF-8.

The heuristics the validator uses to determine this should not be duplicated by the other systems (that sounds like incompatibility waiting to happen).

The checker doesn't use heuristics to determine those things. It strictly follows the requirements for HTML parsers that are defined in the HTML spec. That is literally all that it's doing. Any other tool or system these days that's separately processing HTML content should also be using an HTML parser that conforms to the requirements in the spec. Otherwise, that’s incompatibility waiting to happen.

Having that info be just another log message makes it hard to usefully present that info to the user without simply giving them the entire logged output from the validator.

As I stated in an earlier comment here, it's not yet clear to me why these particular pieces of information/errors are any more important to separately call out to the user. Normally, users should rightly see the entire logged output from the checker. I cannot think of many cases where selectively highlighting just these three pieces of information is going to be especially useful to the user.

Worse, if someone tries parsing those info lines to get at the data, they run afoul of those lines changing somehow. So either your log text is now an API, or the consumer must adapt to changing log text.

For better or worse, for the vast majority of errors, the log text is in fact the API. We make no guarantees that the text of any particular message won't change. (That said, of course we're also not going in and changing any arbitrarily.) But for these particular messages, as I noted in an earlier comment here, I cannot anticipate a reason why we'd ever want to change them. So in that regard, this part of the de-facto API is probably way more stable relative to the rest.

sideshowbarker commented 9 years ago

These are just things I feel that I, and quite possibly others, might need before we switched away from the old validator to the new one. Hopefully they'd be more easily provided as we've talked about before the old validator goes the way of the dodo.

Yeah, I'm highly sympathetic to that. We really need for everybody to be moving away from the legacy validator. So if this helps enough people with that, it certainly merits some additional priority.

sideshowbarker commented 9 years ago

I also found this page https://github.com/validator/validator/wiki/Output:-JSON which may have been started from an older doc because it indicates that you're already returning the info I want as part of a separate "source" block.

That information is actually current. It’s just that we only emit the source field if you specify the showsource param in your request. For example, see this output:

https://validator.w3.org/nu/?doc=https%3A%2F%2Fgithub.com&out=json&showsource=yes

You should find "source":{"type":"text/html","encoding":"utf-8",… in that output.

Of course along with that metadata about the source, it also emits the entire HTML source for the document. But since it sounds like your tool isn't passing on the entire output to the user anyway, maybe that's not a problem for your use case? I mean, your tool doesn't actually need to process the source just because it's emitted in the output, any more than it needs to consume the content of any of the other messages that might be emitted.

So maybe for right now at least you can already get most of what you need just by adding the showsource param to your request. It just lacks the doctype information (which I recognize you saying is still important to you).
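As a rough illustration of that suggestion, the Python sketch below builds a request URL with the `showsource` param and reads `type` and `encoding` from the `source` object of an already-parsed JSON response. The `doc`, `out` and `showsource` parameter names and the `source` field names are taken from the URLs and output quoted above. The function names are hypothetical, and fetching the response over HTTP is left out.

```python
from typing import Any, Dict, Optional, Tuple
from urllib.parse import urlencode

CHECKER_ENDPOINT = "https://validator.w3.org/nu/"


def build_request_url(document_url: str, show_source: bool = True) -> str:
    """Build a checker request URL that asks for JSON output.

    With show_source=True the showsource=yes param is added, which is
    what makes the "source" object appear in the response.
    """
    params = {"doc": document_url, "out": "json"}
    if show_source:
        params["showsource"] = "yes"
    return CHECKER_ENDPOINT + "?" + urlencode(params)


def source_metadata(response: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (content type, encoding) from a parsed JSON response.

    Both values are None when the response has no "source" object,
    for example when showsource was not requested.
    """
    source = response.get("source") or {}
    return source.get("type"), source.get("encoding")
```

The document source that comes back alongside this metadata can simply be ignored by the caller.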

AMcBain commented 9 years ago

The checker doesn't use heuristics to determine those things. It strictly follows the requirements for HTML parsers that are defined in the HTML spec. That is literally all that it's doing. Any other tool or system these days that's separately processing HTML content should also be using an HTML parser that conforms to the requirements in the spec. Otherwise, that’s incompatibility waiting to happen.

Well I think your last bit was more his point. It's far easier and less prone to problems to get the information from someone who knows it and has already used the right tools than for us to do it ourselves, even with the right libraries. We're calling out to your app/framework so that our consumer, the IRC bot, doesn't have to implement this sort of functionality. You will have (or already have) gotten it right, we're just piggybacking.

As I stated in an earlier comment here, it's not yet clear to me why these particular pieces of information/errors are any more important to separately call out to the user. Normally, users should rightly see the entire logged output from the checker. I cannot think of many cases where selectively highlighting just these three pieces of information is going to be especially useful to the user.

I can. I see these three things as the "big things" of meta information about documents. A missing end tag might cause some small portion of the page to not render right, but a missing/wrong doctype or non-UTF8 encoding would cause the entire page to render wrongly.

Yeah, I'm highly sympathetic to that. We really need for everybody to be moving away from the legacy validator. So if this helps enough people with that, it certainly merits some additional priority.

I really think it would. For those who already consume the JSON API like I do, all we'd have to do is change the URL if the source object still existed with the info we were looking for in it.

AMcBain commented 9 years ago

Of course along with that metadata about the source, it also emits the entire HTML source for the document. But since it sounds like your tool isn't passing on the entire output to the user anyway, maybe that's not a problem for your use case? I mean, your tool doesn't actually need to process the source just because it's emitted in the output, any more than it needs to consume the content of any of the other messages that might be emitted.

.

So maybe for right now at least you can already get most of what you need just by adding the showsource param to your request. It just lacks the doctype information (which I recognize you saying is still important to you).

Oh. Whoa. Yeah, that gives me two out of the three. I don't need the entire source output, though, no. It would go unused.

Maybe I missed some relevant wiki page, but I didn't know that showsource param existed until you just mentioned it, despite that wiki page. It doesn't say on there that you need that param to get it to show up.

AMcBain commented 9 years ago

It just lacks the doctype information (which I recognize you saying is still important to you).

Correct.

sideshowbarker commented 9 years ago

Maybe I missed some relevant wiki page, but I didn't know that showsource param existed until you just mentioned it, despite that wiki page. It doesn't say on there that you need that param to get it to show up.

Yeah, we don’t have all the possible params documented in the readme file, but for now it’s at least documented in the wiki at https://github.com/validator/validator/wiki/Service:-Common-parameters#parameters-for-all-facets along with other parameters that are common to all output formats.

AMcBain commented 9 years ago

Ok, great. :)

sideshowbarker commented 9 years ago

I can. I see these three things as the "big things" of meta information about documents.

I hear you saying that but I'm not convinced that problems with any of those three things are necessarily often (or even usually) the most helpful problems to call out to the user. That said, I recognize that in some highly condensed output, you have to make some choices about what to show.

A missing end tag might cause some small portion of the page to not render right, but a missing/wrong doctype or non-UTF8 encoding would cause the entire page to render wrongly.

It's clear that a missing/wrong doctype puts browsers into quirks mode, yeah. But for a lot of documents that doesn't affect the rendering of the document drastically (or maybe even at all), so I'm not sure that's absolutely one of the most important errors to call out. (Also, in some cases an author may actually be depending on a document to be rendered in quirks mode, and adding or changing the doctype might actually break the author's existing formatting expectations).

And as far as the encoding goes, if an author has actually properly encoded a document in some encoding other than UTF-8, then naively changing the charset metadata in the header or using a meta element is in fact likely to cause the document to be rendered incorrectly.

Finally, regarding the example you cited of a missing end tag, there are in fact a lot of cases where a missing end tag is going to cause significant problems in the rendering of the document. Most drastically, if a document omits a title end tag or script end tag in the head of the document, that's often going to cause the entire contents of the document to not be rendered at all. Along with those, consider what happens if a document omits a ul or ol end tag, or omits an i or b end tag. Those are all much bigger problems than anything a wrong or missing doctype is going to cause on its own.