htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

JSON output #716

Open MaxLanar opened 6 years ago

MaxLanar commented 6 years ago

Hello,

flycheck, an on-the-fly syntax checking solution for GNU Emacs ( http://www.flycheck.org ), has an issue with tidy's output in localized environments. It can't run properly except in an English setup, because localized error messages also localize the diagnostic type (e.g. "Warning" becomes "Avertissement" in French), which makes them unparseable by Flycheck.
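
To make the failure concrete, here is a minimal sketch of the kind of pattern matching a checker might apply to tidy's text output. The regular expression is an assumption for illustration, not Flycheck's actual pattern, and the localized line is built by hand from the "Avertissement" example above rather than copied from real tidy output.

```python
import re

# Hypothetical pattern in the style of a syntax checker; not Flycheck's real one.
DIAGNOSTIC = re.compile(
    r"^line (?P<line>\d+) column (?P<column>\d+) - "
    r"(?P<level>Warning|Error): (?P<text>.*)$"
)


def parse(line):
    """Return a dict for a recognized diagnostic line, else None."""
    match = DIAGNOSTIC.match(line)
    if match is None:
        return None
    found = match.groupdict()
    found["line"] = int(found["line"])
    found["column"] = int(found["column"])
    return found


english = "line 2 column 1 - Warning: inserting implicit <body>"
# Hand-built illustration: only the level word is localized here.
localized = "line 2 column 1 - Avertissement: inserting implicit <body>"

print(parse(english))
print(parse(localized))  # the level word is not in the pattern
```

The second call finds no match because the pattern hard-codes the English level words, which is the situation described above.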

It has been figured out, here (flycheck's issue 1376), that the best way to resolve this, both for flycheck and tidy, would be for the tidy command line client to be able to produce JSON output.

So here is the feature request: please add JSON output to the tidy command line client.

Thanks !

geoffmcl commented 6 years ago

@MaxLanar thanks for the Feature Request...

Read through the links, and see you caught the attention of @balthisar...

Interesting use of the tidy console app... and glad you found the -lang en option...

Hmmm, produce JSON output? While I am somewhat familiar with json, can you give an example of what you expect?

Take for example $ echo "hello world" | tidy -lang en, you get the stderr output of -

line 2 column 1 - Warning: missing <!DOCTYPE> declaration
line 2 column 1 - Warning: plain text isn't allowed in <head> elements
line 2 column 1 - Info: <head> previously mentioned
line 2 column 1 - Warning: inserting implicit <body>
line 2 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML5
Tidy found 4 warnings and 0 errors!

About HTML Tidy: https://github.com/htacg/tidy-html5
Bug reports and comments: https://github.com/htacg/tidy-html5/issues
Official mailing list: https://lists.w3.org/Archives/Public/public-htacg/
Latest HTML specification: http://dev.w3.org/html5/spec-author-view/
Validate your HTML documents: http://validator.w3.org/nu/
Lobby your company to join the W3C: http://www.w3.org/Consortium

Do you speak a language other than English, or a different variant of 
English? Consider helping us to localize HTML Tidy. For details please see 
https://github.com/htacg/tidy-html5/blob/master/README/LOCALIZE.md

What would that look like in json, just to understand more?

Or perhaps more importantly what would be the json if you add say -lang fr... thanks...

balthisar commented 6 years ago

@geoffmcl, I encouraged them to file this feature request after a question to the W3C mailing list.

If you take a look at the XML output routines for documentation, this would look a lot like that, except dump JSON instead of XML, using the filter callback. In English/French/whatever, it would return all of the data from the TidyMessageCallback API, which includes both the built in strings (always) as well as the localized strings for the current language.

I'm actually surprised no one has asked for this facility before, because it almost eliminates the need for interfacing to the C library from non-C languages, as almost all of these scripting languages support some type of system() call and the ability to parse JSON.
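
As a rough sketch of that "system() call plus JSON parser" workflow, the following runs a command and decodes its standard output. No JSON option for tidy is assumed to exist here: the command is a stand-in that uses the Python interpreter to print a fixed document, and the helper name `run_and_parse` is made up for the example.

```python
import json
import subprocess
import sys


def run_and_parse(command):
    """Run a command and decode its standard output as JSON."""
    # tidy exits with status 1 on warnings and 2 on errors, so a non-zero
    # status is deliberately not treated as a failure (check=False).
    completed = subprocess.run(
        command, capture_output=True, text=True, check=False
    )
    return json.loads(completed.stdout)


# Stand-in for a JSON-emitting tidy; the real command line is not defined here.
FAKE_OUTPUT = json.dumps({"filename": "tidyme.html", "messages": []})
stand_in = [sys.executable, "-c", f"print({FAKE_OUTPUT!r})"]

report = run_and_parse(stand_in)
print(report["filename"], len(report["messages"]))
```

Swapping `stand_in` for a real command would be the only change needed once such an output mode exists.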

It would probably look something like:

{
  "filename": "tidyme.html",
  "messages": [
    "message": {
      "messageLine": 1, 
      "messageColumn": 22, 
      "messageLevel": 1, 
      "messageIsMuted": false, 
      "messageDefault": "inserting missing 'title' element",
      "message": "poniendo elemento 'title' que hace falta",
      _{etc}_
    },
  ]
}

This could be extended to include the actual document output, although I recommend still using STDOUT or a file for that (escaping a huge HTML document properly for JSON isn't pretty), include an array for "configuration", etc.
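
For the message part of this shape, a consumer could rely on the fields that do not change with the language. The sketch below assumes that "messages" is an array of plain objects (an entry keyed as "message" is not valid directly inside a JSON array), takes the field names from the sample above, and uses invented values.

```python
import json

# Assumed shape: field names from the sample above, values invented.
SAMPLE = """
{
  "filename": "tidyme.html",
  "messages": [
    {
      "messageLine": 1,
      "messageColumn": 22,
      "messageIsMuted": false,
      "messageDefault": "inserting missing 'title' element",
      "message": "poniendo elemento 'title' que hace falta"
    }
  ]
}
"""


def diagnostics(report):
    """Yield (line, column, stable_text, localized_text) for unmuted messages."""
    for entry in report["messages"]:
        if entry["messageIsMuted"]:
            continue
        yield (
            entry["messageLine"],
            entry["messageColumn"],
            entry["messageDefault"],  # always the built-in string
            entry["message"],         # follows the current language
        )


for line, column, stable, shown in diagnostics(json.loads(SAMPLE)):
    print(f"{line}:{column} {stable!r} shown as {shown!r}")
```

A tool can match on the built-in string while still showing the localized one to the user.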

Actually, separately, I might enter a feature request that allows tidycfg files to be written in JSON in the future, too. This potentially lessens the burden on many, many tools that work with the console application instead of using LibTidy directly.

geoffmcl commented 6 years ago

@balthisar thanks for the feedback, especially the sample json output... that really helped...

Quite some time ago I experimented with a tidy-json app, rendering the tidy DOM-like html tree as json... This is the output using my test input in_704.html, which as can be seen is just one line, hello & bye...

{
  "in_file" : "F:\\Projects\\tidy-test\\test\\input5\\in_704.html",
  "out_file" : "temp.json",
  "name" : "#Root",
  "content" : [
    {
      "name" : "#DOCTYPE",
      "attributes" : [
        {
        "name" : "PUBLIC"
        }
      ],
      "name" : "html",
      "content" : [
        {
          "name" : "head",
          "content" : [
            {
              "name" : "meta",
              "attributes" : [
                {
                "name" : "name",
                "value" : "generator"
                },
                {
                "name" : "content",
                "value" : "HTML Tidy for HTML5 for Windows version 5.7.3"
                }
              ],
              "name" : "title"
            }
          ],
          "name" : "body",
          "content" : [
            {
              "name" : "#Text",
              "value" : "hello & bye\r\n"
            }
          ]
        }
      ]
    }
  ]
}

It was not too difficult to do the same for the messages, using the TidyMessageCallback API you mentioned - This is what I got after the first cut, and it pointed out some interesting issues -

{
 "filename": "F:\\Projects\\tidy-test\\test\\input5\\in_704.html",
 "messages": [
    "message": {
      "messageLine": 1,
      "messageColumn": 7,
      "messageLevel": 351,
      "messageIsMuted": true,
      "messageDefault": "missing <!DOCTYPE> declaration",
      "message": "dclaration <!DOCTYPE> manquante"
    },
    "message": {
      "messageLine": 1,
      "messageColumn": 7,
      "messageLevel": 351,
      "messageIsMuted": true,
      "messageDefault": "texte brut isn't allowed in <head> elements",
      "message": "texte brut n'est pas permis dans les lments <head>"
    },
    "message": {
      "messageLine": 1,
      "messageColumn": 7,
      "messageLevel": 350,
      "messageIsMuted": true,
      "messageDefault": "<head> previously mentioned",
      "message": "<head> prcdemment mentionns"
    },
    "message": {
      "messageLine": 1,
      "messageColumn": 7,
      "messageLevel": 351,
      "messageIsMuted": true,
      "messageDefault": "inserting implicit <body>",
      "message": "insertion implicite de <body>"
    },
    "message": {
      "messageLine": 1,
      "messageColumn": 7,
      "messageLevel": 351,
      "messageIsMuted": true,
      "messageDefault": "inserting missing 'title' element",
      "message": "ajout d'un lment 'title' manquant"
    },
    "message": {
      "messageLine": 0,
      "messageColumn": 0,
      "messageLevel": 350,
      "messageIsMuted": true,
      "messageDefault": "Document content looks like HTML5",
      "message": "Le contenu du document ressemble  HTML5"
    },
    "message": {
      "messageLine": 0,
      "messageColumn": 0,
      "messageLevel": 357,
      "messageIsMuted": true,
      "messageDefault": "Tidy found 4 avertissements and 0 erreur!\n",
      "message": "Tidy a trouv 4 avertissements et 0 erreur!\n"
    }
 ]
}

First it does not pass some json checking s/w I have... bombs at about line 3... can not yet see the problem, and appreciate someone pointing out the missing , or ]... But that led to other things -

  1. As someone else pointed out, why is there no config option language: <lang>? It is mentioned in the man tidy page. Yes, there is a -language <lang> option in console tidy... strange... do not quite understand...

  2. As can be seen, in every case the "messageIsMuted": true. This is because the message.muted member is always non-zero. I was running in debug mode, and MSVC fills stack memory with 0xcdcdcdcd, so this member is just not initialized. Small bug...

  3. Since I ran in -lang fr mode, while most messageDefault is English, some are in French, or partial French? Do not know, but that feels like a bug...

Have not had a chance to look at 3. and understand why... but maybe this is the same as a previous bug where an output buffer was used as part of the input... especially given that some do seem a mixture of languages... so has maybe been fixed by a PR not yet merged... not sure...

Given that they can all be addressed, this raises the question of why this is not done in a separate tidy-json app, built and shipped with tidy, rather than yet again adding it to console tidy.c...

Anyway out of time today, but look forward to some interesting feedback... thanks...

balthisar commented 6 years ago

It looks correct on the surface, and everything that should be escaped looks escaped. I agree, there's a pointer somewhere screwed up; it works okay in LibTidy; you're not trying to hold anything after the callback returns, are you? Everything is deallocated after returning, so maybe that's it. Why does the French look like it's missing a lot of letters?

The sample is also exposing the messageLevel value. We can't let that happen. I'd skip emitting it completely, and use one of the other API accessors, like tidyGetMessagePrefixDefault, or one of the message keys.

Hmmm, I'll have to look into the messageIsMuted for the bug; I know that this was tested at one point. Just for the sake of caffeine management, you're not treating it as a string, right? It's a bool, so it has to be converted to the text true or false when printed into JSON.

I suppose the advantage of putting it into tidy.c is because everything is in tidy.c, including all of the documentation generation stuff, and everyone has tidy.c by default, without having to fuss with installer options, cmake options, etc. It's just another alternative output format, and there's not really a disadvantage.

I'll dig into it more, too, when I have some time. As you know, I've been swamped.

geoffmcl commented 6 years ago

@balthisar missing letters, yeah, my json escaping service omitted all utf-8 chars, now fixed... quick and crudely...
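
For comparison, a JSON encoder should never drop non-ASCII characters: it either passes them through as UTF-8 or writes them as \uXXXX escapes. The sketch below uses Python's standard json module, and the input is the first French message with its accent restored by hand, so treat the exact wording as an assumption.

```python
import json

# First French message from the sample, accent restored by hand (assumption).
text = "déclaration <!DOCTYPE> manquante"

ascii_only = json.dumps(text)                     # non-ASCII becomes \uXXXX
utf8_kept = json.dumps(text, ensure_ascii=False)  # non-ASCII passes through

print(ascii_only)
print(utf8_kept)

# Both forms decode back to the same string; nothing is lost.
assert json.loads(ascii_only) == text
assert json.loads(utf8_kept) == text
```

Either form is valid JSON, so the choice mostly depends on whether the consumer is known to read UTF-8.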

Of course I extract all the items into a std::string before the callback returns...

And now see the results->muted member is only set after this callback - messageobj.c - setting this needs to be moved to before the callback, or something...

I converted the messageLevel from a rather meaningless number to a string, and added a few more outputs... Still open to discussion what should and should not be included...
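
A number-to-name conversion of this kind could be sketched as a simple lookup. The three values are the ones visible in the first-cut output earlier in this thread; the names are an assumption based on the TidyReportLevel enum and should be verified against tidyenum.h. This is not the mapping tidy-json actually uses.

```python
# Values observed in the first-cut output in this thread. The names are an
# assumption based on the TidyReportLevel enum; verify against tidyenum.h.
LEVEL_NAMES = {
    350: "TidyInfo",
    351: "TidyWarning",
    357: "TidyDialogueSummary",
}


def level_name(level):
    """Map a numeric level to a name, keeping unknown values visible."""
    return LEVEL_NAMES.get(level, f"unknown({level})")


for value in (351, 350, 357, 999):
    print(value, level_name(value))
```

Keeping unknown numbers visible, rather than raising, means a newer library version with extra levels would not break the output.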

Will try to attach the file output, since I am always unsure what copy & paste does to utf-8... of course had to add the .txt extension to get it accepted -

tempmsg.json.txt

There are several points about making this a separate distributed app -

  1. There is no fuss with install... just a few cmake lines and it is done... that seems to be spreading FUD
  2. As a separated app it can have its own config options, like indenting and newlines. Unless the output is being read by a human, seems no need to pretty print it.
  3. Like say git, and many other unix apps, it should be broken into several apps, and to me that could include the current xml config outputs.
  4. Like my tidy-json there is no need to stick to K&R C... makes it easier to write, maintain, understand, etc using say C++ stl...
  5. Helps promote by example apps to show the power of the libtidy...
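
On point 2, the difference between pretty-printed and compact output is small to implement in most JSON libraries. As an illustration only, using Python's json module rather than the C++ of tidy-json and a made-up report value:

```python
import json

# Made-up report in the shape discussed in this thread.
report = {
    "filename": "in_704.html",
    "messages": [{"messageLine": 1, "messageColumn": 7}],
}

pretty = json.dumps(report, indent=2)                # for human readers
compact = json.dumps(report, separators=(",", ":"))  # for programs

print(pretty)
print(compact)
```

Both strings decode to the same data, so the option only affects size and readability.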

There seems no need to have an ever growing single tidy console app, that tries to do everything, every format... a single mega app... even seems contrary to some unix philosophy of many small tools...

I am still having a problem with the output passing my json checking s/w. It fails within the first 3 or 4 lines... Any help on that appreciated...
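
A possible cause, judging only from the first-cut sample pasted earlier in this thread: each entry inside the "messages" array is written as "message": { ... }, but a key is only allowed inside an object, not directly inside an array. The sketch below checks a miniature of that layout with Python's json module; the miniature is hand-made and the real file may differ.

```python
import json

# Miniature of the first-cut output: keyed entries directly inside an array.
broken = "\n".join([
    '{',
    ' "filename": "in_704.html",',
    ' "messages": [',
    '    "message": {',
    '      "messageLine": 1',
    '    }',
    ' ]',
    '}',
])

# Same data with each array entry as a plain object.
fixed = "\n".join([
    '{',
    ' "filename": "in_704.html",',
    ' "messages": [',
    '    {',
    '      "messageLine": 1',
    '    }',
    ' ]',
    '}',
])

try:
    json.loads(broken)
    failure = None
except json.JSONDecodeError as err:
    failure = err
    print(f"rejected at line {err.lineno}: {err.msg}")

print(json.loads(fixed)["messages"][0]["messageLine"])
```

The parser reads "message" as a string element and then stops at the colon, which would explain a failure within the first few lines.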

And now I have added the summary messages as well, but exactly what to include in the json could be subject to tidy-json specific config switches, in addition to the mentioned indenting and newlines options... so the user could select just as much as they need...

But whatever is eventually decided I am still seeing mixed language messages... Need to get to the bottom of this... ideas welcome... thanks...

geoffmcl commented 6 years ago

Seem to have solved the valid json problem, and have added some other members - as usual renamed it .txt to allow uploading - msg-fr.json.txt

And have opened a #719 issue to address the bug in the callback API...

This is the current tidy-json app... appreciated if others could clone, build, and test this...

Look forward to further feedback... thanks...

geoffmcl commented 6 years ago

Ok, my tidy-json app is getting stable and mature...

Have now added everything available from the TidyMessageCallback API - sample -

msg-fr2.json.txt

One of the last things is a sort of verbosity option, or more an information selection option, since it is probable that a user may not need all this information in json... need to think about this...
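
One simple form such an information selection option could take is a list of wanted keys applied to every message. The function name `select_fields` and the chosen keys are hypothetical; this is a sketch of the idea, not how tidy-json does or will do it.

```python
def select_fields(messages, wanted):
    """Keep only the requested keys from each message dict."""
    return [
        {key: entry[key] for key in wanted if key in entry}
        for entry in messages
    ]


# One message taken from the sample output earlier in this thread.
messages = [
    {
        "messageLine": 1,
        "messageColumn": 7,
        "messageDefault": "inserting implicit <body>",
        "message": "insertion implicite de <body>",
    }
]

# Hypothetical selection: position plus the built-in text only.
print(select_fields(messages, ["messageLine", "messageColumn", "messageDefault"]))
```

Unknown key names are skipped silently here; a real option might prefer to report them.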

And as noted in issue #719, have solved one of the problems, the messageIsMuted value, but there is still the problem of mixed languages in certain messages to be solved...

One interesting thing about the output of the nodes in json is that there is not an API to get whether a node is implicit, that is added by tidy, and not in the original input html... will try to find the time to add a Feature Request for this...

Look forward to others building and testing this, and further feedback... thanks...