validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/

Please add back `nu.validator.htmlparser.tools` #53

Open bmix opened 3 years ago

bmix commented 3 years ago

The original 1.4 distribution contained some example apps that could be used from the command line. The author stated:

Sample Apps

The jar file contains sample main() entry points:

nu.validator.htmlparser.tools.XSLT4HTML5
nu.validator.htmlparser.tools.XSLT4HTML5XOM
nu.validator.htmlparser.tools.HTML2XML
nu.validator.htmlparser.tools.XML2HTML
nu.validator.htmlparser.tools.XML2XML
nu.validator.htmlparser.tools.HTML2HTML

The first two are sample apps that demo the use of XSLT with HTML5. The first one can use SAX or DOM and requires the Xalan serializer. The second one uses XOM. Running without parameters dumps usage help.

java -cp htmlparser-1.4.jar nu.validator.htmlparser.tools.XSLT4HTML5 --template=sort-ul.xsl --input-html=test.html --output-html=out.html --mode=dom

HTML2XML converts HTML5 to XML 1.0 plus Namespaces. With no arguments, it reads from stdin and writes to stdout. With one parameter, it reads the named file and writes to stdout. With two parameters, the first is the input file name and the second is the output file name.

XML2HTML, HTML2HTML and XML2XML work analogously. The *2HTML versions produce bad output if the document tree is not serializable as HTML5. It is up to the user to make sure that it is.
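
For readers who no longer have a JAR with these classes, the zero/one/two-argument convention quoted above can be illustrated with a minimal JDK-only sketch. This is not the actual HTML2XML source: the class name `StreamArgsSketch`, the `Converter` interface and the `PASS_THROUGH` placeholder are hypothetical names made up for this example, and the placeholder simply copies bytes where the real tool would parse HTML and serialize XML.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch of the 0/1/2-argument convention described above.
// Not the real HTML2XML class; the conversion step is a placeholder.
class StreamArgsSketch {

    /** Stand-in for the real conversion step (assumption for illustration). */
    interface Converter {
        void convert(InputStream in, OutputStream out) throws IOException;
    }

    /** Placeholder converter: copies the input to the output unchanged. */
    static final Converter PASS_THROUGH = (in, out) -> in.transferTo(out);

    static void run(String[] args, InputStream stdin, OutputStream stdout,
                    Converter converter) throws IOException {
        if (args.length > 2) {
            throw new IllegalArgumentException("usage: [input-file [output-file]]");
        }
        // First argument, if present, names the input file; otherwise read stdin.
        InputStream in = args.length >= 1
                ? Files.newInputStream(Paths.get(args[0])) : stdin;
        try {
            // Second argument, if present, names the output file; otherwise write stdout.
            OutputStream out = args.length == 2
                    ? Files.newOutputStream(Paths.get(args[1])) : stdout;
            try {
                converter.convert(in, out);
                out.flush();
            } finally {
                // Only close streams this method opened itself.
                if (args.length == 2) out.close();
            }
        } finally {
            if (args.length >= 1) in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        run(args, System.in, System.out, PASS_THROUGH);
    }
}
```

Passing the streams and the converter as parameters keeps the argument handling separate from the conversion, so an actual parser-plus-serializer pipeline could be dropped in for `PASS_THROUGH` without touching the stream selection.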

The source code is in test-src/nu/validator/htmlparser/tools/, but none of the releases I found on Maven Central include the compiled classes. I do have an older JAR from years ago, also named htmlparser-1.4.jar, that contains these classes and is therefore usable from the CLI.

May I kindly ask you to bring these back, so that HTML can be converted to XHTML directly from the command line? Thank you!

dhouck commented 1 year ago

As far as I can tell, no existing tool does what HTML2XML does, and the obvious ways of writing one (e.g. Python BeautifulSoup, HTML Tidy) donʼt actually work right, especially around namespaces.

The version here also isnʼt ideal (Iʼm planning to submit another PR about that in a few minutes), but it would be better than everything else I could find, i.e. nothing.