htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

MS "smart tags" are not recognized #855

Open ivanov17 opened 4 years ago

ivanov17 commented 4 years ago

Hello! I used tidy 5.6.0 to process some old HTML pages that were made with MS Word back in the good old days. I'd like to clean up the HTML code, so I ran tidy with the option --word-2000 yes. In general, tidy does this job perfectly.

But these pages contain MS-specific "smart tags", and that is where I have a problem. Tidy doesn't process pages that contain <o:smarttagtype> tags and various <st1> tags such as <st1:state>, <st1:place>, and so on.

All the errors I get look the same:

line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 15 column 1 - Error: <o:smarttagtype> is not recognized!
line 17 column 1 - Error: <o:smarttagtype> is not recognized!
line 114 column 39 - Error: <st1:city> is not recognized!
line 114 column 49 - Error: <st1:place> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 11 column 51 - Error: <o:smarttagtype> is not recognized!
line 12 column 76 - Error: <o:smarttagtype> is not recognized!
line 1516 column 15 - Error: <st1:state> is not recognized!
line 1516 column 26 - Error: <st1:place> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 15 column 1 - Error: <o:smarttagtype> is not recognized!
line 17 column 1 - Error: <o:smarttagtype> is not recognized!
line 19 column 1 - Error: <o:smarttagtype> is not recognized!
line 21 column 1 - Error: <o:smarttagtype> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 15 column 1 - Error: <o:smarttagtype> is not recognized!
line 17 column 1 - Error: <o:smarttagtype> is not recognized!
line 19 column 1 - Error: <o:smarttagtype> is not recognized!
line 21 column 1 - Error: <o:smarttagtype> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 197 column 3 - Error: <st1:metricconverter> is not recognized!
line 199 column 33 - Error: <st1:metricconverter> is not recognized!
line 232 column 73 - Error: <st1:metricconverter> is not recognized!
line 244 column 28 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 345 column 20 - Error: <st1:metricconverter> is not recognized!
line 346 column 61 - Error: <st1:metricconverter> is not recognized!
line 386 column 14 - Error: <st1:metricconverter> is not recognized!
line 543 column 33 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 156 column 64 - Error: <st1:metricconverter> is not recognized!
line 193 column 28 - Error: <st1:metricconverter> is not recognized!
line 196 column 49 - Error: <st1:metricconverter> is not recognized!
line 198 column 46 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 221 column 74 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 194 column 34 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 203 column 60 - Error: <st1:metricconverter> is not recognized!
line 212 column 63 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 142 column 37 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 220 column 62 - Error: <st1:metricconverter> is not recognized!
line 230 column 78 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 155 column 18 - Error: <st1:metricconverter> is not recognized!
line 165 column 31 - Error: <st1:metricconverter> is not recognized!
line 175 column 35 - Error: <st1:metricconverter> is not recognized!
line 196 column 66 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 122 column 73 - Error: <st1:metricconverter> is not recognized!
line 137 column 71 - Error: <st1:metricconverter> is not recognized!
line 146 column 65 - Error: <st1:metricconverter> is not recognized!
line 160 column 26 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 214 column 10 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 691 column 17 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 498 column 40 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 248 column 31 - Error: <st1:metricconverter> is not recognized!
line 275 column 31 - Error: <st1:metricconverter> is not recognized!
line 350 column 16 - Error: <st1:metricconverter> is not recognized!
line 369 column 17 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 1144 column 78 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 840 column 35 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 15 column 1 - Error: <o:smarttagtype> is not recognized!
line 17 column 1 - Error: <o:smarttagtype> is not recognized!
line 106 column 20 - Error: <st1:date> is not recognized!
line 149 column 19 - Error: <st1:country-region> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 156 column 58 - Error: <st1:metricconverter> is not recognized!
line 158 column 75 - Error: <st1:metricconverter> is not recognized!
line 176 column 1 - Error: <st1:metricconverter> is not recognized!
line 199 column 33 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 158 column 71 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 197 column 64 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 134 column 64 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 143 column 21 - Error: <st1:metricconverter> is not recognized!
line 151 column 42 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 132 column 60 - Error: <st1:metricconverter> is not recognized!
line 201 column 12 - Error: <st1:metricconverter> is not recognized!
line 208 column 32 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 207 column 72 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 119 column 76 - Error: <st1:metricconverter> is not recognized!
line 157 column 3 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 157 column 62 - Error: <st1:metricconverter> is not recognized!
line 194 column 47 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 232 column 58 - Error: <st1:metricconverter> is not recognized!
line 235 column 47 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 141 column 53 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 15 column 1 - Error: <o:smarttagtype> is not recognized!
line 135 column 71 - Error: <st1:personname> is not recognized!
line 141 column 47 - Error: <st1:metricconverter> is not recognized!
line 218 column 46 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 236 column 15 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 315 column 18 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 169 column 29 - Error: <st1:metricconverter> is not recognized!
line 180 column 24 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 257 column 66 - Error: <st1:place> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 139 column 10 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 160 column 76 - Error: <st1:metricconverter> is not recognized!
line 176 column 37 - Error: <st1:metricconverter> is not recognized!
line 197 column 38 - Error: <st1:metricconverter> is not recognized!
line 206 column 32 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 301 column 44 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 121 column 12 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 224 column 14 - Error: <st1:metricconverter> is not recognized!
line 236 column 32 - Error: <st1:metricconverter> is not recognized!
line 286 column 71 - Error: <st1:metricconverter> is not recognized!
line 288 column 41 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 143 column 9 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 147 column 54 - Error: <st1:metricconverter> is not recognized!
line 163 column 52 - Error: <st1:metricconverter> is not recognized!
line 174 column 63 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 134 column 75 - Error: <st1:metricconverter> is not recognized!
line 327 column 75 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 124 column 3 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 254 column 30 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 396 column 11 - Error: <st1:metricconverter> is not recognized!
line 411 column 25 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 139 column 29 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 14 column 1 - Error: <o:smarttagtype> is not recognized!
line 404 column 14 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 262 column 16 - Error: <st1:metricconverter> is not recognized!
line 292 column 63 - Error: <st1:metricconverter> is not recognized!
line 293 column 80 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 144 column 80 - Error: <st1:metricconverter> is not recognized!
line 215 column 57 - Error: <st1:metricconverter> is not recognized!
line 233 column 67 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 180 column 38 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 119 column 33 - Error: <st1:metricconverter> is not recognized!
line 148 column 32 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 161 column 61 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 116 column 8 - Error: <st1:metricconverter> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
line 13 column 1 - Error: <o:smarttagtype> is not recognized!
line 724 column 66 - Error: <st1:personname> is not recognized!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

I tried other options such as --drop-proprietary-attributes yes and --bare yes, but I got the same errors.

I would appreciate advice on this or a workaround for the issue.

Thanks.


OS: Fedora release 30 (Thirty) x86_64
Kernel: 5.4.7-100.fc30.x86_64
HTML Tidy for Linux version 5.6.0
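
A log like the one above is easier to read once it is reduced to the distinct tag names. The following is only a sketch: the function name `summarize_unrecognized` is made up for illustration, and it assumes the diagnostics keep exactly the "line N column M - Error: <tag> is not recognized!" shape shown in this report.

```shell
# Hypothetical helper: count how often each unrecognized tag is reported.
# Reads tidy diagnostics on stdin, prints "COUNT TAG" lines, most frequent first.
summarize_unrecognized() {
  # Keep only the tag name from each "is not recognized!" line; drop everything else.
  sed -n 's/^line [0-9]* column [0-9]* - Error: <\([^>]*\)> is not recognized!$/\1/p' |
    sort | uniq -c | sort -rn |
    awk '{print $1, $2}'   # normalize the padding that uniq -c adds
}
```

Since tidy writes its diagnostics to stderr, a possible invocation would be `tidy -q -e --word-2000 yes page.html 2>&1 | summarize_unrecognized`, where `page.html` stands for one of the Word exports.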

ler762 commented 4 years ago

On 1/16/20, anarchist Ivanov notifications@github.com wrote:

Hello! I used tidy 5.6.0 to process some old HTML pages that were made with MS Word back in the good old days. I'd like to clean up the HTML code, so I ran tidy with the option --word-2000 yes. In general, tidy does this job perfectly.

But these pages contain MS-specific "smart tags", and that is where I have a problem. Tidy doesn't process pages that contain <o:smarttagtype> tags and various <st1> tags such as <st1:state>, <st1:place>, and so on.

       <.. snip ..>

I would like to get recommendations about it or a workaround for this issue.

It'd be nice if someone could say how '--new-inline-tags' was supposed to work.

I found this page https://www.farrail.com/pages/touren-engl/Steam-in-china-2017-Sandaoling+Fuxin.php that has various '<st1' tags, so I tried

tidy --new-inline-tags st1:city,st1:place,st1:country-region

and it mangled the file pretty badly.

Removing the tags and marking it as an html 4 doc worked for me:

sed -E -e '1,1 s@^$@<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">@' \
    -e 's@</?st1:[Cc]ity[^>]*>@@g' \
    -e 's@</?st1:[Pp]lace[^>]*>@@g' \
    -e 's@</?st1:[Cc]ountry-region[^>]*>@@g' \
    /tmp/x.html |\
  tidy -q -indent -wrap 120 --tidy-mark no \
    --drop-empty-elements no \
    --drop-empty-paras no \
    --word-2000 yes --preserve-entities yes

Regards, Lee
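
The pipeline above handles a single file at /tmp/x.html. For a whole directory of Word exports, the same steps could be wrapped in a loop. This is only a sketch under a few assumptions: the function name `clean_word_pages` and its two directory arguments are made up for illustration, the doctype substitution is left out, and only the three tag names from the command above are stripped.

```shell
# Hypothetical batch wrapper around the sed | tidy pipeline shown above.
# Usage: clean_word_pages INPUT_DIR OUTPUT_DIR
clean_word_pages() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir" || return 1
  for page in "$in_dir"/*.html; do
    [ -e "$page" ] || continue   # nothing matched: the glob stays literal
    # Strip opening and closing smart tags, then let tidy clean up the rest.
    sed -E -e 's@</?st1:[Cc]ity[^>]*>@@g' \
           -e 's@</?st1:[Pp]lace[^>]*>@@g' \
           -e 's@</?st1:[Cc]ountry-region[^>]*>@@g' \
           "$page" |
      tidy -q -indent -wrap 120 --tidy-mark no \
           --drop-empty-elements no --drop-empty-paras no \
           --word-2000 yes --preserve-entities yes \
           > "$out_dir/$(basename "$page")"
    # tidy exits with 1 for warnings and 2 for errors; report only errors.
    [ $? -lt 2 ] || echo "tidy reported errors for $page" >&2
  done
}
```

Writing to a separate output directory keeps the original exports untouched, which makes it easy to compare before and after if tidy mangles a page.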

geoffmcl commented 4 years ago

@ivanov17 thank you for your issue... lots of error messages...

But no sample(s) to try, experiment, test, understand, begin, etc...

And, please, no, I do not want the full word outputs, nor pointers to html in the wild...

Just minimum, to be a word document, and word html snippets that have only one of the tag types...

With such test samples, we can begin to see if a fix is easy, in the extra code run IFF --word-2000 yes is in the config... that code has not been touched in quite a long time... and maybe just a few tweaks... or maybe not!

Or as @ler762 suggests, sed can remove things, before tidy sees it...

So, some small, relevant, sample(s) please... maybe with output expected... thanks...

ivanov17 commented 4 years ago

@ler762 Thank you. I used these regular expressions and got rid of the errors that occurred while processing my files:

sed -e ':again;$!N;$!b again; :b; s/<\/\?o:SmartTagType[^<>]*"\?>//g; t b'
sed -e ':again;$!N;$!b again; :b; s/<\/\?st[1-9]:[^<>]*"\?>//g; t b'
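
The two commands above use GNU sed extensions (`\?` and several commands after a label inside one expression). A roughly equivalent single-pass filter, written with `sed -E` and separate `-e` expressions so that it has a better chance of working on other sed implementations, might look like the sketch below. The function name `strip_smart_tags` is made up for illustration, and the pattern assumes that smart-tag prefixes have the form `st` plus digits and that no `<` or `>` occurs inside a tag's attribute values.

```shell
# Hypothetical filter: remove <o:SmartTagType ...> declarations and all
# <stN:...> / </stN:...> smart tags, keeping the text they wrap.
# Reads HTML on stdin, writes the filtered HTML to stdout.
strip_smart_tags() {
  # ':a', '$!N', '$!ba' collect the whole input in the pattern space, so
  # tags that Word splits across two lines are matched as well.
  sed -E -e ':a' -e '$!N' -e '$!ba' \
      -e 's@</?[Oo]:[Ss]mart[Tt]ag[Tt]ype[^<>]*>@@g' \
      -e 's@</?[Ss][Tt][0-9]+:[^<>]*>@@g'
}
```

Because the whole file is held in memory at once, this is meant for ordinary Word exports rather than very large inputs. A possible use would be `strip_smart_tags < page.html | tidy --word-2000 yes --bare yes`, where `page.html` stands for one of the Word exports.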

I also downloaded and processed the file linked in your answer, and my regexps worked perfectly with it. I used the options --word-2000 yes and --bare yes. But when I added --new-inline-tags st1:city,st1:place,st1:country-region, tidy kept these tags inside the body of the document, encoded some of the angle brackets inside the head, and moved some of the meta tags and the title tag to the body:

<head>
  <meta name="generator" content="HTML Tidy for HTML5 for Linux version 5.6.0">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <meta http-equiv="keywords" name="keywords" content="Steam in China, &lt;st1:place w:st=">
  <title></title>
</head>
<body>
  <st1:city w:st="on">Sandaoling</st1:city>, Fuxin, class JS, 2-8-2 class SY, 2-8-2"&gt;
  <meta name="DESCRIPTION" content="FarRail Tours (+49-177-56 13 999) offers tours to the most interesting railway lines of the world. Tours for photographers and video film makers.">
  <meta name="PAGE-TOPIC" content="Steam in China, &lt;st1:place w:st="><st1:city w:st="on">Sandaoling</st1:city>, Fuxin, railway tours"&gt;
  <meta name="PAGE-TYPE" content="Visit to the last regular steam trains in China">
  <meta name="AUTHOR" content="Bernd Seiler">
  <meta name="PUBLISHER" content="FarRail Tours">
  <meta name="COPYRIGHT" content="Bernd Seiler, FarRail Tours">
  <meta name="ROBOTS" content="INDEX,FOLLOW">
  <title>Last Real Steam in China 2017:</title><st1:place w:st="on"><st1:city w:st="on">Sandaoling</st1:city></st1:place> and Fuxin
  <link rel="stylesheet" href="../../formate/standard-1.css">

ivanov17 commented 4 years ago

@geoffmcl

No problem. Here are fragments of a file that Tidy has problems with:

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<title>XVII Congress of Association of Anarchist Movements</title>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"
 name="country-region"/>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"
 name="City"/>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"
 name="place"/>

Word's XML data and the