Open ivanov17 opened 4 years ago
On 1/16/20, anarchist Ivanov notifications@github.com wrote:
Hello! I've used tidy 5.6.0 to process some old HTML pages that were made with MS Word in the good old days. I'd like to clean up the HTML code, so I ran tidy with the option --word-2000 yes. In general tidy does this job perfectly. But these pages contain MS-specific "smart tags", and that is where I have a problem: tidy doesn't process pages that contain <o:smarttagtype> tags and various <st1> tags such as <st1:state>, <st1:place>, and so on.
<.. snip ..>
I would like to get recommendations about it or a workaround for this issue.
It'd be nice if someone could say how '--new-inline-tags' was supposed to work.
I found this page https://www.farrail.com/pages/touren-engl/Steam-in-china-2017-Sandaoling+Fuxin.php that has various '<st1' tags, so I tried
tidy --new-inline-tags st1:city,st1:place,st1:country-region
and it mangled the file pretty badly.
Removing the tags and marking it as an html 4 doc worked for me:
sed -E -e '1,1 s@^$@<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">@' \
    -e 's@</?st1:[Cc]ity[^>]*>@@g' \
    -e 's@</?st1:[Pp]lace[^>]*>@@g' \
    -e 's@</?st1:[Cc]ountry-region[^>]*>@@g' \
    /tmp/x.html |\
  tidy -q -indent -wrap 120 --tidy-mark no \
    --drop-empty-elements no \
    --drop-empty-paras no \
    --word-2000 yes --preserve-entities yes
Regards, Lee
@ivanov17 thank you for your issue... lots of error messages...
But no sample(s) to try, experiment, test, understand, begin, etc...
And, please, no, I do not want the full word outputs, nor pointers to html in the wild...
Just minimum, to be a word document, and word html snippets that have only one of the tag types...
With such test samples, we can begin to see if a fix is easy, in the extra code run IFF --word-2000 yes is in the config... that code has not been touched in quite a long time... and maybe just a few tweaks... or maybe not!
Or as @ler762 suggests, sed can remove things, before tidy sees it...
So, some small, relevant, sample(s) please... maybe with output expected... thanks...
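A minimal sample of the kind requested here could be generated as in the sketch below. It is assembled from the fragments quoted in this thread rather than saved from Word itself, so treat the markup as an assumption; the file name smarttag-city.html is hypothetical. The tidy call is guarded so the script still runs where tidy is not installed.

```shell
#!/bin/sh
# Hypothetical file name; any path works.
sample=smarttag-city.html

# Smallest Word-style page with a single smart-tag type (st1:City).
# Hand-assembled from snippets in this thread, not real Word output.
cat > "$sample" <<'EOF'
<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta name=Generator content="Microsoft Word 10">
<title>Smart tag sample</title>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"
 name="City"/>
</head>
<body>
<p><st1:City w:st="on">Sandaoling</st1:City> and Fuxin</p>
</body>
</html>
EOF

# Run tidy only if it is installed. tidy exits with 1 on warnings and 2 on
# errors, so the status is ignored; messages go to a separate file.
if command -v tidy >/dev/null 2>&1; then
    tidy -q --word-2000 yes "$sample" \
        > "${sample%.html}.tidy.html" 2> "${sample%.html}.err.txt" || true
fi
```

Swapping st1:City for another tag name gives one sample per tag type, which keeps each failure isolated.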
@ler762 Thank you. I used these regular expressions and got rid of the errors that occurred while processing my files:
sed -e ':again;$!N;$!b again; :b; s/<\/\?o:SmartTagType[^<>]*"\?>//g; t b'
sed -e ':again;$!N;$!b again; :b; s/<\/\?st[1-9]:[^<>]*"\?>//g; t b'
I also downloaded and processed the file linked in your answer, and my regexps worked perfectly with it. I used the options --word-2000 yes and --bare yes.
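For reference, the two sed commands above can be folded into one reusable function. This is only a sketch of the same idea, assuming GNU sed (as shipped with Fedora); strip_smart_tags is a hypothetical name, not a tidy feature. It reads the whole input into the pattern space first, because Word often splits an <o:SmartTagType> declaration across two lines.

```shell
#!/bin/sh
# strip_smart_tags is a hypothetical helper name, not part of tidy.
# Reads HTML on stdin and writes it to stdout without Word smart-tag markup.
# The first three expressions slurp the whole input into the pattern space,
# so a tag that spans several lines is still matched by [^<>]*.
# The last two delete <o:SmartTagType ...> and <stN:...> open/close tags,
# leaving the text between them untouched.
strip_smart_tags() {
    sed -E \
        -e ':a' -e '$!N' -e '$!ba' \
        -e 's@</?[Oo]:[Ss]mart[Tt]ag[Tt]ype[^<>]*>@@g' \
        -e 's@</?st[0-9]+:[^<>]*>@@g'
}

# Demo: one declaration split across two lines, one nested pair of smart tags.
printf '%s\n' \
    '<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"' \
    ' name="City"/>' \
    '<p><st1:place w:st="on"><st1:City w:st="on">Sandaoling</st1:City></st1:place> and Fuxin</p>' |
    strip_smart_tags
```

The output could then be piped into tidy with the options mentioned here, for example strip_smart_tags < page.html | tidy -q --word-2000 yes --bare yes > clean.html, where page.html and clean.html are placeholder names.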
But when I added --new-inline-tags st1:city,st1:place,st1:country-region, tidy kept these tags inside the body of the document, encoded some of the angle brackets inside the head, and moved some of the meta tags and the title tag to the body:
<head>
<meta name="generator" content="HTML Tidy for HTML5 for Linux version 5.6.0">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta http-equiv="keywords" name="keywords" content="Steam in China, <st1:place w:st=">
<title></title>
</head>
<body>
<st1:city w:st="on">Sandaoling</st1:city>, Fuxin, class JS, 2-8-2 class SY, 2-8-2">
<meta name="DESCRIPTION" content="FarRail Tours (+49-177-56 13 999) offers tours to the most interesting railway lines of the world. Tours for photographers and video film makers.">
<meta name="PAGE-TOPIC" content="Steam in China, <st1:place w:st="><st1:city w:st="on">Sandaoling</st1:city>, Fuxin, railway tours">
<meta name="PAGE-TYPE" content="Visit to the last regular steam trains in China">
<meta name="AUTHOR" content="Bernd Seiler">
<meta name="PUBLISHER" content="FarRail Tours">
<meta name="COPYRIGHT" content="Bernd Seiler, FarRail Tours">
<meta name="ROBOTS" content="INDEX,FOLLOW">
<title>Last Real Steam in China 2017:</title><st1:place w:st="on"><st1:city w:st="on">Sandaoling</st1:city></st1:place> and Fuxin
<link rel="stylesheet" href="../../formate/standard-1.css">
@geoffmcl
No problem. Here are fragments of the file that Tidy has problems with:
<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<title>XVII Congress of Association of Anarchist Movements</title>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"
name="country-region"/>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"
name="City"/>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"
name="place"/>
Word's XML data and the
Hello! I've used tidy 5.6.0 to process some old HTML pages that were made with MS Word in the good old days. I'd like to clean up the HTML code, so I ran tidy with the option --word-2000 yes. In general tidy does this job perfectly. But these pages contain MS-specific "smart tags", and that is where I have a problem: tidy doesn't process pages that contain <o:smarttagtype> tags and various <st1> tags such as <st1:state>, <st1:place>, and so on.
All the errors I get contain the same things:
I tried other options such as --drop-proprietary-attributes yes and --bare yes, but I got the same errors.
I would like to get advice about it or a workaround for this issue.
Thanks.
OS: Fedora release 30 (Thirty) x86_64
Kernel: 5.4.7-100.fc30.x86_64
HTML Tidy for Linux version 5.6.0