jgm / pandoc

Universal markup converter
https://pandoc.org

Converting mixed urdu and english text RTF to HTML causes messed up characters #9758

Open GOTO10-DW opened 2 months ago

GOTO10-DW commented 2 months ago

Explain the problem. Converting from RTF to HTML produces garbled characters. I'm not sure whether this is related to https://github.com/jgm/pandoc/issues/9683.

I use this command line:

pandoc.exe input.rtf --metadata title=" " -f rtf -t html -s -o output.html

{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1031{\fonttbl{\f0\fswiss\fprq2\fcharset0 Calibri;}{\f1\fswiss\fprq2\fcharset178 Calibri;}{\f2\fnil\fcharset0 Arial;}}
{\*\generator Riched20 10.0.14393}\viewkind4\uc1 
\pard\rtlpar\widctlpar\qr\f0\fs22\line\f1\rtlch\lang1025\'c7\'e4\'d1\'cc\u1740? \'ca\'aa\'e4\'98 \'8a\u1740?\'e4\'98 \'c7\'e3\'c8\'d1 \'98\'ff \'e3\'d8\'c7\'c8\'de \'c0\'e6\'c7 \'d3\'ff \'c8\'cc\'e1\u1740? \'98\u1740? \'81\u1740?\'cf\'c7\'e6\'c7\'d1 \'98\'e6 \'c8\'9a\'aa\'c7\'e4\'ff \'e3\u1740?\'9f \u1740?\'e6\'d1\'81\u1740?\'e4 \u1740?\'e6\'e4\u1740?\'e4 \'98\'ff \'e3\'e3\'c7\'e1\'982022\f0\ltrch\lang1031  \~\f1\rtlch\lang1025\'e3\u1740?\'9f \'81\u1740?\'8d\'aa\'ff \'d1\'c0 \f0\ltrch\lang1031\par
\par

\pard\rtlpar\qr\f2\fs24\par

\pard\ltrpar\par
 Brussels (dpa) - European Union countries fell behind in 2022 on\par
expanding wind power generation, a study by the energy think tank\par
Ember found.\par
\par
} 

[Screenshots attached: RTF (input) and HTML (output)]

Pandoc version? Pandoc 3.2 on Windows Server 2016

jgm commented 2 months ago

I assume you got the "unsupported code page" warning? This is the same issue as #9683. We can't really support all the legacy code pages; maybe there's a way to convert your document to Unicode before passing it to pandoc?

GOTO10-DW commented 2 months ago

I got no warning when I converted the document. If there is no mixed text, the conversion runs fine.

jgm commented 2 months ago

OK, I jumped to conclusions. Actually the document declares \ansicpg1252, which we support, so the problem lies elsewhere...

jgm commented 2 months ago

Hm, cp1252 covers only Latin characters: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

There's \fcharset178 which would probably tell us the meaning of the bytes of character data, if we had the proper lookup table. RTF spec just says "Specifies the character set of a font in the font table. Values for N are defined by Windows header files, and in the file RTFDEFS.H accompanying this document." but I can't find RTFDEFS.H.
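To illustrate what such a lookup would do, here is a minimal Python sketch. It assumes that charset 178 is the Windows Arabic charset and corresponds to code page 1256, which is consistent with the byte/Unicode pairs in the working example further down the thread (for instance `\u1575 \'c7`). The table and the helper name `decode_hex_run` are hypothetical and are not part of pandoc.

```python
import re

# Assumed subset of the Windows charset -> code page table:
# 0 = ANSI (cp1252), 178 = Arabic (cp1256). Other charsets are omitted.
FCHARSET_TO_CODEC = {0: "cp1252", 178: "cp1256"}

# Matches RTF single-byte escapes such as \'c7
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")

def decode_hex_run(run: str, fcharset: int) -> str:
    """Decode a run of \\'hh escapes using the code page implied by fcharset."""
    raw = bytes(int(h, 16) for h in HEX_ESCAPE.findall(run))
    return raw.decode(FCHARSET_TO_CODEC[fcharset])

# First four escapes of the Urdu text in the failing document.
run = r"\'c7\'e4\'d1\'cc"
print(ascii(decode_hex_run(run, 0)))    # interpreted as cp1252: Latin letters
print(ascii(decode_hex_run(run, 178)))  # interpreted as cp1256: Arabic letters
```

The same four bytes come out as accented Latin letters under cp1252 and as Arabic letters under cp1256, which matches the symptom reported above.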

jgm commented 2 months ago

In any case this might be out of scope, if it requires large lookup tables corresponding to fonts (see discussion of the other linked issue).

GOTO10-DW commented 2 months ago

Thanks for looking into my problem. I can give you a working example if that helps. I'm not very familiar with RTF.

{\rtf1\ansi\deflang1031\ftnbj\uc1\deff0
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \froman Times New Roman;}{\f2 \fnil Century Gothic;}{\f3 \fmodern Courier New;}{\f4 \fswiss Arial;}{\f5 \froman \fcharset0 arial;}{\f6 \froman \fcharset178 arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\f1\fs20\cf0\cb1\chcbpat1\ulc0 Normal;}{\cs1\cf0\cb1\chcbpat1\ulc0 Default Paragraph Font;}}
{\*\revtbl{Unknown;}}
\paperw12240\paperh15840\margl1080\margr1080\margt1080\margb1080\headery720\footery720\htmautsp1\nogrowautofit\deftab720\formshade\fet4\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn12240\pghsxn15840\marglsxn1080\margrsxn1080\margtsxn1080\margbsxn1080\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs20\sb135\sa270\ql\sbauto1\saauto1\hich\f6\dbch\f6\loch\f6\fs24\rtlch\u1576 \'c8\u1726 \'aa\u1575 \'c7\u1585 \'d1\u1578 \'ca\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1740 \'3f\u1672 \'8f\u1740 
\'3f\u1575 \'c7\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1575 \'c7\u1608 \'e6\u1585 \'d1\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1740 \'3f\u1608 \'e6\u1586 \'d2\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1575 
\'c7\u1740 \'3f\u1580 \'cc\u1606 \'e4\u1587 \'d3\u1740 \'3f\u1608 \'e6\u1722 \'9f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1705 \'98\u1746 \'ff\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1591 \'d8\u1575 
\'c7\u1576 \'c8\u1602 \'de\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1740 \'3f\u1729 \'c0\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1574 \'c6\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1578 
\'ca\u1601 \'dd\u1578 \'ca\u1740 \'3f\u1588 \'d4\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1575 \'c7\u1604 \'e1\u1740 \'3f\u1575 \'c7\u1578 \'ca\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1580 
\'cc\u1585 \'d1\u1575 \'c7\u1574 \'c6\u1605 \'e3\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1662 \'81\u1585 \'d1\hich\f5\dbch\f5\loch\f5\ltrch  \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1711 \'90\u1575 \'c7\u1729 \'c0\hich\f5\dbch\f5\loch\f5\ltrch 
 \hich\f6\dbch\f6\loch\f6\rtlch\u1585  }


jgm commented 2 months ago

The example that works includes Unicode escapes (\uN) alongside the single-byte font characters; that's why it works. There may be programs that can "unicodify" an existing RTF document; maybe Word can do this? You could look into it.
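As a rough idea of what "unicodifying" could look like as a preprocessing step, here is a Python sketch that puts a `\uN` escape in front of every high `\'hh` byte and keeps the original byte as the fallback character, mirroring the layout of the working example. Everything in it is an assumption for illustration: the function name `unicodify` is hypothetical, only charsets 0 and 178 are mapped (to cp1252 and cp1256), RTF group scoping of font changes is ignored, and the output has not been tested against pandoc.

```python
import re

# Assumed subset of the Windows charset -> code page table.
CHARSET_TO_CODEC = {0: "cp1252", 178: "cp1256"}

# Font table entries such as {\f1\fswiss\fprq2\fcharset178 Calibri;}
FONT_DEF = re.compile(r"\\f(\d+)[^{};]*?\\fcharset(\d+)")

TOKEN = re.compile(
    # an existing \uN together with its one-character fallback (left untouched)
    r"(?P<uni>\\u-?\d+ ?[\r\n]*(?:\\'[0-9a-fA-F]{2}|[^\\{}])?)"
    r"|\\f(?P<font>\d+)"             # font switch, e.g. \f1
    r"|\\'(?P<hex>[0-9a-fA-F]{2})"   # single-byte escape, e.g. \'c7
)

def unicodify(rtf: str) -> str:
    """Prefix each high \\'hh byte with a \\uN escape chosen by the current font."""
    fonts = {int(f): int(cs) for f, cs in FONT_DEF.findall(rtf)}
    state = {"codec": "cp1252"}

    def repl(m):
        if m.group("uni"):
            return m.group(0)
        if m.group("font") is not None:
            charset = fonts.get(int(m.group("font")), 0)
            state["codec"] = CHARSET_TO_CODEC.get(charset, "cp1252")
            return m.group(0)
        byte = int(m.group("hex"), 16)
        if byte < 0x80:
            return m.group(0)  # ASCII range needs no Unicode escape
        n = ord(bytes([byte]).decode(state["codec"], errors="replace"))
        if n > 32767:
            n -= 65536  # RTF stores \u values as signed 16-bit integers
        return "\\u" + str(n) + m.group(0)

    return TOKEN.sub(repl, rtf)
```

With `\uc1` in effect, a reader that understands `\uN` skips the one fallback character that follows, so the original `\'hh` byte can stay in place. A hypothetical call would read the input file as text, run `unicodify` on it, and write the result to a new file before handing it to pandoc.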