Open GOTO10-DW opened 2 months ago
I assume you got the "unsupported code page" warning? This is the same issue as #9683. We can't really support all the legacy code pages; maybe there's a way to convert your document to Unicode prior to passing it to pandoc?
I got no warning when I converted the document. If there is no mixed text, the conversion runs fine.
OK, I jumped to conclusions. Actually it's \ansicpg1252, which we support, so the problem lies elsewhere...
Hm, cp1252 just has Latin characters: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
There's \fcharset178
which would probably tell us the meaning of the bytes of character data, if we had the proper lookup table. RTF spec just says "Specifies the character set of a font in the font table. Values for N are defined by Windows header files, and in the file RTFDEFS.H accompanying this document." but I can't find RTFDEFS.H.
In any case this might be out of scope, if it requires large lookup tables corresponding to fonts (see discussion of the other linked issue).
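For what it's worth, the \fcharset values appear to be the Windows character-set constants from wingdi.h, where 178 is ARABIC_CHARSET, usually paired with code page 1256. Here is a minimal sketch of that lookup in Python, not pandoc's actual code: the table is deliberately partial, and `FCHARSET_TO_CODEC` and `decode_font_bytes` are hypothetical names made up for illustration.

```python
# Partial, illustrative map from \fcharsetN to a Python codec name.
# The constant names in the comments are the Windows charset identifiers.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI_CHARSET
    161: "cp1253",  # GREEK_CHARSET
    162: "cp1254",  # TURKISH_CHARSET
    177: "cp1255",  # HEBREW_CHARSET
    178: "cp1256",  # ARABIC_CHARSET
    186: "cp1257",  # BALTIC_CHARSET
    204: "cp1251",  # RUSSIAN_CHARSET
    238: "cp1250",  # EASTEUROPE_CHARSET
}


def decode_font_bytes(data: bytes, fcharset: int) -> str:
    """Decode raw \\'hh bytes using the code page implied by the font's charset.

    Falling back to cp1252 for unknown charsets is an assumption of this sketch.
    """
    codec = FCHARSET_TO_CODEC.get(fcharset, "cp1252")
    return data.decode(codec, errors="replace")


# The first three \'hh bytes that the working sample pairs with \u1576, \u1575, \u1585.
sample = bytes([0xC8, 0xC7, 0xD1])
print(decode_font_bytes(sample, 178))  # read as Arabic (cp1256)
print(decode_font_bytes(sample, 0))    # read as Latin (cp1252)
```

Under cp1256 those bytes come out as the same code points the working sample gives in its \u escapes, while under cp1252 they become accented Latin letters, which would match the garbled output. A table this size only covers the single-byte Windows code pages; the East Asian charsets would need multi-byte handling.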
Thanks for looking into my problem. I can give you a working example if that helps. I'm not very familiar with RTF.
{\rtf1\ansi\deflang1031\ftnbj\uc1\deff0
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \froman Times New Roman;}{\f2 \fnil Century Gothic;}{\f3 \fmodern Courier New;}{\f4 \fswiss Arial;}{\f5 \froman \fcharset0 arial;}{\f6 \froman \fcharset178 arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\f1\fs20\cf0\cb1\chcbpat1\ulc0 Normal;}{\cs1\cf0\cb1\chcbpat1\ulc0 Default Paragraph Font;}}
{\*\revtbl{Unknown;}}
\paperw12240\paperh15840\margl1080\margr1080\margt1080\margb1080\headery720\footery720\htmautsp1\nogrowautofit\deftab720\formshade\fet4\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn12240\pghsxn15840\marglsxn1080\margrsxn1080\margtsxn1080\margbsxn1080\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs20\sb135\sa270\ql\sbauto1\saauto1\hich\f6\dbch\f6\loch\f6\fs24\rtlch\u1576 \'c8\u1726 \'aa\u1575 \'c7\u1585 \'d1\u1578 \'ca\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1740 \'3f\u1672 \'8f\u1740
\'3f\u1575 \'c7\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1575 \'c7\u1608 \'e6\u1585 \'d1\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1740 \'3f\u1608 \'e6\u1586 \'d2\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1575
\'c7\u1740 \'3f\u1580 \'cc\u1606 \'e4\u1587 \'d3\u1740 \'3f\u1608 \'e6\u1722 \'9f\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1705 \'98\u1746 \'ff\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1591 \'d8\u1575
\'c7\u1576 \'c8\u1602 \'de\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1740 \'3f\u1729 \'c0\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1574 \'c6\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1578
\'ca\u1601 \'dd\u1578 \'ca\u1740 \'3f\u1588 \'d4\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1605 \'e3\u1575 \'c7\u1604 \'e1\u1740 \'3f\u1575 \'c7\u1578 \'ca\u1740 \'3f\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1580
\'cc\u1585 \'d1\u1575 \'c7\u1574 \'c6\u1605 \'e3\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1662 \'81\u1585 \'d1\hich\f5\dbch\f5\loch\f5\ltrch \hich\f6\dbch\f6\loch\f6\rtlch\u1606 \'e4\u1711 \'90\u1575 \'c7\u1729 \'c0\hich\f5\dbch\f5\loch\f5\ltrch
\hich\f6\dbch\f6\loch\f6\rtlch\u1585 }
The one that works contains unicode escapes to back up the single-byte font characters; that's why it works. I think there might be programs that will unicodify an existing RTF document -- maybe Word can do this? You could look into it.
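If no existing tool does it, a small preprocessing script might be enough. The following is only a rough sketch in Python under strong assumptions: every high-byte \'hh escape in the file belongs to one code page (cp1256 here), the document uses \uc1, and no escaped backslash sits directly in front of a \'hh sequence. The function name `unicodify` is hypothetical.

```python
import re

# Matches either an escape that already has a \uN in front of it (kept as is)
# or a bare \'hh escape (candidate for conversion).
_ESCAPE = re.compile(r"(\\u-?\d+ ?\\'[0-9a-fA-F]{2})|\\'([0-9a-fA-F]{2})")


def unicodify(rtf: str, codec: str = "cp1256") -> str:
    """Prefix bare high-byte \\'hh escapes with the matching \\uN escape."""

    def repl(m: re.Match) -> str:
        if m.group(1):
            # Already backed by a Unicode escape; leave untouched.
            return m.group(1)
        hex_byte = m.group(2)
        value = int(hex_byte, 16)
        if value < 0x80:
            # ASCII-range bytes mean the same thing in every Windows code page.
            return m.group(0)
        try:
            ch = bytes([value]).decode(codec)
        except UnicodeDecodeError:
            # Byte is undefined in this code page; keep the original escape.
            return m.group(0)
        n = ord(ch)
        if n > 32767:
            # RTF \uN takes a signed 16-bit value.
            n -= 65536
        return f"\\u{n} \\'{hex_byte}"

    return _ESCAPE.sub(repl, rtf)


print(unicodify(r"\'c8\'c7\'d1"))
```

A real converter would have to track the current font (\fN) and its \fcharset instead of assuming one code page for the whole file. It also cannot recover characters the writer already replaced with \'3f, since that byte is just a question mark.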
Explain the problem. Converting from RTF to HTML produces garbled characters. I'm not sure whether this is similar to https://github.com/jgm/pandoc/issues/9683. I use this command line:
pandoc.exe input.rtf --metadata title=" " -f rtf -t html -s -o output.html
RTF (Input)
HTML (Output)
![%pn_XSgaaJ7lOd](https://github.com/jgm/pandoc/assets/94518824/cafbd513-f0c0-4295-8036-ca1329b06444)
Pandoc version? Pandoc 3.2 on Windows Server 2016