Open ilappe opened 1 month ago
Hi @seamustuohy
Sample RTF which reproduces the issue:
{\rtf1\ansi\ansicpg936\fromhtml1\fbidis\deff0
{\fonttbl
{\f0\fswiss\fcharset134 Simsun;}
{\f1\fmodern Simsun;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
}
{\colortbl\red0\green0\blue0;\red0\green0\blue255;\red71\green71\blue71;\red255\green255\blue255;}
\uc1\pard\plain\deftab360\f0\fs24
{\*\htmltag243 <!DOCTYPE HTML>}
{\*\htmltag19 <html lang="en-US">}\htmlrtf\lang1033\htmlrtf0
{\*\htmltag19 <body>}
{\*\htmltag1 \par This is a sample RTF document with embedded HTML content.}
{\*\htmltag1 \par The following is a list with middle dots:}
{\*\htmltag1 \par }
{\*\htmltag1 \par \pard\plain\f0\fs24
\htmlrtf \pard\plain\f0\fs24 \htmlrtf0\htmltag34 <ul>\htmltag1 \par
\htmltag34 <li>\htmltag1 Item 1 \htmltag34 \htmltag1 \'b7\htmltag34 \htmltag1</li>\htmltag1 \par
\htmltag34 <li>\htmltag1 Item 2 \htmltag34 \htmltag1 \'b7\htmltag34 \htmltag1</li>\htmltag1 \par
\htmltag34 <li>\htmltag1 Item 3 \htmltag34 \htmltag1 \'b7\htmltag34 \htmltag1</li>\htmltag1 \par
\htmltag34 </ul>\htmltag1 \par }
{\*\htmltag1 \par Here is a Chinese character: 汉 }
{\*\htmltag19 </body>}
{\*\htmltag19 </html>}
}
I don't know if this is a bug, a badly mangled (or actively misconstructed) .MSG file, or an expected edge case in the MSFT-RTF or MS-OXRTFEX specifications. I can't find it in the specifications so I'm assuming it is a mangled .msg file.
I am very wary of adding in any fallback functionality that is not explicitly called for in the specification. My default is to have this library fail closed when it encounters improperly formatted RTF so that its adherence to the specs can be relied on by security tools.
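To make the fail-closed versus fail-open distinction concrete, here is a minimal Python sketch. It is not RTFDE's code: the function names and the cp1252 fallback list are hypothetical, chosen only to illustrate the two policies on the byte from this report.

```python
def decode_fail_closed(raw: bytes, declared_codec: str) -> str:
    # Strict decoding: any byte sequence that is invalid in the declared
    # codec raises UnicodeDecodeError instead of being guessed at.
    return raw.decode(declared_codec, errors="strict")


def decode_fail_open(raw: bytes, declared_codec: str, fallbacks=("cp1252",)) -> str:
    # Hypothetical fallback policy (the behaviour the maintainer is wary of):
    # try the declared codec first, then silently try other codecs.
    for codec in (declared_codec, *fallbacks):
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Last resort: substitute U+FFFD for anything undecodable.
    return raw.decode(declared_codec, errors="replace")
```

With the fail-open variant, the same input can yield different text depending on which fallback happens to succeed, which is hard for a security tool to reason about.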
If you can identify where in the MSFT-RTF specification it says that decoding should fall back past the character sets defined in the header, or find some other element of the de-encapsulation in the MS-OXRTFEX specification that I am implementing incorrectly and that would cause this to fail open, then I will happily update the library to support it.
Details about why RTFDE is making the decisions it is making:
- If the header had declared \ansicpg1252 as the code page to use to decode \'b7, it would have decoded properly.
- The header declares one of the character sets (\ansi, \mac, \pc, \pca). If this document had just used \ansi, it would have defaulted to \ansicpg1252 and decoded these correctly.
- But the header does not declare \ansicpg1252. It declares \ansicpg936, which is code page 936 (the gbk codec).
- The default font is set to \f0 through the \deff0 control word. \f0 is set to use the GB2312_CHARSET. That also translates to code page 936.
- The font in effect for this text is \f0.
- So when we reach the \'b7 hex-encoded char, we have been told explicitly that we need to decode this byte using the 'gbk' codec.
- The character could instead have been written as a Unicode character using \ucN or \uN. But that was not done here.
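The reasoning above can be sketched in Python. This is an illustration only, not RTFDE's implementation: the lookup tables cover just the values that appear in the sample, and `codec_for_font` and `rtf_unicode_escape` are hypothetical helper names.

```python
# Hypothetical lookup tables covering only the values in the sample RTF.
FCHARSET_TO_CODEPAGE = {0: 1252, 134: 936}  # 0 = ANSI_CHARSET, 134 = GB2312_CHARSET
CODEPAGE_TO_CODEC = {1252: "cp1252", 936: "gbk"}


def codec_for_font(fcharset, ansicpg):
    # Prefer the code page implied by the font's \fcharsetN; if the font
    # declares none we know of, use the header's \ansicpgN value.
    codepage = FCHARSET_TO_CODEPAGE.get(fcharset, ansicpg)
    return CODEPAGE_TO_CODEC[codepage]


def rtf_unicode_escape(char, fallback="?"):
    # \uN takes a signed 16-bit decimal value; with \uc1 in effect it is
    # followed by one fallback character for readers without Unicode support.
    code = ord(char)
    if code > 0xFFFF:
        raise ValueError("characters outside the BMP are not handled in this sketch")
    if code > 32767:
        code -= 65536
    return f"\\u{code}{fallback}"


# \f0 in the sample is \fcharset134 and the header says \ansicpg936.
codec = codec_for_font(fcharset=134, ansicpg=936)

try:
    b"\xb7".decode(codec)
    decode_error = None
except UnicodeDecodeError as exc:
    # 0xb7 is a lead byte in gbk, so on its own it is an incomplete sequence.
    decode_error = exc.reason

# Under cp1252 the same byte is the middle dot (U+00B7).
middle_dot = b"\xb7".decode("cp1252")
```

Under these assumptions, `rtf_unicode_escape(middle_dot)` gives the escape a producer could have written in place of the bare \'b7 byte.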
In a .msg file that is identified as encoded with gbk, I have seen the following error:
'gbk' codec can't decode byte 0xb7 in position 0: incomplete multibyte sequence
Trace:
when I use the following code