Problem with `0xb7` byte in MSG file

ilappe commented 1 month ago

In a .msg file identified as GBK-encoded, I get the following error: `'gbk' codec can't decode byte 0xb7 in position 0: incomplete multibyte sequence`. It occurs when I use the following code:

from RTFDE.deencapsulate import DeEncapsulator
rtf_obj = DeEncapsulator(rtf_body)
rtf_obj.deencapsulate()

ilappe commented 1 month ago

Hi @seamustuohy

Sample RTF which reproduces the issue:

{\rtf1\ansi\ansicpg936\fromhtml1\fbidis\deff0
{\fonttbl
{\f0\fswiss\fcharset134 Simsun;}
{\f1\fmodern Simsun;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
}
{\colortbl\red0\green0\blue0;\red0\green0\blue255;\red71\green71\blue71;\red255\green255\blue255;}
\uc1\pard\plain\deftab360\f0\fs24
{\*\htmltag243 <!DOCTYPE HTML>}
{\*\htmltag19 <html lang="en-US">}\htmlrtf\lang1033\htmlrtf0
{\*\htmltag19 <body>}
{\*\htmltag1 \par This is a sample RTF document with embedded HTML content.}
{\*\htmltag1 \par The following is a list with middle dots:}
{\*\htmltag1 \par }
{\*\htmltag1 \par \pard\plain\f0\fs24
\htmlrtf \pard\plain\f0\fs24 \htmlrtf0\htmltag34 <ul>\htmltag1 \par 
\htmltag34 <li>\htmltag1 Item 1 \htmltag34 \htmltag1 \'b7\htmltag34 \htmltag1</li>\htmltag1 \par 
\htmltag34 <li>\htmltag1 Item 2 \htmltag34 \htmltag1 \'b7\htmltag34 \htmltag1</li>\htmltag1 \par 
\htmltag34 <li>\htmltag1 Item 3 \htmltag34 \htmltag1 \'b7\htmltag34 \htmltag1</li>\htmltag1 \par 
\htmltag34 </ul>\htmltag1 \par }
{\*\htmltag1 \par Here is a Chinese character: 汉 }
{\*\htmltag19 </body>}
{\*\htmltag19 </html>}
}

seamustuohy commented 1 month ago

I don't know if this is a bug, a badly mangled (or actively misconstructed) .MSG file, or an expected edge case in the MSFT-RTF or MS-OXRTFEX specifications. I can't find it in the specifications so I'm assuming it is a mangled .msg file.

I am very wary of adding in any fallback functionality that is not explicitly called for in the specification. My default is to have this library fail closed when it encounters improperly formatted RTF so that its adherence to the specs can be relied on by security tools.
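
To illustrate what failing closed means here, the sketch below contrasts strict decoding with a lenient fallback, using only Python's built-in codecs. It is not RTFDE code; `decode_fail_closed` and `decode_fail_open` are hypothetical helper names invented for this illustration, and the `cp1252` fallback is an assumed guess, not something the specifications call for.

```python
def decode_fail_closed(raw: bytes, codec: str) -> str:
    """Decode strictly: any byte the declared codec cannot handle raises."""
    return raw.decode(codec, errors="strict")


def decode_fail_open(raw: bytes, codec: str, fallback: str = "cp1252") -> str:
    """Hypothetical lenient variant: silently retry with a guessed code page."""
    try:
        return raw.decode(codec, errors="strict")
    except UnicodeDecodeError:
        # The fallback code page is a guess the document header never declared.
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    try:
        decode_fail_closed(b"\xb7", "gbk")
    except UnicodeDecodeError as err:
        print("fail closed:", err)
    print("fail open:", decode_fail_open(b"\xb7", "gbk"))
```

The lenient variant returns text decoded with a code page the header never declared, which is the kind of behavior a security tool could not rely on.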

If you can identify where in the MSFT-RTF specification it says that decoding should fall back past the character sets defined in the header, or find some other element of the de-encapsulation in the MS-OXRTFEX specification which I am doing incorrectly and which would cause this to fail open, then I will happily update the library to support it.

Details about why RTFDE is making the decisions it is making:

If this file and its font definitions had set \ansicpg1252 as the code page used to decode \'b7, it would have decoded properly.
When an encapsulated RTF file is created, it can skip including the code page and use the default set by the document's character set (\ansi, \mac, \pc, \pca). If this document had just used \ansi, it would have defaulted to \ansicpg1252 and decoded these correctly.
Many devices set the default code page for the run-time environment of the computer in use. For a U.S. Windows install this is \ansicpg1252.
The header sets the default ANSI code page in this document to \ansicpg936, which is code page 936 (the gbk codec).
The header then defines the default font to use font \f0 through the \deff0 control word.
In the font-table the font \f0 is set to use the GB2312_CHARSET. That also translates to code page 936.
The last relevant font change in the doc is on line 16 of the RTF file, where the font is explicitly set to \f0.
So, when we get to the first \'b7 hex-encoded char, we have been told explicitly that we need to decode this byte using the 'gbk' codec.
- NOTE: A file that wanted to use this character without using a different font could properly get around this by encoding those characters using the Unicode RTF format \ucN or \uN. But that was not done here.
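
The chain above can be checked with Python's standard codecs alone. The sketch below is not RTFDE's implementation: the `CHARSET_TO_CODEC` table and the `rtf_unicode_escape` helper are hypothetical names for this illustration, and the table covers only the two `\fcharset` values relevant here. It shows the `0xb7` byte failing under code page 936, decoding as a middle dot under code page 1252, and how the same character could be written as a `\uN` escape (assuming `\uc1`, as in the sample).

```python
# Partial, illustrative mapping of \fcharsetN values to Python codec names.
CHARSET_TO_CODEC = {
    0: "cp1252",   # ANSI_CHARSET
    134: "gbk",    # GB2312_CHARSET -> code page 936
}


def rtf_unicode_escape(text: str, fallback: str = "?") -> str:
    """Encode text as RTF \\uN escapes (signed 16-bit values), assuming \\uc1."""
    out = []
    for ch in text:
        # Characters outside the BMP become two UTF-16 surrogate code units.
        data = ch.encode("utf-16-le")
        for i in range(0, len(data), 2):
            unit = int.from_bytes(data[i:i + 2], "little")
            if unit > 32767:  # \uN takes a signed 16-bit number
                unit -= 65536
            out.append(f"\\u{unit}{fallback}")
    return "".join(out)


raw = b"\xb7"  # the byte written as \'b7 in the sample

try:
    raw.decode(CHARSET_TO_CODEC[134])
except UnicodeDecodeError as err:
    print(err)  # gbk treats 0xb7 as a lead byte that needs a trail byte

print(raw.decode(CHARSET_TO_CODEC[0]))  # middle dot under cp1252
print(rtf_unicode_escape("\u00b7"))     # \u183?
```

Written as `\u183?`, the middle dot no longer depends on the font's code page, so the `\f0` / `\ansicpg936` settings would not affect it.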

seamustuohy / RTFDE

Problem with `0xb7` byte in MSG file #33