seamustuohy / RTFDE

RTFDE: RTF De-Encapsulator - A python3 library for extracting encapsulated `HTML` & `plain text` content from the `RTF` bodies of .msg files.
GNU Lesser General Public License v3.0

Problem with `0xb7` byte in MSG file #33

Open ilappe opened 1 month ago

ilappe commented 1 month ago

In a .msg file that is identified as `gbk`-encoded, I get the following error: `'gbk' codec can't decode byte 0xb7 in position 0: incomplete multibyte sequence`. It occurs when I use the following code:

from RTFDE.deencapsulate import DeEncapsulator

# rtf_body holds the RTF body extracted from the .msg file
rtf_obj = DeEncapsulator(rtf_body)
rtf_obj.deencapsulate()

ilappe commented 1 month ago

Hi @seamustuohy

Sample RTF which reproduce the issue

{\rtf1\ansi\ansicpg936\fromhtml1\fbidis\deff0
{\fonttbl
{\f0\fswiss\fcharset134 Simsun;}
{\f1\fmodern Simsun;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
}
{\colortbl\red0\green0\blue0;\red0\green0\blue255;\red71\green71\blue71;\red255\green255\blue255;}
\uc1\pard\plain\deftab360\f0\fs24
{\*\htmltag243 <!DOCTYPE HTML>}
{\*\htmltag19 <html lang="en-US">}\htmlrtf\lang1033\htmlrtf0
{\*\htmltag19 <body>}
{\*\htmltag1 \par This is a sample RTF document with embedded HTML content.}
{\*\htmltag1 \par The following is a list with middle dots:}
{\*\htmltag1 \par }
{\*\htmltag1 \par \pard\plain\f0\fs24
\htmlrtf \pard\plain\f0\fs24 \htmlrtf0\htmltag34 <ul>\htmltag1 \par 
\htmltag34 <li>\htmltag1 Item 1 \htmltag34 \htmltag1 \'b7\htmltag34 \htmltag1</li>\htmltag1 \par 
\htmltag34 <li>\htmltag1 Item 2 \htmltag34 \htmltag1 \'b7\htmltag34 \htmltag1</li>\htmltag1 \par 
\htmltag34 <li>\htmltag1 Item 3 \htmltag34 \htmltag1 \'b7\htmltag34 \htmltag1</li>\htmltag1 \par 
\htmltag34 </ul>\htmltag1 \par }
{\*\htmltag1 \par Here is a Chinese character: 汉 }
{\*\htmltag19 </body>}
{\*\htmltag19 </html>}
}
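
For reference, the error message itself can be reproduced with Python's built-in codecs, outside of RTFDE. This is only a minimal sketch, assuming the `\'b7` escape in the sample ends up being decoded on its own as a single byte with the `gbk` codec (code page 936, from `\ansicpg936`). It does not show RTFDE's internal code path.

```python
# Minimal reproduction of the codec error using only the standard library.
# Assumption: the \'b7 escape is decoded by itself as one byte under GBK.
lone_byte = b"\xb7"

try:
    lone_byte.decode("gbk")
    gbk_error = None
except UnicodeDecodeError as err:
    gbk_error = err

# In GBK, 0xB7 is the first byte of a two-byte character, so a single
# 0xB7 with nothing after it is an incomplete sequence.
print(gbk_error)

# For comparison, the same byte is a middle dot (U+00B7) in Windows-1252.
middle_dot = lone_byte.decode("cp1252")
print(middle_dot)
```

The comparison with `cp1252` is included only because the sample text describes the list as having middle dots. It is not a claim about which code page the original sender intended.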
seamustuohy commented 1 month ago

I don't know if this is a bug, a badly mangled (or actively misconstructed) .MSG file, or an expected edge case in the MSFT-RTF or MS-OXRTFEX specifications. I can't find it in the specifications so I'm assuming it is a mangled .msg file.

I am very wary of adding in any fallback functionality that is not explicitly called for in the specification. My default is to have this library fail closed when it encounters improperly formatted RTF so that its adherence to the specs can be relied on by security tools.
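
To make the distinction concrete, here is a rough sketch of the two behaviours using plain Python decoding. The function names `decode_fail_closed` and `decode_fail_open` and the `cp1252` fallback are hypothetical and exist only for illustration. Neither function is RTFDE's actual code or API.

```python
def decode_fail_closed(data: bytes, codec: str) -> str:
    """Raise UnicodeDecodeError on anything the declared codec cannot decode."""
    return data.decode(codec, errors="strict")


def decode_fail_open(data: bytes, codec: str, fallback: str = "cp1252") -> str:
    """Hypothetical fallback: guess another codec instead of raising.

    This is the kind of behaviour that is NOT called for by the
    specification, shown here only for contrast.
    """
    try:
        return data.decode(codec, errors="strict")
    except UnicodeDecodeError:
        # The caller can no longer tell that the declared codec was wrong.
        return data.decode(fallback, errors="replace")


try:
    decode_fail_closed(b"\xb7", "gbk")
    closed_raised = False
except UnicodeDecodeError:
    closed_raised = True

open_result = decode_fail_open(b"\xb7", "gbk")
```

With the fail-open variant, a tool consuming the output has no signal that the input did not match its declared encoding, which is the property a security tool would want to keep.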

If you can identify where in the MSFT-RTF specification it says that decoding should fall back past the character sets defined in the header, or find some other element of the de-encapsulation in the MS-OXRTFEX specification which I am doing incorrectly and which would cause this to fail open, then I will happily update the library to support it.
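
As a rough illustration of which character sets the sample's header actually declares, the snippet below pulls the `\ansicpg` and `\fcharset` values out of the header with regular expressions. It is a sketch for inspecting this one sample, not an RTF parser and not how RTFDE reads the header.

```python
import re

# Header of the sample RTF from this issue (raw string keeps the backslashes).
header = r"""{\rtf1\ansi\ansicpg936\fromhtml1\fbidis\deff0
{\fonttbl
{\f0\fswiss\fcharset134 Simsun;}
{\f1\fmodern Simsun;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
}"""

# Document-wide default code page from \ansicpgN.
default_codepage = int(re.search(r"\\ansicpg(\d+)", header).group(1))

# Map each font number to its \fcharsetN value, or None when none is declared.
font_charsets = {}
for entry in re.finditer(r"\{\\f(\d+)([^{}]*)\}", header):
    charset = re.search(r"\\fcharset(\d+)", entry.group(2))
    font_charsets[int(entry.group(1))] = int(charset.group(1)) if charset else None

print(default_codepage)
print(font_charsets)
```

In the sample, the list items are written while `\f0` is selected, and `\f0` declares `\fcharset134` (GB2312 in the RTF character set table), which is consistent with the `\'b7` byte being handed to a GBK-family decoder.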

Details about why RTFDE is making the decisions it is making: