mazira / rtf-stream-parser

Contains native Node classes for transforming an RTF byte stream into tokens, and de-encapsulating HTML
MIT License

\ansicpg0 throws error "text with no codepage" #12

Open aghster opened 2 years ago

aghster commented 2 years ago

Hello,

rtf-stream-parser throws an error "text with no codepage" when I try to decode an email that contains RTF starting with {\\rtf1\\ansi\\ansicpg0\\fromhtml1\\deff0{\\fonttbl\r\n{\\f0\\fswiss\\fcharset0 Arial;}\r\n{\\f1\\fmodern\\fcharset0 Courier New;}\r\n{\\f2\\fnil\\fcharset0 Symbol;}\r\n{\\f3\\fmodern\\fcharset0 Courier New;}}\r\n\\uc1\\pard\\plain\\deftab360 \\f0\\fs24\r\n{\\*\\htmltag0 <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">}.

As I understand it, this is caused by \ansicpg0, because 0 is not a valid codepage. However, I suggest that in such cases rtf-stream-parser should fall back to a default codepage instead of throwing an error and aborting. A simple solution would be to change the following lines https://github.com/mazira/rtf-stream-parser/blob/3ec37609e256c0a0a91649f4145e44ed91e33003/src/ProcessTokens.ts#L111-L113 to

            const cpg = font
                ? font.cpg || font.fcharsetCpg || this._cpg || 1252
                : this._cpg || 1252;

or simpler to

            const cpg = (font && (font.cpg || font.fcharsetCpg)) || this._cpg || 1252;

I chose 1252 as the default codepage. Ideally, though, the default codepage would not be hard-coded but could be set as an option ...
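A configurable fallback along these lines could be sketched as follows. This is only an illustration under stated assumptions: `FontInfo`, `FallbackOptions`, `defaultCodepage`, and `resolveCodepage` are hypothetical names and are not part of rtf-stream-parser's API, and `docCpg` stands in for `this._cpg`.

```typescript
// Hypothetical shape of the per-font data; the real type in the library may differ.
interface FontInfo {
    cpg?: number;
    fcharsetCpg?: number;
}

// Hypothetical option object; `defaultCodepage` is an assumed name.
interface FallbackOptions {
    defaultCodepage?: number;
}

function resolveCodepage(
    font: FontInfo | undefined,
    docCpg: number | undefined,
    options: FallbackOptions = {}
): number {
    // Use 1252 only when the caller did not supply a fallback.
    const fallback = options.defaultCodepage ?? 1252;
    // `||` skips both undefined and 0, so \ansicpg0 ends up at the fallback.
    return (font && (font.cpg || font.fcharsetCpg)) || docCpg || fallback;
}
```

With this sketch, `resolveCodepage(undefined, 0)` falls back to 1252, while `resolveCodepage(undefined, 0, { defaultCodepage: 437 })` uses the caller's choice. A codepage set on the font still takes precedence over both.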

rossj commented 2 years ago

Thanks for submitting. Are you able to share the complete email or RTF? You can email it to support@goldfynch.com.

Regarding the fix, I think the issue may be that the code page is actually set (to 0), but the check on line 120 is checking for a truthy value instead of a defined value. If the check was changed to check that cpg is defined, then your decode callback would get "cp0" as the encoding and you could then handle as you see fit. What do you think?
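The difference between a truthy check and a defined check can be sketched as below. The function names and the `Decode` type are hypothetical stand-ins written for this illustration; they are not copied from ProcessTokens.ts, and the real decode callback signature may differ.

```typescript
const NO_CODEPAGE = "text with no codepage";

// Truthy check: 0 is falsy, so \ansicpg0 is rejected just like a missing codepage.
function encodingIfTruthy(cpg: number | undefined): string {
    if (!cpg) {
        throw new Error(NO_CODEPAGE);
    }
    return "cp" + cpg;
}

// Defined check: only a missing codepage is rejected; 0 yields "cp0".
function encodingIfDefined(cpg: number | undefined): string {
    if (cpg === undefined) {
        throw new Error(NO_CODEPAGE);
    }
    return "cp" + cpg;
}

// Assumed callback shape: bytes plus an encoding name in, decoded text out.
type Decode = (buf: Uint8Array, enc: string) => string;

// Wraps a decode callback so that "cp0" is replaced by a caller-chosen encoding.
function withCp0Fallback(decode: Decode, fallback: string = "cp1252"): Decode {
    return (buf, enc) => decode(buf, enc === "cp0" ? fallback : enc);
}
```

With the defined check, the choice of what to do about "cp0" moves to the caller, for example by wrapping an existing decode function with `withCp0Fallback`.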

aghster commented 2 years ago

Thank you for your quick answer!

> Are you able to share the complete email or RTF?

Unfortunately, I cannot share the complete email or RTF and I don't have another example in which the RTF contains \ansicpg0.

> Regarding the fix, I think the issue may be that the code page is actually set (to 0), but the check on line 120 is checking for a truthy value instead of a defined value. If the check was changed to check that cpg is defined, then your decode callback would get "cp0" as the encoding and you could then handle as you see fit. What do you think?

Thank you for your alternative suggestion for fixing the issue. You're certainly right that the error should only be thrown if cpg is undefined, and that 0 should be treated like any other codepage number. So if the check in line 121 is changed to test whether cpg is defined, I would consider this issue fixed.

Nevertheless, I still think that having an option to define a fallback codepage would in general be useful.