mazira / rtf-stream-parser

Contains native Node classes for transforming an RTF byte stream into tokens, and de-encapsulating HTML
MIT License

Some pretty big issues #4

Closed TheElementalOfDestruction closed 5 years ago

TheElementalOfDestruction commented 6 years ago

Sometimes the rtf file is generated in a different way from normal, and that causes A LOT of issues. Certain special characters don't get written to the html file properly, some information that shouldn't be in the html file gets put in (sometimes even as raw text), formatting gets messed up in numbered/bulleted lists, etc. I believe I have figured out the exact causes, but I am not absolutely sure.

  1. Sometimes text gets pulled that shouldn't. For example, this part of an rtf file: {\*\formatConverter converted from html;} should NOT be in the html file. However, it gets written as plain text as "converted from html;". That one should be simple enough; it seems to just be a tag that isn't handled correctly.

  2. Sometimes special characters are represented differently in the encapsulated html. For example, normally text with an apostrophe is written something like this:
    {\*\htmltag64 <p class=MsoNormal>}\htmlrtf {\htmlrtf0 Here is the test e-mail for today
    {\*\htmltag84 &#8217;}\htmlrtf \'92\htmlrtf0 s test

so that the de-encapsulated output should be

<p class=MsoNormal>Here is the test e-mail for today&#8217;s test

And would be displayed as

Here is the test e-mail for today’s test

Unfortunately, the code may look something like this:

\htmlrtf0{\*\htmltag0 <p class=MsoNormal>}\pard\plain\htmlrtf{\f2\lang1033\fs22\htmlrtf0 Lastly, they\'92ll {\*\htmltag0 <o:p></o:p></P>}\htmlrtf

(((I cut out some of the raw text from the example. None of it contained any code, and was definitely not important to this issue.)))

Now, the output will be

<p class=MsoNormal>Lastly, they’ll

but it probably should show up as

<p class=MsoNormal>Lastly, they&#8217;ll

As you can see, in the actual html code, it shows up as the proper character. However, because of this line in the header:

<meta http-equiv=Content-Type content="text/html; charset=us-ascii">

the character is not read as UTF-8 as it should be, and displays incorrectly instead of as Lastly, they’ll

This is the case for bullet points as well, as they are UTF-8 characters. The best way to solve this would be to replace characters written as \'hh (where h is a hexadecimal digit) with their html escaped versions. That way the character set doesn't have to be modified to make these characters work. Since your code successfully converts the character to UTF-8, I would recommend taking that character and converting it to unicode. Once there, take the character's decimal value, convert it to a string, add a "&#" to the beginning, and then add a ";" to the end so that it is html escaped.
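The steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not code from rtf-stream-parser: the function name `escape_non_ascii` is made up, and the example assumes the RTF declares code page 1252, so that \'92 is the byte 0x92 in Windows-1252.

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a decimal HTML character reference."""
    # ord() gives the Unicode code point; wrap its decimal form in "&#" and ";".
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


# Assumption: the document uses \ansicpg1252, so \'92 decodes to U+2019.
decoded = b"Lastly, they\x92ll".decode("cp1252")
print(escape_non_ascii(decoded))  # Lastly, they&#8217;ll
```

Because the result is pure ASCII, it displays the same way whatever charset the Content-Type meta tag declares.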


  3. "\pntext" seems to be improperly handled. According to the rtf specification:

This group precedes all numbered/bulleted paragraphs and contains all automatically generated text and formatting. It should precede the '{\*' \pn … '}' destination, and it is the responsibility of RTF readers that understand the '{\*' \pn … '}' destination to ignore this preceding group. This is a destination control word.

The current handling of this tag often causes each bullet to be doubled (normal bullets have a "*" instead of the actual bullet, but numbered lists have a double number).


That's all I have for now.

TheElementalOfDestruction commented 6 years ago

If you have already read this before the time of this comment, I would recommend re-reading it as I just made some significant modifications to it.

rossj commented 6 years ago

Hi there, thank you for taking the time to dig into these problems and create this write up. Here are my comments:

1. formatConverter

Reproduced and fix incoming. I'm surprised I didn't notice this one in action already. It seems that formatConverter is not mentioned in the spec as a destination, and that list of destinations is what I use to decide what text to ignore. Interestingly, generator is mentioned, which is why you don't also see Microsoft Exchange Server; in the output.
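For illustration, here is a rough Python sketch of a different way to decide what to ignore: treat any group that starts with \* as ignorable unless its destination is on a keep-list. This is not how the library works (it uses the spec's destination list, as described above). The names `KEEP`, `TOKEN` and `extract_text` are hypothetical, the tokenizer does not handle \'hh escapes, and the input is assumed to have balanced braces.

```python
import re

# Hypothetical keep-list: optional (\*) destinations whose text should survive.
KEEP = {"htmltag"}

TOKEN = re.compile(
    r"(?P<brace>[{}])"                    # group open / close
    r"|\\(?P<star>\*)"                    # the \* "ignorable destination" marker
    r"|\\(?P<word>[a-zA-Z]+)-?\d* ?"      # control word, optional number and space
    r"|\\(?P<esc>[{}\\])"                 # escaped literal brace or backslash
    r"|(?P<text>[^\\{}]+)"                # plain text run
)


def extract_text(rtf: str) -> str:
    out = []
    stack = []        # saved "skipping" flag of each enclosing group
    skipping = False  # True while inside an ignored destination
    at_start = False  # True until the group's first control word or text
    optional = False  # True if the group began with \*
    for m in TOKEN.finditer(rtf):
        if m["brace"] == "{":
            stack.append(skipping)
            at_start, optional = True, False
        elif m["brace"] == "}":
            skipping = stack.pop()
            at_start = False
        elif m["star"]:
            optional = at_start
        elif m["word"]:
            if at_start and optional and m["word"] not in KEEP:
                skipping = True  # unknown \* destination: drop its text
            at_start = False
        else:
            at_start = False
            if not skipping:
                out.append(m["esc"] or m["text"])
    return "".join(out)


print(extract_text(r"{\*\formatConverter converted from html;}{\*\htmltag64 <p>}Hi"))
```

With a rule like this, a \* destination that is missing from the spec's list would be dropped by default instead of leaking into the output as plain text.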

2. HTML encoding

I like your idea to avoid discrepancies with Content-Type header, but I need to think about it a bit more. As a contrived counter-example to the HTML-entity encode fix, {\*\htmltag64 <p'62} should be output as <p> not <p&#62;, so some range checking would be required.

On one hand, the output of the de-encapsulation is text, and assuming that the text conversion was done properly, the Content-Type header is irrelevant until the string is converted back to binary. In general, I think it should be the job of the HTML saving step to ensure that the header's encoding matches how the file is actually saved. One option might be to just output Buffers from the de-encapsulator, so that we can force utf-8 and set / overwrite the Content-Type header explicitly.
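As a rough sketch of that last option (in Python rather than Node, and not what the library actually does), the saving step could rewrite the declared charset and encode the bytes to match. The helper name `save_html_as_utf8` and the regular expression are assumptions made for this example, and only the first charset declaration in a meta tag is touched.

```python
import re

# Matches the charset value inside the first <meta ...> tag that declares one.
CHARSET = re.compile(r"""(<meta[^>]*\bcharset=["']?)[\w-]+""", re.IGNORECASE)


def save_html_as_utf8(html_text: str) -> bytes:
    """Declare utf-8 in the meta tag, then encode the text as utf-8."""
    fixed = CHARSET.sub(r"\g<1>utf-8", html_text, count=1)
    return fixed.encode("utf-8")


page = '<meta http-equiv=Content-Type content="text/html; charset=us-ascii">they\u2019ll'
print(save_html_as_utf8(page))
```

This keeps the declared encoding and the saved bytes in agreement without entity-encoding anything in the text itself.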

3. Lists

Fix incoming

TheElementalOfDestruction commented 6 years ago

I'm not even sure your contrived counter-example would be valid encapsulated html in the first place, and I'm actually looking at my documentation to see if I might be able to help by creating a better suggestion if it is valid. However, I did find an example that is completely valid. Take the following code as an example:

\viewkind5\viewscale100
\htmlrtf{\*\bkmkstart BM_BEGIN}\htmlrtf0{\*\htmltag64}{\*\htmltag0 <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--\par
/* Font Definitions */\par
@font-face\par
\'09\{font-family:Wingdings;\par
\'09panose-1:5 0 0 0 0 0 0 0 0 0;\}\par
@font-face\par
\'09\{font-family:Wingdings;\par
\'09panose-1:5 0 0 0 0 0 0 0 0 0;\}\par
@font-face\par
\'09\{font-family:Calibri;\par
\'09panose-1:2 15 5 2 2 2 4 3 2 4;\}\par
@font-face\par
\'09\{font-family:Tahoma;\par
\'09panose-1:2 11 6 4 3 5 4 4 2 4;\}\par
/* Style Definitions */\par

Notice the \'09 at the beginning of several of the lines? Those SHOULD be converted to their actual symbols (tabs). So how should the program determine which should and shouldn't be converted? Should it check which block the escape is in, or its value? Checking the value definitely wouldn't work if your contrived example is valid. Let's continue it and say that the text ALSO has a ">" that should be written as text instead of being part of a tag. In that case, one should be converted to ">" and the other should be &#62; (or &gt;).
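To make the "check which block it is in" option concrete, here is a small Python sketch. It assumes the parser already knows whether a \'hh escape sits inside an htmltag group; the function `render_hex_escape`, its parameters, and the default code page are all hypothetical.

```python
import html


def render_hex_escape(hex_digits: str, in_htmltag: bool, codepage: str = "cp1252") -> str:
    """Render one \\'hh escape according to the block it appears in."""
    char = bytes.fromhex(hex_digits).decode(codepage)
    if in_htmltag:
        # Inside an htmltag group the escape is raw HTML source: emit it as-is.
        return char
    # Outside, it is text content: escape markup characters and non-ASCII.
    escaped = html.escape(char, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(repr(render_hex_escape("09", in_htmltag=True)))   # tab inside the style block
print(render_hex_escape("3e", in_htmltag=True))         # ">" that closes a tag
print(render_hex_escape("3e", in_htmltag=False))        # ">" meant as text
```

The decision here depends only on context, so the same \'3e comes out as ">" in one place and as an escaped entity in the other.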

Sigh. It would be great if RTF were standardized to say "There is one and only one way to do this." Might also make it more resistant to attacks.

Side note: I'd love to help extend this program, but I know very little JavaScript. Unfortunately, my strong suit is Python :P

TheElementalOfDestruction commented 6 years ago

I hate to say it, but I can confirm that your convoluted example WOULD be considered as valid encapsulated html according to Microsoft Office 2016. I tested it myself (Not a fun process, let me tell you) and it properly converted \'3c ("<") and \'3e (">").

Also, keep in mind that the format is \'HH where H is a hexadecimal digit. Your example is valid as long as you actually have the character code in hexadecimal :P

rossj commented 5 years ago

I believe this issue has been addressed with the improvements and additional options in version 3.x.