Closed TheElementalOfDestruction closed 5 years ago
If you have already read this before the time of this comment, I would recommend re-reading it as I just made some significant modifications to it.
Hi there, thank you for taking the time to dig into these problems and create this write up. Here are my comments:
Reproduced and fix incoming. I'm surprised I didn't notice this one in action already. It seems that formatConverter is not mentioned in the spec as a destination, which is the list I use to decide what text to ignore. Interestingly, generator is mentioned, which is why you don't also see Microsoft Exchange Server; in the output.
I like your idea to avoid discrepancies with the Content-Type header, but I need to think about it a bit more. As a contrived counter-example to the HTML-entity encode fix, {\*\htmltag64 <p\'62} should be output as <p> not <p&#62;, so some range checking would be required.
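A minimal Python sketch of what such a range check might look like. It assumes a single-byte code page (cp1252 here), and the helper names are hypothetical, not taken from the parser. ASCII characters pass through untouched so markup such as < and > survives, while anything outside ASCII becomes a decimal numeric character reference.

```python
def escape_if_non_ascii(ch: str) -> str:
    """Leave ASCII alone; turn anything else into a decimal character reference."""
    code_point = ord(ch)
    if code_point < 0x80:
        return ch
    return f"&#{code_point};"


def decode_hex_byte(hex_digits: str, codepage: str = "cp1252") -> str:
    """Decode the two hex digits of one \\'HH escape, then apply the range check.

    Assumption: the code page is single-byte, so one escape is one character.
    """
    ch = bytes.fromhex(hex_digits).decode(codepage)
    return escape_if_non_ascii(ch)


print(decode_hex_byte("3e"))  # ASCII '>' stays as-is
print(decode_hex_byte("92"))  # cp1252 right single quote becomes &#8217;
```

Note that a check on the value alone cannot tell a > that closes a tag from a > that is meant as literal text.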
On one hand, the output of the de-encapsulation is text, and assuming that the text conversion was done properly, the Content-Type header is irrelevant until the string is converted back to binary. In general, I think it should be the job of this HTML saving step to ensure that the header's encoding matches how the file is actually saved. One option might be to just output Buffers from the de-encapsulator, so that we can force utf-8 and set / overwrite the Content-Type header explicitly.
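To illustrate the "force utf-8 and overwrite the header" option, here is a rough Python sketch. The function name and the regular expression are assumptions made for illustration; a real implementation would need to handle more meta tag variants (or a missing tag) than this does.

```python
import re

# Matches the charset value inside the first <meta ... charset=...> tag.
_META_CHARSET = re.compile(
    r'(<meta\b[^>]*\bcharset\s*=\s*["\']?)([\w.:-]+)', re.IGNORECASE
)


def save_as_utf8(html: str) -> bytes:
    """Rewrite the declared charset to utf-8 and encode the text to match it."""
    patched = _META_CHARSET.sub(lambda m: m.group(1) + "utf-8", html, count=1)
    return patched.encode("utf-8")


sample = (
    '<meta http-equiv=Content-Type content="text/html; charset=us-ascii">'
    "<p>they\u2019ll</p>"
)
print(save_as_utf8(sample))
```

The point of returning bytes is that the declared charset and the actual encoding are decided in one place, so they cannot drift apart.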
Fix incoming
I'm not even sure your contrived counter-example would be valid encapsulated html in the first place, and I'm actually looking at my documentation to see if I might be able to help by creating a better suggestion if it is valid. However, I did find an example that is completely valid. Take the following code as an example:
\viewkind5\viewscale100
\htmlrtf{\*\bkmkstart BM_BEGIN}\htmlrtf0{\*\htmltag64}{\*\htmltag0 <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--\par
/* Font Definitions */\par
@font-face\par
\'09\{font-family:Wingdings;\par
\'09panose-1:5 0 0 0 0 0 0 0 0 0;\}\par
@font-face\par
\'09\{font-family:Wingdings;\par
\'09panose-1:5 0 0 0 0 0 0 0 0 0;\}\par
@font-face\par
\'09\{font-family:Calibri;\par
\'09panose-1:2 15 5 2 2 2 4 3 2 4;\}\par
@font-face\par
\'09\{font-family:Tahoma;\par
\'09panose-1:2 11 6 4 3 5 4 4 2 4;\}\par
/* Style Definitions */\par
Notice the \'09 at the beginning of several of the lines? Those SHOULD be converted to their actual symbols (tabs). So how should the program determine which should and shouldn't be converted? Should it check what blocks it is in or the value? But the value definitely wouldn't work if your contrived example is valid. Let's continue it and say that the text ALSO has a ">" that should be written as text instead of being part of a tag. In that case, one should be converted to ">" and the other should be &gt; (or &#62;).
sigh It would be great if RTF was standardized to say "There is one and only one way to do this." Might also make it more resistant to attacks.
Side note: I'd love to help extend this program, but I know very little JavaScript. Unfortunately, my strong suit is Python :P
I hate to say it, but I can confirm that your convoluted example WOULD be considered as valid encapsulated html according to Microsoft Office 2016. I tested it myself (Not a fun process, let me tell you) and it properly converted \'3c ("<") and \'3e (">").
Also, keep in mind that the format is \'HH where H is a hexadecimal digit. Your example is valid as long as you actually have the character code in hexadecimal :P
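A small Python sketch of decoding \'HH escapes makes the hexadecimal point concrete. It assumes a single-byte code page (cp1252 here) and uses a hypothetical helper name; it ignores every other RTF control word.

```python
import re

# A backslash, an apostrophe, then exactly two hexadecimal digits.
_HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_hex_escapes(rtf_text: str, codepage: str = "cp1252") -> str:
    """Replace each \\'HH escape with the character it encodes in the code page."""
    return _HEX_ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1)).decode(codepage), rtf_text
    )


print(decode_hex_escapes(r"\'3cp\'3e"))     # <p>
print(decode_hex_escapes(r"<p\'62"))        # <pb  (0x62 is 'b', not '>')
print(repr(decode_hex_escapes(r"\'09panose-1")))  # '\tpanose-1'
```

Decimal 62 is hexadecimal 3e, which is why \'62 decodes to "b" while \'3e gives ">".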
I believe this issue has been addressed with the improvements and additional options in version 3.x.
Now, sometimes the rtf file will be generated in a different way from normal, and that is causing A LOT of issues. Certain special characters don't get written to the html file properly, some information that shouldn't be in the html file gets put in (sometimes even as raw text), formatting gets messed up in numbered/bulleted lists, etc. I believe I have figured out their exact causes, but I am not absolutely sure.
{\*\formatConverter converted from html;} should NOT be in the html file. However, it gets written as plain text as "converted from html;". That one should be simple enough; it seems to just be a tag that isn't handled correctly.
so that the de-encapsulated output should be
And would be displayed as
Unfortunately, the code may look something like this:
(((I cut out some of the raw text from the example. None of it contained any code, and was definitely not important to this issue.)))
Now, the output will be
but it probably should show up as
As you can see, in the actual html code, it shows up as the proper character. However, because of this line in the header:
the character is not read as UTF-8 like it should be, and displays as
Lastly, they’ll
This situation is the case for bullet points as well, as they are UTF-8 characters. The best way to solve this would be to replace characters written as \'hh (where h is a hexadecimal digit) with their html escaped versions. That way the character set doesn't have to be modified to make these characters work. Since your code successfully converts the character to UTF-8, I would recommend taking that character and converting it to Unicode. Once there, take the character's decimal value, convert it to a string, add a "&#" to the beginning, and then add a ";" to the end so that it is html escaped.
The current handling of this tag often causes each bullet to be doubled (normal bullets have a "*" instead of the actual bullet, but numbered lists have a double number).
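The escaping steps described above (take the character, get its decimal code point, wrap it in "&#" and ";") can be sketched in Python as follows. The helper name is hypothetical, and the mismatch is simulated by assuming the reader falls back to windows-1252; the actual charset in the affected files may differ.

```python
def to_numeric_entity(ch: str) -> str:
    """Wrap the character's decimal code point in '&#' and ';'."""
    return "&#" + str(ord(ch)) + ";"


# \'92 in cp1252 is the right single quotation mark (U+2019).
apostrophe = bytes.fromhex("92").decode("cp1252")

# Written as raw UTF-8 but read with a single-byte charset: mojibake.
utf8_bytes = f"they{apostrophe}ll".encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")

# Written as a numeric entity: pure ASCII, so the declared charset no longer matters.
escaped = f"they{to_numeric_entity(apostrophe)}ll"

print(mojibake)
print(escaped)  # they&#8217;ll
```

Because the escaped form contains only ASCII bytes, it displays the same way whichever ASCII-compatible charset the header declares.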
That's all I have for now.