mazira / rtf-stream-parser

Contains native Node classes for transforming an RTF byte stream into tokens, and de-encapsulating HTML
MIT License

Msg Email isn't fully converted #16

Open JamesDerrickHH opened 2 months ago

JamesDerrickHH commented 2 months ago

Not 100% sure if this is an issue with this package exactly, but it'd be great if you could help me track down the problem. I'm using @kenjiuno/msgreader, which decompresses the RTF for this package to then convert to HTML. In essence, my code looks like this:

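import MsgReader from '@kenjiuno/msgreader'
import { decompressRTF } from '@kenjiuno/decompressrtf'
import * as iconvLite from 'iconv-lite'
import { deEncapsulateSync } from 'rtf-stream-parser'

const buffer: ArrayBuffer = (file instanceof File) ? await file.arrayBuffer() : await file.getBuffer()
const message = new MsgReader(buffer)
const parsedMsg = message.getFileData()
// The compressed RTF body is a field of the parsed message
const rtfBlob = decompressRTF(Array.from(parsedMsg.compressedRtf))
const rtfText = iconvLite.decode(Buffer.from(rtfBlob), 'utf8')
const html = deEncapsulateSync(rtfText, { decode: iconvLite.decode }).text as string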

where file is a .msg file. I am basically only getting the heading of the email back and not the rest of it. Using the MSG Reader Demo 3 page, I can see that the decompressed RTF seems to contain everything from the email, but most of the content is gone after conversion. Most .msg files I have tested work fine, but some do not, and I cannot work out what it is about the decompressed RTF that the reader doesn't like.

JamesDerrickHH commented 2 months ago

I probably can't send you the actual email as it's work related, but here is a screenshot of what I get using the MSG Reader Demo 3 site. As you can see, there is plenty more decompressed RTF in the email that hasn't been converted. [Screenshot 2024-09-17 104645]

rossj commented 1 month ago

Hi James, most likely the RTF data is a bit malformed and actually ends early via a } that ends the top level RTF group. It is a bit ambiguous whether a reader is supposed to ignore group brackets ({, }) within an htmlrtf ignore area, but I've seen emails that only make sense with one interpretation or the other.
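One rough way to check whether this is what is happening is to look for the point where the top-level group closes and see how much RTF is left after it. The sketch below is only a diagnostic aid, not part of rtf-stream-parser: the helper name topLevelGroupEnd is made up for illustration, it counts every unescaped bracket (so it knows nothing about htmlrtf ignore areas), and it does not handle \bin payloads.

```typescript
// Hypothetical diagnostic helper (not a library API).
// Returns the index of the `}` that closes the top-level RTF group,
// or -1 if the group is never closed.
function topLevelGroupEnd(rtf: string): number {
  let depth = 0;
  for (let i = 0; i < rtf.length; i++) {
    const ch = rtf[i];
    if (ch === '\\') {
      // Skip the character after a backslash so that \{, \} and \\
      // are not mistaken for group brackets.
      i++;
      continue;
    }
    if (ch === '{') {
      depth++;
    } else if (ch === '}') {
      depth--;
      if (depth <= 0) return i;
    }
  }
  return -1;
}

// Example: anything after the closing bracket is dropped by a reader
// that stops at the end of the top-level group.
const sample = '{\\rtf1 a} trailing {b}';
const end = topLevelGroupEnd(sample);
const ignoredTail = end >= 0 ? sample.slice(end + 1) : '';
```

If ignoredTail is large for one of the problem emails, the top-level group is probably being closed early, which matches the symptom of only the heading coming through.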

You can try passing outlookQuirksMode: true as an additional option to deEncapsulateSync(). This will ignore group brackets within an htmlrtf ignore area and may give you longer output closer to what you expect. You can also try passing a warn option, e.g. warn: console.log, which may give you some useful diagnostic output.
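If console.log is too noisy, the warnings can be collected into an array instead. This is a minimal sketch under assumptions: runWithWarnings is a hypothetical helper, and it assumes the warn callback is called with a single message argument. The actual library call is shown only in the trailing comment, reusing rtfText and iconvLite from the snippet earlier in the thread.

```typescript
// Hypothetical helper (not a library API): runs one attempt and gathers
// every warning it emits instead of printing it.
function runWithWarnings<T>(
  attempt: (warn: (msg: string) => void) => T,
): { result: T; warnings: string[] } {
  const warnings: string[] = [];
  const result = attempt((msg) => {
    warnings.push(String(msg));
  });
  return { result, warnings };
}

// Assumed wiring with the real library (not executed here):
//   const { result, warnings } = runWithWarnings((warn) =>
//     deEncapsulateSync(rtfText, {
//       decode: iconvLite.decode,
//       outlookQuirksMode: true,
//       warn,
//     }),
//   );
```

Keeping the warnings next to the result makes it easier to log them only for the emails that come out truncated.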

If you're able & willing, you could email the problem email or RTF to me as ross at mazira dot com and I can inspect it further.

JamesDerrickHH commented 1 month ago

Hi Ross, thanks so much for your reply. Early days, but outlookQuirksMode seems to have fixed the issues we were having. Only a few emails were giving us problems, and they all seem fine now.

rossj commented 1 month ago

I just wanted to note that I've seen it both ways - messages that appear longer / more correct without the quirks mode parsing, and messages that appear longer / more correct with the quirks mode parsing. You will have to decide if you want to try and always match what Outlook shows (I think it uses quirks mode but I don't remember off the top of my head), or perhaps you could try and always use the "best" by running it both ways and comparing the output / using the longest output.
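The "run it both ways" idea could look roughly like the sketch below. It is an illustration rather than a recommended implementation: pickLongest and Attempt are hypothetical names, "best" is taken to mean simply the longer string, and ties fall back to the non-quirks result. The library call itself is left to the caller-supplied function so the comparison logic stays independent of it.

```typescript
// Hypothetical types and helper (not a library API).
interface Attempt {
  outlookQuirksMode: boolean;
  text: string;
}

// Runs the conversion once without and once with quirks mode and
// returns whichever attempt produced the longer output.
function pickLongest(run: (outlookQuirksMode: boolean) => string): Attempt {
  const strict: Attempt = { outlookQuirksMode: false, text: run(false) };
  const quirks: Attempt = { outlookQuirksMode: true, text: run(true) };
  // On a tie, keep the non-quirks result.
  return quirks.text.length > strict.text.length ? quirks : strict;
}

// Assumed wiring with the real library (not executed here):
//   const best = pickLongest((outlookQuirksMode) =>
//     deEncapsulateSync(rtfText, {
//       decode: iconvLite.decode,
//       outlookQuirksMode,
//     }).text as string,
//   );
```

Length is a crude measure, so it may be worth recording which mode won for each email in case the choice needs to be reviewed later.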