nissl-lab / toxy

.net text extraction framework
Apache License 2.0
358 stars, 107 forks

Support for MSG files #9

Closed Sicos1977 closed 8 years ago

Sicos1977 commented 9 years ago

Hi,

If you want to add support for Outlook MSG files then use this library --> https://github.com/Sicos1977/MSGReader

Greetings, Kees van Spelde

tonyqus commented 9 years ago

Looks nice. We will consider including it in the next version. Thank you!

tonyqus commented 9 years ago

One question about msgreader:

I saw that you have an RTF reader in your implementation, but I didn't find any code that uses it. Where is it used?

Sicos1977 commented 9 years ago

Outlook stores its HTML content embedded in RTF. The RTF reader is used to get the HTML content out of the RTF stream. See the ReadHtmlContent method in the DomDocument.cs file, and the BodyHtml property that falls back to it when no HTML MAPI property is found:

    /// <summary>
    /// Read embedded Html content from rtf
    /// </summary>
    /// <param name="reader"></param>
    private void ReadHtmlContent(Reader reader)

        /// <summary>
        /// Returns the body of the Outlook message in HTML format.
        /// </summary>
        /// <value> The body of the Outlook message in HTML format. </value>
        public string BodyHtml
        {
            get
            {
                if (_bodyHtml != null)
                    return _bodyHtml;

                // Get value for the HTML MAPI property
                var htmlObject = GetMapiProperty(MapiTags.PR_BODY_HTML);
                string html = null;

                if (htmlObject is string)
                    html = htmlObject as string;
                else if (htmlObject is byte[])
                {
                    var htmlByteArray = htmlObject as byte[];
                    html = InternetCodePage.GetString(htmlByteArray);
                }

                // When there is no HTML found
                if (html == null)
                {
                    // Check if we have HTML embedded into rtf
                    var bodyRtf = BodyRtf;
                    if (bodyRtf != null)
                    {
                        var rtfDomDocument = new Rtf.DomDocument();
                        rtfDomDocument.LoadRtfText(bodyRtf);
                        if (!string.IsNullOrEmpty(rtfDomDocument.HtmlContent))
                            html = rtfDomDocument.HtmlContent.Trim('\r', '\n');
                    }
                }

                _bodyHtml = html;
                return _bodyHtml;
            }
        }
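
As a rough illustration of the fallback above: RTF that carries encapsulated HTML declares it with the `\fromhtml` control word in its header, so a cheap pre-check is possible before running a full RTF parse. The sketch below is not part of MSGReader; the class `RtfHtmlProbe` and the method `LooksLikeEncapsulatedHtml` are hypothetical names chosen for this example, and it only detects the marker, it does not extract the HTML.

```csharp
using System;

public static class RtfHtmlProbe
{
    // Hypothetical helper (not an MSGReader API): returns true when the RTF
    // text contains the \fromhtml control word, which marks encapsulated HTML.
    public static bool LooksLikeEncapsulatedHtml(string rtf)
    {
        if (string.IsNullOrEmpty(rtf))
            return false;

        // Ordinal comparison: control words are plain ASCII and case-sensitive.
        return rtf.IndexOf(@"\fromhtml", StringComparison.Ordinal) >= 0;
    }

    public static void Main()
    {
        // Made-up sample bodies, for illustration only.
        var encapsulated = @"{\rtf1\ansi\fromhtml1 {\*\htmltag64 <p>}Hello{\*\htmltag72 </p>}}";
        var plain = @"{\rtf1\ansi Hello}";

        Console.WriteLine(LooksLikeEncapsulatedHtml(encapsulated)); // True
        Console.WriteLine(LooksLikeEncapsulatedHtml(plain));        // False
    }
}
```

A check like this could only decide whether the RTF fallback is worth attempting; recovering the actual markup still needs a real RTF parse such as the one in DomDocument.cs.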

tonyqus commented 9 years ago

Hi,

I read the msgreader code and found that you are using the Win32 API to read IStorage files (ActiveX documents or OLE2). For toxy, this kind of dependency is not recommended, as toxy also supports Mono on Linux.

I'll modify msgreader to get rid of the Win32 API.

Tony

Sicos1977 commented 9 years ago

That would be nice; just fork the code and push your modifications back. I also started a new project called MSGWriter; that one no longer uses the Win32 API either. Due to lack of time it isn't finished yet, and it will probably take a few months.