loftuxab / alfresco-community-loftux

Alfresco Community by Loftux
https://loftux.com
GNU Lesser General Public License v3.0

EMLTransformer ignoring multipart emails #22

Closed loftux closed 9 years ago

loftux commented 9 years ago

The transformer for RFC822 messages, EMLTransformer.java, has a severe bug that hurts performance for anyone who stores a lot of emails. Transforming a multipart email always returns the entire email, including the base64 text of its attachments.

For indexing, this means the base64 text of every encoded attachment gets indexed as plain text. A client of mine with 100,000+ emails could enter pretty much any character combination and get a hit, and the index grew to 300+ GB. Previews of EML files can run to 300+ pages in the PdfJS viewer, since the attachment base64 text is displayed.

How to reproduce

Note: A long-outstanding issue is that the HTML part of an email is included when transforming to plain text, so you will probably see HTML in the transformation output.

What is the cause?

In EMLTransformer.java, lines 85-90, the mimetype of the message is set to text/plain. This destroys the message's actual multipart type, so when getContent is called the result is always a String and never an instanceof Multipart. Just remove that and it works. It may have been needed with javax.mail 1.4.x, but it seems it is no longer needed with 1.5.x. I will also look at making sure that a plain text transformation does not include the HTML part of the message, and at creating a transformer that picks out the HTML part and uses it if available.
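To illustrate the intended behaviour once the content is seen as a Multipart again, here is a minimal sketch of the part walk: recurse into multipart containers, keep inline text/plain bodies, and skip attachments and HTML. This is not the actual EMLTransformer code. To stay runnable without the javax.mail dependency, it uses a hypothetical `Part` class as a simplified stand-in for javax.mail's Part/Multipart, and the names `EmlTextSketch`, `leaf`, `multipart` and `extractPlainText` are made up for the example. In real code the same checks would presumably go through `getContent()`, `isMimeType(...)` and `getDisposition()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-alone sketch; Part is a hypothetical stand-in for javax.mail.Part. */
class EmlTextSketch {

    /** Hypothetical, simplified model of one MIME part. */
    static final class Part {
        final String mimeType;      // e.g. "text/plain" or "multipart/mixed"
        final boolean attachment;   // true when the disposition is "attachment"
        final String text;          // body of a leaf part, null for containers
        final List<Part> children;  // sub-parts of a multipart/*, empty for leaves

        private Part(String mimeType, boolean attachment, String text, List<Part> children) {
            this.mimeType = mimeType;
            this.attachment = attachment;
            this.text = text;
            this.children = children;
        }

        static Part leaf(String mimeType, boolean attachment, String text) {
            return new Part(mimeType, attachment, text, new ArrayList<Part>());
        }

        static Part multipart(String subtype, Part... children) {
            return new Part("multipart/" + subtype, false, null, Arrays.asList(children));
        }
    }

    /** Collects only inline text/plain bodies, recursing into multipart containers. */
    static String extractPlainText(Part part) {
        StringBuilder out = new StringBuilder();
        collect(part, out);
        return out.toString();
    }

    private static void collect(Part part, StringBuilder out) {
        if (part.attachment) {
            return; // never emit attachment content; this is where base64 text leaked
        }
        if (part.mimeType.startsWith("multipart/")) {
            for (Part child : part.children) {
                collect(child, out);
            }
        } else if (part.mimeType.equals("text/plain")) {
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(part.text);
        }
        // Everything else (text/html, images, ...) is skipped for a plain text transformation.
    }
}
```

For a multipart/mixed message holding a multipart/alternative (text/plain plus text/html) and a PDF attachment, `extractPlainText` returns only the text/plain body. An HTML-preferring transformer could follow the same walk with the `text/plain` check swapped for `text/html`.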

https://issues.alfresco.com/jira/browse/ALF-21259

loftux commented 9 years ago

If there is a winmail.dat attachment, it is currently just ignored. There is a Java library to read those, see http://www.freeutils.net/source/jtnef/ Other third-party tools are listed at http://www.oracle.com/technetwork/java/javamail/third-party-136965.html This comment is here for reference, should there be a future need to extend the transformation support.
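As a rough sketch of what "ignore winmail.dat" could look like as a check, the helper below flags a part by file name or by MIME type. The class and method names are hypothetical, and the two MIME type strings are assumptions about what Outlook-generated TNEF parts commonly carry, not something taken from the EMLTransformer source.

```java
import java.util.Locale;

/** Hypothetical helper; not part of EMLTransformer. */
class TnefSketch {

    /** Returns true when a part looks like an Outlook TNEF container (winmail.dat). */
    static boolean isTnefAttachment(String fileName, String mimeType) {
        String name = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        String type = mimeType == null ? "" : mimeType.toLowerCase(Locale.ROOT);
        // Assumed identifiers: the conventional file name and two commonly seen MIME types.
        return name.equals("winmail.dat")
                || type.startsWith("application/ms-tnef")
                || type.startsWith("application/vnd.ms-tnef");
    }
}
```

A part walk could call this before descending into a part and skip it when it returns true; a later extension could hand such parts to a TNEF reader instead.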