pst-format / libpst

library for reading Microsoft Outlook PST files
GNU General Public License v2.0

Handling msg within an msg #14

Open tballison opened 2 months ago

tballison commented 2 months ago

Thank you so much for an awesome library. While writing a wrapper for readpst for Apache Tika, we noticed a small number of cases where fewer attachments were extracted when selecting the .msg output option. Tika's Jira issue: https://issues.apache.org/jira/browse/TIKA-4250

We were able to reproduce this with a test file we have in our unit tests: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testPST.pst

The last email "8" is an email with an embedded email, and inside that embedded email is a docx file.

This is processed correctly with rfc822 and mbox output. With .msg output, however, the 8.msg file contains no embedded msg attachment.
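For anyone who wants to check the rfc822 side independently, here is a rough sketch using only Python's standard `email` package. It lists attachments recursively, descending into embedded `message/rfc822` parts, so the nested docx should show up one level down. The function name `list_attachments` is made up for this illustration; nothing here touches libpst or the .msg format itself.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def list_attachments(msg: EmailMessage, depth: int = 0):
    """Return (depth, content_type, filename) for every attachment.

    Embedded emails (message/rfc822 parts) are listed and then walked
    recursively, so attachments inside them appear with depth + 1.
    """
    found = []
    for part in msg.iter_attachments():
        found.append((depth, part.get_content_type(), part.get_filename()))
        if part.get_content_type() == "message/rfc822":
            # Under policy.default, get_content() returns the embedded message.
            inner = part.get_content()
            found.extend(list_attachments(inner, depth + 1))
    return found


def list_attachments_in_file(path):
    """Parse an .eml file from disk and list its attachments (hypothetical helper)."""
    with open(path, "rb") as fh:
        msg = message_from_bytes(fh.read(), policy=policy.default)
    return list_attachments(msg)
```

Running `list_attachments_in_file` on the .eml produced for email "8" would be expected to show one `message/rfc822` entry at depth 0 and the docx at depth 1, assuming the rfc822 output is structured the way described above.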

tballison commented 2 months ago

test-pst.zip

I'm including the original .pst, the mbox, the .msg, the .eml and the debug file.

tballison commented 2 months ago

Separately, we noticed that we're getting non-deterministic output when we select the .msg option. Sometimes we get 7 files and sometimes we get 8.
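A minimal sketch of how the varying file count can be checked, assuming each readpst run has already written its output to its own directory. The helper names `count_by_extension` and `runs_agree` are hypothetical, and the script does not invoke readpst itself; it only compares what is on disk.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(output_dir):
    """Count regular files under output_dir, grouped by lowercased extension."""
    return Counter(
        p.suffix.lower() for p in Path(output_dir).rglob("*") if p.is_file()
    )


def runs_agree(output_dirs):
    """Return (all_equal, per_run_counts) for a list of output directories."""
    counts = [count_by_extension(d) for d in output_dirs]
    all_equal = all(c == counts[0] for c in counts)
    return all_equal, counts
```

If `runs_agree(["run1", "run2", "run3"])` returns `False` for repeated runs over the same PST, the per-run counts show which extension differs (for example 7 versus 8 `.msg` files).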

pabs3 commented 2 weeks ago

To be clear, the libpst library has a long history with many contributors. The current maintainers didn't create the library, but they try to merge patches promptly and work on it when they are able to.

Thanks for the report and the test files, we'll take a look when we can.

The issue with non-deterministic output is known and has a workaround in git master. Please comment on that issue if you still see it with the latest commit:

https://github.com/pst-format/libpst/issues/7