pst-format / libpst

library for reading Microsoft Outlook PST files
GNU General Public License v2.0

filenames of attachments are not properly encoded #1

Closed: brychcy closed this issue 2 years ago

brychcy commented 2 years ago

When reading the .eml files generated by readpst, the filenames of some attachments could not be retrieved with the Python email library, e.g. if they contain parentheses, as in filename*=utf-8''Status_Announced_Invoice(s).pdf;

It turned out that rfc2231_string in readpst.c doesn't escape all characters as required.

The correct output would be filename*=utf-8''Status_Announced_Invoice%28s%29.pdf;
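For illustration, here is a minimal Python sketch of the escaping I would expect. It is not the code from readpst.c; `encode_ext_value` is a hypothetical helper, and the set of characters left unescaped is the `attr-char` set from RFC 5987, which I am assuming is a safe (conservative) subset of what RFC 2231 permits.

```python
from urllib.parse import quote

# Punctuation that RFC 5987 attr-char allows unescaped, in addition to
# ASCII letters and digits. Assumption: this conservative set is also
# acceptable for RFC 2231 parameters in MIME headers.
ATTR_CHAR_PUNCT = "!#$&+-.^_`|~"


def encode_ext_value(name, charset="utf-8"):
    """Hypothetical helper: build the value part of filename*=... ."""
    # Everything outside the allowed set, including "(" and ")",
    # is percent-encoded byte by byte.
    return f"{charset}''" + quote(name, safe=ATTR_CHAR_PUNCT, encoding=charset)


print(encode_ext_value("Status_Announced_Invoice(s).pdf"))
# prints: utf-8''Status_Announced_Invoice%28s%29.pdf
```

Parentheses need escaping because they are tspecials in MIME headers, where they delimit comments.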

pabs3 commented 2 years ago

Thanks for the report.

Could you attach an example PST file? I will need it to verify the current incorrect behaviour and review the changes that are made to the behaviour by your patch.

-- bye, pabs

https://bonedaddy.net/pabs3/

pabs3 commented 2 years ago

Please also attach an example script using the Python email library.

-- bye, pabs

https://bonedaddy.net/pabs3/

brychcy commented 2 years ago

libpst1.pst.gz

As requested, here is an example PST file (compressed with gzip, as required by GitHub).

The name of the attached file in the contained mail is "Hello-(123)-World.pdf"
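A quick way to check what the parser does with a correctly escaped parameter, without needing the PST file: the sketch below feeds a hand-written header to the Python email library. The header text is my assumption of what a fixed readpst would emit for this attachment, not actual readpst output.

```python
import email
import email.policy

# Hand-written message with the filename percent-encoded per RFC 2231.
# Assumption: this is what the fixed output would look like.
raw = (
    "Content-Type: application/pdf\n"
    "Content-Disposition: attachment; filename*=utf-8''Hello-%28123%29-World.pdf\n"
    "\n"
    "dummy body\n"
)

msg = email.message_from_string(raw, policy=email.policy.default)

# get_filename() decodes the extended parameter back to the original name.
name = msg.get_filename(failobj="")
print(name)
```

With the escaped form, the decoded name should match the original attachment name, parentheses included.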

brychcy commented 2 years ago

A simple Python script for printing the file names of PDF attachments:

#!/usr/bin/env python3
# use python 3.6 or later

import email
import email.policy

# point this to the file generated with "readpst -e ..."
eml_path = "/Users/till/opensource/libpst/test/patched/Outlook-Datendatei/libpst1/1.eml"

# parse with the modern policy so that headers are decoded on access
with open(eml_path, "rb") as f:
    msg = email.message_from_binary_file(f, policy=email.policy.default)

# uncomment the following lines to print the structure
# from email.iterators import _structure
# _structure(msg)

found = False
for part in msg.walk():
    if part.is_attachment() and part.get_content_type() == 'application/pdf':
        # separate name so the path above is not overwritten;
        # an empty string means the filename could not be decoded
        attachment_name = part.get_filename(failobj="")
        found = True
        print("found: " + attachment_name)

if not found:
    print("no pdf found!")