Possible Bug: one instance of <meta> tag always missing when reading emails from PST

Nick-P-Orr commented 3 years ago

I've been developing a Python tool that reads every email from a PST exported from Outlook and writes each one to its own .html file. The issue is that when I open the PST in Outlook and view the source of an individual email, it includes this specific line:

<meta http-equiv=Content-Type content="text/html; charset=utf-8">

which IS NOT included when I import the PST with pywin32 and read all the emails in it. Here is what it looks like in context:

From Outlook: <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 15 (filtered medium)">

What is exported from the tool: <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta name=Generator content="Microsoft Word 15 (filtered medium)">

The contents of the emails are otherwise ENTIRELY identical except for that one tag.

My code:

htmlEmails = 0
encryptedEmails = 0
totalEmails = 0
richPlainEmails = 0
filenameCount = 1
mycounter2 = 1

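# `outlook` is assumed to be the MAPI namespace object, e.g.
# win32com.client.Dispatch('Outlook.Application').GetNamespace('MAPI')
def find_pst_folder(OutlookObj, pst_filepath):
    # Return the root folder of the store backed by pst_filepath, or None
    for Store in OutlookObj.Stores:
        if Store.IsDataFileStore and Store.FilePath == pst_filepath:
            return Store.GetRootFolder()
    return None

# Adjust the PST path to use backslashes so it matches Store.FilePath
selectedPST = selectedPST.replace('/', '\\')
print('\nRunning:', selectedPST)
outlook.AddStore(selectedPST)
PSTFolderObj = find_pst_folder(outlook, selectedPST)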

def enumerate_folders(FolderObj):
    for ChildFolder in FolderObj.Folders:
        enumerate_folders(ChildFolder)
    iterate_messages(FolderObj)

def iterate_messages(FolderObj):
    global mycounter2
    global encryptedEmails
    global richPlainEmails
    global totalEmails
    global htmlEmails

    for item in FolderObj.Items:
        totalEmails += 1
        try:
            body_content = item.HTMLBody
            mysubject = item.Subject
            writeToFile(body_content, exportPath, mysubject)
            mycounter2 += 1
            htmlEmails += 1
        except AttributeError:
            # No HTMLBody, or writeToFile rejected a converted rich/plain text body
            richPlainEmails += 1
        except Exception:
            # Any other failure is counted as an encrypted (unreadable) item
            encryptedEmails += 1

def writeToFile(messageHTML, path, mysubject):
    global mycounter2
    filename = '\\htmloutput' + str(mycounter2) + '.html'

    # Check if email is rich or plain text first (only HTML emails are desired)
    if '<!-- Converted from text/plain format -->' in messageHTML or '<!-- Converted from text/rtf format -->' in messageHTML:
        raise AttributeError()

    # "x" mode fails if the file already exists; the with block always closes it
    with open(path + filename, "x", encoding='utf-8') as file:
        try:
            # Text mode turns '\n' back into '\r\n' on Windows, so normalise first
            messageHTML = messageHTML.replace('\r\n', '\n')
            file.write(messageHTML)
        # Handle any potential unexpected Unicode error
        except Exception as e:
            print('Exception: ', e)
            # Print the subject and counter to more easily find the offending email
            print('Subject: ', mysubject)
            print(mycounter2)

Because the emails are otherwise identical, I can only assume this is being done by the library. Is there a reason that meta tag is excluded, or is it a bug in pywin32?

kxrob commented 3 years ago

The pywin32 COM interface doesn't do any such special magic; it just forwards the string as it comes in. Maybe Outlook is being smart and returns item.HTMLbody already decoded as Unicode. In that case the charset declaration tag (utf-8) would be redundant, and it could conflict if you saved the file using a different encoding. Indeed, so far you write with a fixed encoding (without BOM) instead of trying to adopt or fix an existing declaration: file = open(path + filename, "x", encoding='utf-8')
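To illustrate the kind of conflict meant here, a small standalone sketch (not taken from the tool above; the sample text is made up): a decoded Python str has no charset of its own, so a declaration only matters once the text is encoded to bytes, and a declaration that disagrees with the bytes garbles non-ASCII characters.

```python
# A decoded str, as a COM string property would arrive in Python
text = "café"

# The bytes that open(..., encoding="utf-8") would put on disk
data = text.encode("utf-8")

# Read back with the matching charset: round-trips cleanly
assert data.decode("utf-8") == text

# Read back as if the page declared windows-1252 instead
garbled = data.decode("cp1252")
print(garbled)  # cafÃ©
```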

So maybe re-insert your own charset tag, matching the encoding you use when you write the string to file as bytes, or use a (utf-8) BOM ...
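A rough sketch of the first suggestion, assuming the tag should sit directly after the opening <head> as in the Outlook source quoted earlier. The helper names ensure_charset_meta and write_html are hypothetical, not part of pywin32 or Outlook, and the regular expressions only cover simple markup like the sample above.

```python
import re

# An existing Content-Type declaration, with or without quoted attributes
_META_CHARSET = re.compile(
    r'<meta\s+http-equiv=["\']?Content-Type["\']?[^>]*>', re.IGNORECASE)
# The opening <head> tag (but not <header>)
_HEAD_OPEN = re.compile(r'<head(\s[^>]*)?>', re.IGNORECASE)


def ensure_charset_meta(html, encoding="utf-8"):
    """Return html with a Content-Type meta tag that names `encoding`."""
    tag = ('<meta http-equiv=Content-Type '
           'content="text/html; charset=%s">' % encoding)
    if _META_CHARSET.search(html):
        # Fix an existing declaration so it agrees with the file encoding
        return _META_CHARSET.sub(lambda m: tag, html, count=1)
    head = _HEAD_OPEN.search(html)
    if head:
        # Re-insert the declaration right after <head>
        return html[:head.end()] + tag + html[head.end():]
    # No <head> found: leave the markup untouched
    return html


def write_html(path, html, encoding="utf-8"):
    # newline="" writes line endings exactly as they appear in the string
    with open(path, "w", encoding=encoding, newline="") as fh:
        fh.write(ensure_charset_meta(html, encoding))
```

Called as write_html(some_path, item.HTMLBody), this would write the body with a declaration that agrees with the bytes on disk. Whether the result matches Outlook's own source byte for byte is not something this sketch can guarantee.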

Nick-P-Orr commented 3 years ago

@kxrob Hmm, ok. I believe you may be right here that it is Outlook trying to be smart. Unfortunately for my needs, it not retaining that line is problematic (I need the source basically as it would be seen in Outlook). Thanks for the pointers though.

mhammond / pywin32, issue #1663