Nick-P-Orr closed this issue 3 years ago
The pywin32 COM interface doesn't do any such special magic; it just forwards the string as it comes in.
Maybe Outlook is being smart when it returns item.HTMLBody already decoded as Unicode. In that case the charset declaration tag (utf-8) would be redundant, and it could conflict if you save to a file using a different encoding. Indeed, so far you use a fixed encoding (without BOM) instead of trying to adopt or fix an existing declaration:
file = open(path + filename, "x", encoding='utf-8')
So maybe re-insert your own charset tag to match the encoding you write the file with, or use a (utf-8) BOM ...
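A minimal sketch of the "re-insert your own charset tag" idea, assuming the HTML arrives as an already-decoded Python str and that placing the tag directly after the opening <head> is acceptable. The names ensure_charset_meta and write_html are hypothetical helpers for illustration, not part of pywin32 or Outlook:

```python
import re

# Matches an existing charset declaration such as
# <meta http-equiv=Content-Type content="text/html; charset=utf-8">
_CHARSET_META = re.compile(r"<meta\b[^>]*\bcharset\s*=[^>]*>", re.IGNORECASE)
# Matches the opening <head> tag (but not e.g. <header>).
_HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def ensure_charset_meta(html, charset="utf-8"):
    """Return html with a charset <meta> tag right after <head>, if none exists."""
    if _CHARSET_META.search(html):
        return html  # a declaration is already present; leave it untouched
    match = _HEAD_OPEN.search(html)
    if match is None:
        return html  # no <head> to insert into; return the input unchanged
    tag = '<meta http-equiv=Content-Type content="text/html; charset=%s">' % charset
    return html[:match.end()] + tag + html[match.end():]


def write_html(path, html):
    """Write html as UTF-8 (no BOM), declaring that same encoding in the markup."""
    # Mode "x" raises FileExistsError if the file already exists,
    # matching the open() call quoted above.
    with open(path, "x", encoding="utf-8") as f:
        f.write(ensure_charset_meta(html, "utf-8"))
```

For the BOM route instead, Python's "utf-8-sig" codec writes a UTF-8 BOM at the start of the file when passed as the encoding argument to open(). Note that a regex is only a rough way to handle HTML; it is used here because the tag in question has a very regular shape.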
@kxrob Hmm, ok. I believe you may be right here that it is Outlook trying to be smart. Unfortunately for my needs, it not retaining that line is problematic (I need the source basically as it would be seen in Outlook). Thanks for the pointers though.
I've been developing a Python tool to ingest all emails from a PST exported from Outlook and write them to individual .html files. The issue is that when I open the PST in Outlook and check the source for individual emails, it includes this specific line:
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
which IS NOT included when I import the PST with pywin32 and read all the emails in it. Here is what that looks like in a chunk:
From Outlook:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 15 (filtered medium)">
What is exported from the tool:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta name=Generator content="Microsoft Word 15 (filtered medium)">
The contents of the emails are otherwise ENTIRELY identical except for that one tag.
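One way to check that claim mechanically, as a rough sketch: strip any charset <meta> tag from both strings and compare what is left. The function name same_apart_from_charset_meta is made up for illustration, and the regex assumes the tag has the simple shape shown above:

```python
import re

# Matches a charset declaration such as
# <meta http-equiv=Content-Type content="text/html; charset=utf-8">
_CHARSET_META = re.compile(r"<meta\b[^>]*\bcharset\s*=[^>]*>", re.IGNORECASE)


def same_apart_from_charset_meta(outlook_html, exported_html):
    """True if the two strings are equal once charset <meta> tags are removed."""
    stripped_a = _CHARSET_META.sub("", outlook_html)
    stripped_b = _CHARSET_META.sub("", exported_html)
    return stripped_a == stripped_b
```

This also returns True for two identical strings, so it only confirms that the charset tag is the sole possible difference.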
My code:
Because the emails are otherwise identical, I can only assume this is being done by the library. I'm wondering if there's a reason that meta tag is excluded, or if it's a bug in pywin32?