libyal / libpff

Library and tools to access the Personal Folder File (PFF) and the Offline Folder File (OFF) format
GNU Lesser General Public License v3.0

HTML export img src links do not link to Attachments folder #14

Closed TobalJackson closed 7 years ago

TobalJackson commented 8 years ago

Not sure if this is appropriate for an issue, but when I use LibPFF to explode a PST file with the "HTML" format specified, none of the inline image links in the resulting 'Message.html' point at the files in the message's ./Attachments folder. Instead, each link is prefixed with "cid:" and followed by what looks like a random identifier. Example: <img src="cid:image007.jpg@01D0C7A0.D12DD140"> when the matching file attachment is in "./Attachments/image007.jpg". To get the HTML to render properly, I have to manually rewrite the img src= link to read: <img src="Attachments/image007.jpg">.

joachimmetz commented 8 years ago

pffexport exports the data "as stored" in the file; this is the desired behavior for its main use case.

To correct the src values of the HTML would be to alter the data.

> In order to get the HTML to render properly I have to manually rewrite the img src= link to read:

You could write a script that does this for you instead of doing it manually.
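A minimal sketch of such a script, in Python, assuming the common case described in this thread: the part of the "cid:" value before the "@" equals the exported attachment's file name. The function name `rewrite_cid_links` and its parameters are made up for illustration and are not part of libpff or pffexport. Links that cannot be matched by name are left untouched and reported back.

```python
import re
from urllib.parse import quote, unquote

# Matches src="cid:..." (or single-quoted) inside an <img> tag.
CID_SRC = re.compile(
    r'(<img\b[^>]*?\bsrc\s*=\s*)(["\'])cid:([^"\']+)\2',
    re.IGNORECASE,
)


def rewrite_cid_links(html, attachment_names, attachments_dir="Attachments"):
    """Return (rewritten_html, unresolved_cids).

    ASSUMPTION: the text before '@' in the cid is the attachment file name.
    This holds for the example in this thread but not for every message.
    """
    names = set(attachment_names)
    unresolved = []

    def replace(match):
        prefix, quote_char, cid = match.groups()
        candidate = unquote(cid.split("@", 1)[0])
        if candidate in names:
            target = f"{attachments_dir}/{quote(candidate)}"
            return f"{prefix}{quote_char}{target}{quote_char}"
        # No file with that name: keep the original link and report it.
        unresolved.append(cid)
        return match.group(0)

    return CID_SRC.sub(replace, html), unresolved


html = '<img src="cid:image007.jpg@01D0C7A0.D12DD140">'
new_html, unresolved = rewrite_cid_links(html, ["image007.jpg"])
print(new_html)
print(unresolved)
```

In practice `attachment_names` could come from something like `os.listdir()` on the message's Attachments folder. The script writes a modified copy of the HTML, so the "as stored" export itself should be kept unchanged alongside it.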

TobalJackson commented 8 years ago

joachimmetz, I am actually in the process of writing a rather large script to do this. The main problem I'm having is that the naming scheme is inconsistent between the "cid:" links and the file names in the exploded "Attachments" folder. For most messages (>90%), the file name in the "cid:" link matches the name of the file in the Attachments folder. In some cases, however, the "cid:" name has nothing in common with the file name of the corresponding image attachment in "Attachments", and I have to resort to some rather painstaking measures (such as matching the <img width=xx height=xx... attributes against the dimensions of the attachment files) to rewrite the HTML.

I guess what I'm wondering is whether there is some sort of lookup table that correlates the inline image link names with the file names in the "Attachments/" folder, so that I can determine exactly which files are linked to from within the HTML and which are not?

joachimmetz commented 8 years ago

So technically a PST file is a MAPI properties database. More info about these MAPI properties can be found here: https://googledrive.com/host/0B3fBvzttpiiSRlR1QkU5Vk43ZWs/MAPI%20definitions.pdf

My guess would be that there is likely a MAPI property that maps the cid to the corresponding attachment in the PST. Note that pffexport can alter the resulting filename, e.g. if the name contains characters that are unsupported by the file system used, or when duplicate filenames are encountered.

I would start with having a look at the contents of the corresponding attachments "table" or the table of an individual attachment as stored in the database.
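If such a property turns up in the attachment table, the rewrite can be driven by an explicit lookup table instead of guessing from file names. The sketch below, in Python, only shows that last step. It assumes you have already built a dict from Content-ID to exported file path by some other means. One candidate property to look for is PidTagAttachContentId (0x3712), but nothing in this thread confirms that it is set for every inline image or how pffexport exposes it. The names `apply_cid_table` and `normalize_content_id`, and the file name `1_image007.jpg`, are invented for illustration.

```python
import re

# Matches any quoted "cid:..." reference, not only those in <img> tags.
CID_REF = re.compile(r'(["\'])cid:([^"\']+)\1', re.IGNORECASE)


def normalize_content_id(value):
    """Drop whitespace and the optional <...> wrapper around a Content-ID."""
    return value.strip().strip("<>")


def apply_cid_table(html, cid_to_path):
    """Rewrite cid: references using an explicit Content-ID -> path table.

    ASSUMPTION: cid_to_path was built elsewhere, e.g. from a per-attachment
    MAPI property. Returns (rewritten_html, cids_missing_from_table).
    """
    table = {normalize_content_id(k): v for k, v in cid_to_path.items()}
    missing = []

    def replace(match):
        quote_char, cid = match.groups()
        path = table.get(normalize_content_id(cid))
        if path is None:
            missing.append(cid)
            return match.group(0)
        return f"{quote_char}{path}{quote_char}"

    return CID_REF.sub(replace, html), missing


# Hypothetical table: the exported name differs from the cid name here.
table = {"<image007.jpg@01D0C7A0.D12DD140>": "Attachments/1_image007.jpg"}
html = '<img src="cid:image007.jpg@01D0C7A0.D12DD140">'
print(apply_cid_table(html, table))
```

Keying the table on the full Content-ID rather than the file name sidesteps the renaming pffexport may apply for unsupported characters or duplicates, since the exported path is taken from the table as-is.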