pst-format / libpst

library for reading Microsoft Outlook PST files
GNU General Public License v2.0

Feature request: readpst should produce identical output for identical mails (?switch-controlled? boundary behavior) #9

Open fhanzlik opened 1 year ago

fhanzlik commented 1 year ago

My current task is to extract mail from several (10+) .PST files (all from one account, collected over the past 15 years or so as backups), remove duplicates, and convert the mail into a Maildir structure. My idea was to extract the individual messages from each .PST file into separate trees (using libpst/readpst), then delete the duplicates (using e.g. fdupes), and then merge the results.

In practice (apart from the problem that the number of extracted files differs when the same .pst file is processed repeatedly, which issue #7 touches on), I ran into a problem detecting identical/duplicate messages, because readpst currently generates the internal message boundaries as random strings. As a result, even identical messages do not compare as identical:

$ diff /home/mail/outlook-r2020/archive.pst.mdi/.Doručená\ pošta/cur/1681064600.005298:2,S /home/mail/outlook-r2023/outlook.pst.mdi/.Doručená\ pošta/cur/1681059416.005051:2,S 
38c38
<       boundary="--boundary-LibPST-iamunique-1906170776_-_-"
---
>       boundary="--boundary-LibPST-iamunique-1627685354_-_-"  
41c41
< ----boundary-LibPST-iamunique-1906170776_-_-
---
> ----boundary-LibPST-iamunique-1627685354_-_-  
112782c112782
< ----boundary-LibPST-iamunique-1906170776_-_-
---
> ----boundary-LibPST-iamunique-1627685354_-_-  

Perhaps it should be possible (via some switch controlling this behavior) to generate boundary strings that are predictable and the same in all mails, so that identical mails would also produce identical message files (in terms of content, not file names).
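Until readpst itself is deterministic, one possible workaround is to normalize the boundary lines before comparing files. The following is only a minimal sketch, assuming the boundaries always have the `boundary-LibPST-iamunique-<digits>` form seen in the diff above. The helper name `normalize_libpst_boundary` is hypothetical and not part of libpst.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Fixed prefix that precedes the random number in the diff above. */
#define LIBPST_MARKER "boundary-LibPST-iamunique-"

/*
 * Hypothetical helper (not part of libpst): rewrites, in place, every run
 * of digits that follows LIBPST_MARKER to the single digit "0".
 * The line can only shrink, so no extra buffer space is needed.
 * Returns the number of boundaries that were rewritten.
 */
int normalize_libpst_boundary(char *line)
{
    assert(line != NULL);

    int replaced = 0;
    char *p = line;

    while ((p = strstr(p, LIBPST_MARKER)) != NULL) {
        char *digits = p + strlen(LIBPST_MARKER);
        char *end = digits;

        /* Find the end of the random number. */
        while (isdigit((unsigned char)*end))
            end++;

        if (end > digits) {
            /* Keep one '0' and move the rest of the line up behind it. */
            digits[0] = '0';
            memmove(digits + 1, end, strlen(end) + 1);
            replaced++;
            p = digits + 1;
        } else {
            p = digits;
        }
    }
    return replaced;
}
```

Applied to every line of two extracted messages, this would turn both `...iamunique-1906170776_-_-` and `...iamunique-1627685354_-_-` into `...iamunique-0_-_-`, so the normalized copies can be compared or hashed. Note that it would also rewrite a boundary string quoted inside a message body, which should be harmless for duplicate detection.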

Thanks in advance, Franta Hanzlík

pabs3 commented 1 year ago

I think the default behaviour should be deterministic output. I also doubt there is a use case for non-deterministic output, so we should not need to keep the current behaviour at all.

The non-determinism you mention is just a random integer from rand().

In addition to the random boundary name, there is also a random filename for mails with appointments converted to calendar files.

I welcome patches for both these issues and any others that you can find. Please add more comments if you notice other issues, and submit merge requests for any fixes you make. If you aren't able to work on fixes, then I will work on them when I find some time.

I quote the code for the two issues mentioned above:

src/readpst.c-1728-    // create our MIME boundaries here.
src/readpst.c:1729:    snprintf(boundary, sizeof(boundary), "--boundary-LibPST-iamunique-%i_-_-", rand());
src/readpst.c-1730-    snprintf(altboundary, sizeof(altboundary), "alt-%s", boundary);
--
src/readpst.c-1664-    // attachment appointment request
src/readpst.c:1665:    snprintf(fname, sizeof(fname), "i%i.ics", rand());
src/readpst.c-1666-    fprintf(f_output, "\n--%s\n", boundary);
src/readpst.c-1667-    fprintf(f_output, "Content-Type: %s; charset=\"%s\"; name=\"%s\"\n", "text/calendar", "utf-8", fname);
src/readpst.c-1668-    fprintf(f_output, "Content-Disposition: attachment; filename=\"%s\"\n\n", fname);
src/readpst.c-1669-    write_schedule_part_data(f_output, item, sender, method);
src/readpst.c-1670-    fprintf(f_output, "\n");
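One way the two `rand()` calls quoted above could be made deterministic is to derive the number from something stable about the message. The following is only a sketch under stated assumptions, not the change that went into libpst. The function names and the `stable_key` parameter are hypothetical, and the key is assumed to be some string that is identical for identical mails (for example a Message-ID header). The 32-bit FNV-1a hash is swapped in here purely for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 32-bit FNV-1a hash: same input string, same output, on every run. */
static uint32_t fnv1a32(const char *s)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    size_t n = strlen(s);

    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/*
 * Hypothetical replacement for the rand()-based boundary: keeps the
 * existing "--boundary-LibPST-iamunique-%i_-_-" layout, but fills in a
 * hash of stable_key (an assumed per-message stable string).
 * Returns what snprintf returns.
 */
int make_boundary(char *out, size_t outlen, const char *stable_key)
{
    assert(out != NULL && stable_key != NULL);
    return snprintf(out, outlen, "--boundary-LibPST-iamunique-%" PRIu32 "_-_-",
                    fnv1a32(stable_key));
}

/* Same idea for the "i%i.ics" calendar attachment name. */
int make_ics_name(char *out, size_t outlen, const char *stable_key)
{
    assert(out != NULL && stable_key != NULL);
    return snprintf(out, outlen, "i%" PRIu32 ".ics", fnv1a32(stable_key));
}
```

With this approach, two extractions of the same mail would get the same boundary and the same .ics filename as long as the chosen key is the same in both PST files. Which field is actually stable enough to serve as the key is left open here.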

-- bye, pabs

https://bonedaddy.net/pabs3/

fhanzlik commented 1 year ago

Hi Paul, I can help with some testing, maybe even with scripting or creating an RPM package, or by contributing some money to support the project (I didn't find out how to do that here), but programming will probably be beyond my abilities - I apologize. Franta Hanzlik

pabs3 commented 1 year ago

No need to apologise; in open source every contribution is useful, even feature requests, suggestions and user discussions.

As a freelance open source developer I am always on the lookout for opportunities. Please send me an email to discuss the specifics.

pabs3 commented 1 year ago

@fhanzlik this feature has been implemented in git. Could you test it? It works for me, but I'd like a second set of eyes and data before closing the issue.