uchicago-library / attachment-converter

Attachment Converter: tool for batch converting attachments in an email mailbox
GNU General Public License v2.0

Speedup: MBOX mode should do some automatic parallelization #73

Open bufordrat opened 1 year ago

bufordrat commented 1 year ago

The problem

Currently, Attachment Converter takes a long time to process an MBOX. For example, here is an 8.8 MB MBOX file:

> ls -lah example.mbox 
Permissions Size User   Date Modified  Name
.rw-r--r--  8.8M me     5 May 15:58    example.mbox

It contains a pretty decent number of attachments in the formats we are looking for:

> attc -r example.mbox 
Content Types:
  application/msword : 24
  application/octet-stream : 3
  application/pdf : 10
  application/rtf : 32
  image/jpeg : 7
  message/rfc822 : 29
  multipart/alternative : 1
  multipart/mixed : 315
  text/plain : 311

And it takes roughly 74 minutes to process:

> time attc < example.mbox > example_converted.mbox 2> example_errors.mbox

________________________________________________________
Executed in   73.57 mins    fish           external
   usr time   54.65 secs    0.00 millis   54.65 secs
   sys time   11.61 secs    1.34 millis   11.61 secs

We will undoubtedly learn more after profiling the code, but a quick look at the same command run with progress bar output suggests that each conversion to PDF/A takes a long time. This is likely because we currently use LibreOffice to do the following conversions:

It will probably be a good idea at some point to explore utilities that can perform these conversions faster, since LibreOffice has introduced other inconveniences as well (needing to create a profile in order to be run on the command line, requiring a running X session when run on Linux, etc.). However, we will postpone that to a future issue and focus here on parallelizing the code.

Possible approaches

This part of the issue will get fleshed out once we learn more about the joys of parallel code in the modern era. In the meantime, here are three starting points to look into. The parmap package is set up to do CPU-bound list map calculations in parallel; it predates multicore OCaml. The parany package has been re-implemented using domainslib following the release of multicore OCaml. Finally, there is domainslib itself, which is lower-level than either of the previous libraries but does expose a "parallel for loop" function that may be adaptable to our use case.
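
To make the shape of the idea concrete, here is a minimal sketch of a parallel list map. It uses only `Domain.spawn` and `Domain.join` from the OCaml 5 standard library rather than any of the three packages above. The names `convert`, `chunks`, and `parallel_map` are hypothetical and do not come from the Attachment Converter code base, and `convert` is a trivial placeholder for a real conversion. Since the real conversions shell out to an external program, whether domains or separate processes are the better fit is still an open question.

```ocaml
(* Hypothetical stand-in for one attachment conversion. The real converter
   would call an external tool; this placeholder just upcases the name. *)
let convert (name : string) : string = String.uppercase_ascii name

(* Split [xs] into at most [n] contiguous chunks, preserving order. *)
let chunks n xs =
  let n = max 1 n in
  let len = List.length xs in
  let size = max 1 ((len + n - 1) / n) in
  let rec go acc cur k = function
    | [] ->
      (match cur with
       | [] -> List.rev acc
       | _ -> List.rev (List.rev cur :: acc))
    | x :: rest ->
      if k = size then go (List.rev cur :: acc) [x] 1 rest
      else go acc (x :: cur) (k + 1) rest
  in
  go [] [] 0 xs

(* Map [f] over [xs], running one domain per chunk.
   Results come back in the same order as the input. *)
let parallel_map ~num_domains f xs =
  chunks num_domains xs
  |> List.map (fun chunk -> Domain.spawn (fun () -> List.map f chunk))
  |> List.concat_map Domain.join

let () =
  parallel_map ~num_domains:4 convert ["a.doc"; "b.rtf"; "c.doc"]
  |> List.iter print_endline
```

Chunking keeps the number of spawned domains bounded by `num_domains` instead of spawning one domain per attachment, and joining the domains in spawn order is what preserves the original attachment order in the output.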