The problem
Currently, Attachment Converter takes a long time to process an MBOX. For example, here is an 8 MB MBOX file:
It contains a pretty decent number of attachments in the formats we are looking for:
And it takes 74 minutes to process:
> time attc < example.mbox > example_converted.mbox 2> example_errors.mbox
________________________________________________________
Executed in   73.57 mins    fish           external
   usr time   54.65 secs    0.00 millis   54.65 secs
   sys time   11.61 secs    1.34 millis   11.61 secs
We will undoubtedly learn more after profiling the code, but a quick look at the same command run with progress bar output shows that each conversion to PDF-A takes a long time. This is likely because we currently use LibreOffice for the following conversions:
PDF >> PDF-A
DOC >> PDF-A
DOCX >> PDF-A
It will probably be a good idea at some point to explore utilities that can perform these conversions faster, since LibreOffice has introduced other inconveniences as well (needing to create a profile in order to run on the command line, requiring a running X session on Linux, etc.). However, we will postpone that to a future issue and focus here on parallelizing the code.
Possible approaches
This part of the issue will theoretically get fleshed out once we learn more about the joys of parallel code in the modern era. However, here are three starting points to look into. The parmap package is set up to do CPU-bound list map calculations in parallel. It predates multicore OCaml. The parany package, following the release of multicore OCaml, has been re-implemented using domainslib. Finally, there is domainslib itself, which is lower-level than either of the previous libraries, but which does expose a "parallel for loop" function that may be adaptable to our use case.
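To make the domainslib option a bit more concrete, here is a rough sketch of what a "parallel for loop" over attachments might look like. It assumes OCaml 5 and the domainslib 0.5.x API (`Task.setup_pool`, `Task.run`, `Task.parallel_for`, `Task.teardown_pool`). The names `convert_attachment` and `convert_all` are hypothetical and are not part of Attachment Converter; the real conversion step would shell out to LibreOffice rather than uppercase a string.

```ocaml
(* Sketch only: assumes OCaml 5 and the domainslib package (0.5.x API). *)
module T = Domainslib.Task

(* Hypothetical stand-in for converting one attachment.
   The real function would call out to the conversion utility. *)
let convert_attachment (data : string) : string =
  String.uppercase_ascii data

(* Convert every element of [inputs] in parallel, preserving order.
   [num_domains] is the number of extra worker domains to spawn. *)
let convert_all ~num_domains (inputs : string array) : string array =
  let n = Array.length inputs in
  let outputs = Array.make n "" in
  let pool = T.setup_pool ~num_domains () in
  Fun.protect
    ~finally:(fun () -> T.teardown_pool pool)
    (fun () ->
      T.run pool (fun () ->
        (* [finish] is inclusive; each index is written by exactly one task. *)
        T.parallel_for pool ~start:0 ~finish:(n - 1)
          ~body:(fun i -> outputs.(i) <- convert_attachment inputs.(i)));
      outputs)

let () =
  let results = convert_all ~num_domains:3 [| "a.doc"; "b.docx"; "c.pdf" |] in
  Array.iter print_endline results
```

Writing results into a preallocated array by index keeps the output in the same order as the input, which matters if converted attachments have to be stitched back into the right messages. Whether this actually helps will depend on what profiling shows, since the slow step appears to be the external conversion rather than OCaml code.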