uchicago-library / attachment-converter

Attachment Converter: tool for batch converting attachments in an email mailbox
GNU General Public License v2.0

Research file format conversion utilities #10

Closed by bufordrat 2 years ago

bufordrat commented 2 years ago

The goal of this issue is to write up a quick, first-stab reference document that lets us run whichever file format conversions we want to use for testing.

Here's the current list of format conversions we're planning to test with during development:

| Source | Target   |
| ------ | -------- |
| .pdf   | PDF-A 1b |
| .pdf   | plaintext |
| .doc   | PDF-A 1b |
| .doc   | plaintext |
| .docx  | PDF-A 1b |
| .docx  | plaintext |
| .xls   | TSV      |
| .xlsx  | TSV      |
| .gif   | TIFF     |
| .bmp   | TIFF     |
| .jpg   | TIFF     |

What we want, for each conversion in this table, is a command you can run at the UNIX shell to perform it. It doesn't need to have variables, or abstract command line options away, or anything like that. Just a concrete command you can type in, using filenames like input.doc or output.pdf. We will determine how to handle command line options and different assumptions about input/output in a later issue.

Quick example. To convert a .doc to a PDF, you can use LibreOffice on the command line:

$ soffice --headless --convert-to pdf:writer_pdf_Export --outdir . input.doc

So in other words, the goal of this issue is to compile a list of example shell commands like this, together with information about what software packages must be installed to run them. We can put it in the root directory of the project for now, and make it either Org or Markdown. (Follow your bliss!) Each section of your document can just indicate:

Some pointers on utilities

LibreOffice

In addition to the above example, which converts a .doc to a PDF, I believe LibreOffice can be used similarly to convert a .doc to plaintext, and also to convert an .xls to PDF or plaintext.
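A minimal sketch of the two LibreOffice conversions mentioned above, following the same pattern as the .doc-to-PDF example. The function names `doc_to_txt` and `xls_to_pdf` are hypothetical wrappers added here for illustration, and the filter names `txt:Text` and `pdf:calc_pdf_Export` are assumptions that should be checked against the installed LibreOffice version. The line inside each function is the concrete command you would type at the shell.

```shell
# Hypothetical helper: .doc -> plaintext.
# LibreOffice writes <basename>.txt into the directory given by --outdir.
doc_to_txt() {
    soffice --headless --convert-to txt:Text --outdir . "$1"
}

# Hypothetical helper: .xls -> PDF via the Calc PDF export filter (assumed name).
xls_to_pdf() {
    soffice --headless --convert-to pdf:calc_pdf_Export --outdir . "$1"
}

# Concrete calls; each is skipped when soffice or the sample file is missing.
if command -v soffice >/dev/null 2>&1 && [ -f input.doc ]; then
    doc_to_txt input.doc
fi
if command -v soffice >/dev/null 2>&1 && [ -f input.xls ]; then
    xls_to_pdf input.xls
fi
```

Note that LibreOffice chooses the output filename itself (input.txt, input.pdf), so there is no separate output argument here.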

pdf2archive

The utility pdf2archive can be used to convert a plain vanilla PDF to an archival PDF-A 1b. (In fact, that utility is just a shell script that handles the labyrinthine command line options that Ghostscript requires to produce a PDF that meets the demanding PDF-A 1b spec.)
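As a rough sketch of what the pdf2archive step might look like: the wrapper name `pdf_to_pdfa` is hypothetical, and the argument order (input path first, then output path) is an assumption that should be confirmed against the script's own usage message before relying on it. Ghostscript would also need to be installed, since pdf2archive calls it under the hood.

```shell
# Hypothetical helper: plain PDF -> PDF-A 1b.
# Assumption: pdf2archive takes the input path followed by the output path.
pdf_to_pdfa() {
    pdf2archive "$1" "$2"
}

# Concrete call; skipped when pdf2archive or the sample file is missing.
if command -v pdf2archive >/dev/null 2>&1 && [ -f input.pdf ]; then
    pdf_to_pdfa input.pdf output.pdf
fi
```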

pandoc

I believe pandoc (written by philosopher of language and Haskeller extraordinaire John MacFarlane) can be used to convert .docx to either PDF or plaintext. See if you can figure out how using some combination of the documentation, Stack Overflow, and anything else that could be useful.

https://pandoc.org/

If I recall correctly, pandoc can also convert .xls files to TSV format.
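For the .docx conversions, something along these lines should work. The wrapper names `docx_to_txt` and `docx_to_pdf` are hypothetical; the pandoc options themselves (`--from`, `--to`, `--output`) are standard. The PDF route assumes a PDF engine such as pdflatex is installed alongside pandoc, which is a sizeable extra dependency worth noting in the write-up.

```shell
# Hypothetical helper: .docx -> plaintext using pandoc's "plain" writer.
docx_to_txt() {
    pandoc --from docx --to plain --output "$2" "$1"
}

# Hypothetical helper: .docx -> PDF. pandoc infers PDF output from the .pdf
# extension and hands off to a PDF engine (pdflatex unless told otherwise).
docx_to_pdf() {
    pandoc --from docx --output "$2" "$1"
}

# Concrete calls; skipped when pandoc or the sample file is missing.
if command -v pandoc >/dev/null 2>&1 && [ -f input.docx ]; then
    docx_to_txt input.docx output.txt
    docx_to_pdf input.docx output.pdf
fi
```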

Image conversion utilities

ImageMagick and vips can, I believe, be used to convert images in most standard formats to TIFF. I haven't looked at them in depth, so please feel free to experiment with other utilities too.

https://imagemagick.org/
https://www.libvips.org/
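A minimal sketch of the image conversions with each tool. The wrapper names `gif_to_tiff` and `jpg_to_tiff` are hypothetical. Both tools pick the output format from the file extension. Which input formats vips can read depends on how it was built, so .gif and .bmp support there is an assumption to verify.

```shell
# Hypothetical helper: .gif -> TIFF with ImageMagick.
# On ImageMagick 7 the binary is `magick` rather than `convert`.
gif_to_tiff() {
    convert "$1" "$2"
}

# Hypothetical helper: .jpg -> TIFF with vips; `copy` re-saves the image
# in the format implied by the output extension.
jpg_to_tiff() {
    vips copy "$1" "$2"
}

# Concrete calls; each is skipped when the tool or the sample file is missing.
if command -v convert >/dev/null 2>&1 && [ -f input.gif ]; then
    gif_to_tiff input.gif output.tiff
fi
if command -v vips >/dev/null 2>&1 && [ -f input.jpg ]; then
    jpg_to_tiff input.jpg output.tiff
fi
```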

Please feel free to pose any questions you may have on our Slack channel! Hopefully the above is enough to get started.