Research file format conversion utilities
The goal of this issue is to write up a quick first-pass reference document that lets us run whatever file format conversions we want to use for testing.
Here's the current list of format conversions we're planning to test with during development:
| Source | Target    |
|--------|-----------|
| .pdf   | PDF/A-1b  |
| .pdf   | plaintext |
| .doc   | PDF/A-1b  |
| .doc   | plaintext |
| .docx  | PDF/A-1b  |
| .docx  | plaintext |
| .xls   | TSV       |
| .xlsx  | TSV       |
| .gif   | TIFF      |
| .bmp   | TIFF      |
| .jpg   | TIFF      |
What we want, for each of the conversions in this table, is a command you can run at the UNIX shell to perform it. It doesn't need to have variables, or abstract command line options away, or anything like that. Just a concrete command you can type in, using filenames like input.doc or output.pdf. We will determine how to handle command line options and different assumptions about input/output in a later issue.
Quick example. To convert a .doc to a PDF, you can use LibreOffice on the command line:
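A minimal sketch of such an invocation, assuming LibreOffice is installed and its `soffice` binary is on the PATH (the package name `libreoffice` is an assumption for Debian/Ubuntu systems):

```shell
# Package (assumed name on Debian/Ubuntu): libreoffice
# Runs LibreOffice without a GUI and writes input.pdf to the current directory.
soffice --headless --convert-to pdf input.doc
```

The output name is derived from the input name, so this produces input.pdf rather than output.pdf.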
So in other words, the goal of this issue is to compile a list of example shell commands like this, together with information about what software packages must be installed to run them. We can put it in the root directory of the project for now, and make it either Org or Markdown. (Follow your bliss!) Each section of your document can just indicate:
- the name of the package to install
- the command to run to perform the conversion
Some pointers on utilities
LibreOffice
In addition to the above example, which converts a .doc to a PDF, I believe LibreOffice can be used similarly to convert a .doc to plaintext, and also to convert an .xls to PDF or plaintext.
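The same pattern might look like this for the other LibreOffice conversions mentioned above. This is a sketch, assuming `soffice` is on the PATH; the `txt:Text` filter name is the plain-text export filter as I understand it, so verify it against your installed version.

```shell
# .doc -> plaintext (writes input.txt to the current directory)
soffice --headless --convert-to txt:Text input.doc

# .xls -> PDF (writes input.pdf to the current directory)
soffice --headless --convert-to pdf input.xls
```

Adding `--outdir some/dir` sends the results to another directory instead of the current one.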
pdf2archive
The utility pdf2archive can be used to convert a plain vanilla PDF to an archival PDF/A-1b. (In fact, that utility is just a shell script that handles the labyrinthine command line options that Ghostscript requires to produce a PDF that meets the demanding PDF/A-1b spec.)
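A hedged sketch of the basic call, assuming the pdf2archive script has been downloaded, made executable, and put on the PATH, and that Ghostscript is installed. The argument order (input first, then output) is how I understand the script's usage, so check its help text before relying on it.

```shell
# Requires: Ghostscript (the package name `ghostscript` is assumed), plus the pdf2archive script itself.
# Reads input.pdf and writes an archival copy to output.pdf.
pdf2archive input.pdf output.pdf
```

Since the other tools in this document produce ordinary PDFs, this step could be chained after them to cover the .doc and .docx rows of the table.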
pandoc
I believe pandoc (written by philosopher of language and Haskeller extraordinaire John MacFarlane) can be used to convert .docx to either PDF or plaintext. See if you can figure out how using some combination of the documentation (https://pandoc.org/), Stack Overflow, and anything else that could be useful.
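As a starting point, the two .docx conversions might look like the following. This is a sketch, assuming pandoc is installed (the package name `pandoc` is an assumption) and, for the PDF case, that a PDF engine such as pdflatex is available, since pandoc delegates PDF generation to one.

```shell
# .docx -> plaintext, using pandoc's "plain" output format
pandoc input.docx -t plain -o output.txt

# .docx -> PDF; the output format is inferred from the .pdf extension
# and the default engine is pdflatex unless --pdf-engine says otherwise
pandoc input.docx -o output.pdf
```

The resulting PDF is an ordinary PDF, not an archival one, so a further step would still be needed for the PDF/A target.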
If I recall correctly, pandoc can also convert .xls files to TSV format.
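Whether pandoc accepts .xls input is worth checking against its list of input formats first. As a fallback sketch that swaps in LibreOffice rather than pandoc, the CSV export filter can be asked for a tab separator. The filter options `9,34,76` (tab field separator, double-quote text delimiter, UTF-8) are my reading of LibreOffice's CSV filter parameters, and I believe only the first sheet is exported by default.

```shell
# Swapped-in tool: LibreOffice, not pandoc.
# Filter options (assumed): 9 = tab separator, 34 = double-quote text delimiter, 76 = UTF-8.
soffice --headless --convert-to 'csv:Text - txt - csv (StarCalc):9,34,76' input.xls

# LibreOffice names the result input.csv; rename it to reflect the tab-separated content.
mv input.csv output.tsv
```

The same command should work for input.xlsx by changing only the input filename.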
Image conversion utilities
ImageMagick (https://imagemagick.org/) and vips (https://www.libvips.org/) can, I believe, be used to convert images in most standard formats to TIFF. I haven't looked at them in depth, so please feel free to experiment with other utilities too.
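For the three image rows of the table, the commands could be as simple as the following. This is a sketch, assuming ImageMagick 7, where the entry point is `magick` (on ImageMagick 6 the equivalent command is `convert`), and assuming the package names `imagemagick` and `libvips-tools`.

```shell
# ImageMagick: the output format is chosen from the file extension
magick input.gif output.tiff
magick input.bmp output.tiff
magick input.jpg output.tiff

# libvips alternative: "copy" re-saves the image in the format implied by the extension
vips copy input.jpg output.tiff
```

An animated GIF has several frames, so it may be worth checking whether the TIFF should contain all of them or just the first. Which input formats vips can read depends on how it was built, so the .gif and .bmp cases are worth testing separately.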
Please feel free to pose any questions you may have on our Slack channel! Hopefully the above is enough to get started.