iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

WARC-Conversion-Software and WARC-Conversion-Command fields #52

Open ato opened 5 years ago

ato commented 5 years ago

When converting content in an archive it is useful for diagnostic purposes to record the versions of major software components used and important conversion options. Another common use case is to identify records that later need to be reconverted with newer software in order to improve conversion quality or fix records misconverted due to a bug or incorrect option.

Background

The options field proposed for standardisation below is based on the "command" JSON field that @ikreymer uses in warcit. We would also like some way of recording this information for the Australian Web Archive.

WARC-Conversion-Software field

The WARC-Conversion-Software field indicates the version of software components used in the conversion of the record's content. The field value has the same format as an HTTP User-Agent field (see RFC 7231 section 5.5.3) and consists of a list of one or more product identifiers and zero or more comments.

WARC-Conversion-Software = product *( RWS ( product / comment ) )

product         = token [ "/" product-version ]
product-version = token
comment         = "(" *( ctext / quoted-pair / comment ) ")"

For example:

WARC-Conversion-Software: ImageMagick/6.9.9-38 (x86 linux)

Multiple product identifiers may be used to indicate the version of important subcomponents such as codec libraries used when encoding a video.

WARC-Conversion-Software: ffmpeg/4.0.3 libvpx/1.8.0 libopus/1.3

When product identifiers represent multiple steps in a processing pipeline they should be listed in processing order and otherwise in decreasing order of significance for identifying the software. For example a TIFF image decoded with an unknown version of libtiff and then re-encoded with libjpeg version 9c could be recorded as:

WARC-Conversion-Software: libtiff libjpeg/9c

Software components unimportant to the conversion process, such as other codecs that a video transcoder happens to support but did not use, should not be listed.

The WARC-Conversion-Software field may be used in ‘conversion’ type records and shall not be used for other record types.
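
As a rough illustration of the grammar above (a sketch, not part of the proposal), a writer could assemble the field value from (name, version) pairs and reject anything that is not a valid token. The function name `format_conversion_software` and its parameters are hypothetical; the token character set is the one defined for HTTP tokens in RFC 7230.

```python
import re

# HTTP "token": one or more tchar characters (RFC 7230 section 3.2.6).
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def format_conversion_software(products, comment=None):
    """Build a WARC-Conversion-Software value from (name, version) pairs.

    `version` may be None when the version is unknown. Products are emitted
    in the order given, so callers should pass them in processing order.
    """
    if not products:
        raise ValueError("at least one product identifier is required")
    parts = []
    for name, version in products:
        for piece in (name, version):
            if piece is not None and not TOKEN.fullmatch(piece):
                raise ValueError(f"not a valid token: {piece!r}")
        parts.append(name if version is None else f"{name}/{version}")
    if comment is not None:
        # Parentheses and backslashes inside a comment are escaped as quoted-pairs.
        escaped = re.sub(r"([()\\])", r"\\\1", comment)
        parts.append(f"({escaped})")
    return " ".join(parts)


print(format_conversion_software([("libtiff", None), ("libjpeg", "9c")]))
# prints: libtiff libjpeg/9c
```

A name such as "Microsoft Excel" contains a space and is rejected, so the caller has to choose a suitable identifier first.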

WARC-Conversion-Command field

The WARC-Conversion-Command field records command-line options used when converting the content.

WARC-Conversion-Command = *TEXT

When the conversion software is configured through command-line options a full command-line should be included with the tokens {input} and {output} representing the input and output file respectively.

WARC-Conversion-Command: ffmpeg -y -i {input} -c:v vp9 -c:a libopus -speed 4 {output}

A conversion involving multiple steps may be indicated using a shell pipeline:

WARC-Conversion-Command: bzip2 -d | gzip -9

or multiple sequential commands separated by semi-colons:

WARC-Conversion-Command: ddjvu -format=tiff {input} tmp.tif; convert tmp.tif tmp.png;
                         pngcrush tmp.png {output}

The WARC-Conversion-Command field may be used in ‘conversion’ type records and shall not be used for other record types.
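
One way a tool might produce such a value (a minimal sketch under the assumption that the tool knows the argument list it executed and the real input and output paths) is to swap those paths for the {input} and {output} tokens and shell-quote everything else. The function name `render_conversion_command` is hypothetical.

```python
import shlex


def render_conversion_command(argv, input_path, output_path):
    """Turn the argv actually executed into a WARC-Conversion-Command value.

    Arguments equal to the real input or output path are replaced by the
    {input} and {output} tokens; all other arguments are shell-quoted
    where necessary so the result can be read back as a command line.
    """
    words = []
    for arg in argv:
        if arg == input_path:
            words.append("{input}")
        elif arg == output_path:
            words.append("{output}")
        else:
            words.append(shlex.quote(arg))
    return " ".join(words)


argv = ["ffmpeg", "-y", "-i", "/tmp/in.mp4", "-c:v", "vp9",
        "-c:a", "libopus", "-speed", "4", "/tmp/out.webm"]
print(render_conversion_command(argv, "/tmp/in.mp4", "/tmp/out.webm"))
# prints: ffmpeg -y -i {input} -c:v vp9 -c:a libopus -speed 4 {output}
```

Replacing the temporary paths keeps the value identical across records converted the same way, which is what makes it usable for locating them later.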

Alternative approaches

Separate metadata records

An obvious candidate would be a separate metadata record in some other specific-purpose format like PREMIS XML. For the Australian Web Archive we'd like to use these conversion fields not so much for highly detailed provenance records or the ability to exactly reproduce a conversion but rather as a human-readable diagnostic and to quickly locate records that were converted in a particular way. While using a separate metadata record allows for more detail and flexibility, it makes this identification-at-a-glance use case much harder.

WARC-JSON-Metadata

@ikreymer also suggested the command field could be included as a property on a JSON object stored in the WARC-JSON-Metadata field proposed in #27. If we were to go back in time and redesign WARC from scratch I think making WARC headers JSON-based rather than HTTP/1 based could be quite compelling. I would argue though that unless there's a serious show-stopper, new standard fields should work within the existing framework and be added as new top-level header fields. I think it's fine for individual tools to use something like WARC-JSON-Metadata to store implementation-specific data in their own native format though.

ikreymer commented 5 years ago

Thanks for starting this!

WARC-Conversion-Software is good, though perhaps should specify that it should be same format as software field in warcinfo records, or do you think the more formal spec makes sense?

For WARC-Conversion-Options, I'm not sure it's a good idea to combine both a CLI string and JSON field into the header, as it'd make parsing harder. I can see two good options:

1) Always use JSON for WARC-Conversion-Options. If specifying a command line, it should be specified as WARC-Conversion-Options: {"command": "ffmpeg -y -i {input} -c:v vp9 -c:a libopus -speed 4 {output}"}

2) Add a separate header for a command, e.g. WARC-Conversion-Command: ffmpeg -y -i {input} -c:v vp9 -c:a libopus -speed 4 {output} is always a command line, while WARC-Conversion-Options is a generic JSON field with other options. Perhaps only one or the other should be used, although it's always good to leave room for additional options.

I guess I'd lean towards 1) as it seems more flexible overall, but 2) also makes sense, especially if most of the conversions are a single command. But then a question arises of how extensive the command can be. What if a conversion requires a custom command-line script or multiple programs?

To what degree should the conversion process be specified if it requires more than one command, and, getting into reproducibility a bit, should someone with the original record and information in the record be able to reproduce the same conversion? Probably a 'nice to have' but not something required, I would imagine.

ato commented 5 years ago

perhaps should specify that it should be same format as software field in warcinfo records, or do you think the more formal spec makes sense?

Unfortunately the format of the warcinfo software field is not specified, except by the example 'heritrix/1.12.0'. That example might be intended to indicate a format similar to User-Agent or it might not. Or perhaps that's what you meant, that the structure should be entirely unspecified?

It could well be I'm overthinking it but I think it's helpful to specify some limited structure to enable applications like searching for records that a particular tool was involved in creating. I can see us, for example, tokenising it into a Solr field we can use to pull back a list of everything created with libvorbis (video or audio). It does mean more effort for writers though, and one might need to put some effort into mapping "Microsoft Excel 2017" into a suitable identifier.
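
To make the search use case concrete, here is a small sketch of that kind of tokenising. It is plain Python rather than Solr, the names `products` and `index_by_product` are hypothetical, and nested comments are not handled. It splits a field value into product names and versions, skips comments, and builds a lookup from product name to record IDs.

```python
import re
from collections import defaultdict

# Either a parenthesised comment (skipped) or a product: a name optionally
# followed by "/" and a version. Nested comments are not handled here.
PRODUCT_OR_COMMENT = re.compile(
    r"\((?:[^()\\]|\\.)*\)"
    r"|([^\s/()]+)(?:/([^\s()]+))?"
)


def products(field_value):
    """Yield (lowercased name, version or None) pairs from a field value."""
    for match in PRODUCT_OR_COMMENT.finditer(field_value):
        if match.group(1):
            yield match.group(1).lower(), match.group(2)


def index_by_product(records):
    """Map each product name to the set of record IDs that mention it."""
    index = defaultdict(set)
    for record_id, field_value in records:
        for name, _version in products(field_value):
            index[name].add(record_id)
    return index


records = [
    ("urn:example:a", "ffmpeg/4.0.3 libvpx/1.8.0 libopus/1.3"),
    ("urn:example:b", "ffmpeg/4.0.3 libvorbis/1.3.6"),
    ("urn:example:c", "ImageMagick/6.9.9-38 (x86 linux)"),
]
print(sorted(index_by_product(records)["libvorbis"]))
# prints: ['urn:example:b']
```

The record IDs and the libvorbis version are made up for the example. Lowercasing the names is a design choice of this sketch so that a search for "imagemagick" also finds "ImageMagick".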

I'm not sure it's a good idea to combine both a CLI string and JSON field into the header, as it'd make parsing harder

This is a fair criticism. I had the same concern myself but wasn't entirely comfortable with any of the solutions I could think of. That might well be a sign I need to revisit the premise of what I was trying to do.

While JSON and CLI options are two ways conversion software could be configured, there could be more. On the other hand, one of the biggest complaints I have about WARC (and many other library standards) is that they under-specify things and don't give enough guidance. I'd frankly rather not have a field standardised at all than have it specified so flexibly it's effectively useless. So maybe I'm being overly hasty in wanting to generalize 'command' to 'options' in the first place, and we should start just with WARC-Conversion-Command and worry about more general options once we actually have some more concrete examples.

To what degree should the conversion process be specified if it requires more than one command, and, getting into reproducibility a bit, should someone with the original record and information in the record be able to reproduce the same conversion? Probably a 'nice to have' but not something required, I would imagine.

I agree. I don't think full reproducibility should be a goal for these headers. I think one or two commands or a one-liner pipeline is a helpful indicator but if it gets to the point of a 20+ line script containing a bunch of branches then that should be stored elsewhere and just referenced rather than included in full in every record header. That's what I was trying to get at with this paragraph:

If the conversion options are not representable in a short text form suitable for including in a header field they may be recorded separately in one or more ‘metadata’ records. In such cases the WARC-Conversion-Options field may still include a short textual summary of only the most important options for diagnostic purposes.

I think it should be a short set of options, one or two lines at most. I would frown upon the inclusion of several kilobytes of XML or JSON or a base64 encoded ICC profile. Some institutions may well want that level of detail but I think that's well beyond the scope of this proposal. I think we should keep this focused on diagnostics and being able to find records that were converted in a particular way. Chances are the institutions that do want that level of detail have some sort of specialised digital preservation system for recording it anyway.

ikreymer commented 5 years ago

Unfortunately the format of the warcinfo software field is not specified, except by the example 'heritrix/1.12.0'. That example might be intended to indicate a format similar to User-Agent or it might not. Or perhaps that's what you meant, that the structure should be entirely unspecified?

Yes, that's what I meant, that the structure should not be specified. Unfortunately, ffmpeg does not print things out in a clean format like that. Instead, ffmpeg -version prints out:

ffmpeg version 2.8.15-0ubuntu0.16.04.1 Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.10) 20160609
  configuration: --prefix=/usr --extra-version=0ubuntu0.16.04.1 --build-suffix=-ffmpeg --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --cc=cc --cxx=g++ --enable-gpl --enable-shared --disable-stripping --disable-decoder=libopenjpeg --disable-decoder=libschroedinger --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librtmp --enable-libschroedinger --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxvid --enable-libzvbi --enable-openal --enable-opengl --enable-x11grab --enable-libdc1394 --enable-libiec61883 --enable-libzmq --enable-frei0r --enable-libx264 --enable-libopencv
  libavutil      54. 31.100 / 54. 31.100
  libavcodec     56. 60.100 / 56. 60.100
  libavformat    56. 40.101 / 56. 40.101
  libavdevice    56.  4.100 / 56.  4.100
  libavfilter     5. 40.101 /  5. 40.101
  libavresample   2.  1.  0 /  2.  1.  0
  libswscale      3.  1.101 /  3.  1.101
  libswresample   1.  2.101 /  1.  2.101
  libpostproc    53.  3.100 / 53.  3.100

Now, it'd be easy to encode this in a JSON string, as JSON also provides a nice way to encode multiline strings. Converting it to a different format would take more effort. For practical reasons, you'd probably want to compare the output from ffmpeg -version to see if there are any differences that might yield different results, so storing the full version info is probably a good idea. Of course, this is longer than the average version string, but it's a very real example of a tool now being used. What if we had a semi-standardized JSON format, starting with something like:

WARC-Conversion-Command:  {"command": "...", "version": "..."}
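
A quick sketch of why JSON is convenient here: Python's standard `json.dumps` escapes newlines, so multi-line version output collapses into a single-line value that fits in a header. The function name `conversion_json` and the two keys are assumptions taken from the example line above, not an agreed format.

```python
import json


def conversion_json(command, version_output):
    """Encode a command and raw version output as one single-line JSON value."""
    # json.dumps escapes newlines as \n, so the result never spans lines;
    # with the default ensure_ascii=True, non-ASCII characters are escaped too.
    return json.dumps({"command": command, "version": version_output})


version_output = (
    "ffmpeg version 2.8.15-0ubuntu0.16.04.1\n"
    "  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.10) 20160609\n"
)
value = conversion_json("ffmpeg -y -i {input} {output}", version_output)

# The encoded value is a single line and decodes back to the original text.
assert "\n" not in value
assert json.loads(value)["version"] == version_output
```

The trade-off is size: the full ffmpeg output shown above would add a few kilobytes to every record header.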

Agree that perhaps additional use cases can help inform whether we need more options. Content-Type can already specify the MIME type, so that's taken care of.

I agree. I don't think full reproducibility should be a goal for these headers. I think one or two commands or a one-liner pipeline is a helpful indicator but if it gets to the point of a 20+ line script containing a bunch of branches then that should be stored elsewhere and just referenced rather than included in full in every record header. That's what I was trying to get at with this paragraph:

Ah right, I missed that paragraph somehow :) That all makes sense.

ikreymer commented 5 years ago

Though, I like the idea of one commandline field and one misc other info field. Likely, there will be some conversion command that can be expressed as a single line, but there may also be additional metadata about the conversion, such as the version, perhaps other properties.. With that in mind, perhaps it should be:

WARC-Conversion-Command: ffmpeg -y -i {input} -c:v vp9 -c:a libopus -speed 4 {output}
WARC-Conversion-Metadata: {"version": ...}

The WARC-Conversion-Command should always be present, and if it's some custom script it can just be WARC-Conversion-Command: super-complex-conversion.sh {input} {output}.

But the WARC-Conversion-Metadata is optional, and as you suggest, a metadata record can be used instead, especially if representing some large script. In that case, perhaps the script should just be stored in a separate metadata record:

WARC-Conversion-Command: super-complex-conversion.sh {input} {output}
WARC-Concurrent-To: <metadata-record>
WARC-Type: conversion
...

WARC-Target-URI: file:///super-complex-conversion.sh
WARC-Record-ID: <metadata-record>
WARC-Type: metadata
...
#!/bin/bash
...

This would make sense if the same conversion software/script is used for multiple conversions. It could be argued that the conversion software should always have its own metadata record, even for ffmpeg, and then that could contain the version text.

A metadata record for ffmpeg might then look as follows:

...
WARC-Target-URI: urn:software:ffmpeg:version
WARC-Type:  metadata
...
ffmpeg version 2.8.15-0ubuntu0.16.04.1 Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.10) 20160609
  configuration: --prefix=/usr --extra-version=0ubuntu0.16.04.1 --build-suffix=-ffmpeg --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --cc=cc --cxx=g++ --enable-gpl --enable-shared --disable-stripping --disable-decoder=libopenjpeg --disable-decoder=libschroedinger --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librtmp --enable-libschroedinger --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxvid --enable-libzvbi --enable-openal --enable-opengl --enable-x11grab --enable-libdc1394 --enable-libiec61883 --enable-libzmq --enable-frei0r --enable-libx264 --enable-libopencv
  libavutil      54. 31.100 / 54. 31.100
  libavcodec     56. 60.100 / 56. 60.100
  libavformat    56. 40.101 / 56. 40.101
  libavdevice    56.  4.100 / 56.  4.100
  libavfilter     5. 40.101 /  5. 40.101
  libavresample   2.  1.  0 /  2.  1.  0
  libswscale      3.  1.101 /  3.  1.101
  libswresample   1.  2.101 /  1.  2.101
  libpostproc    53.  3.100 / 53.  3.100

Of course, this is getting a bit more complicated.. Are there other fields besides software version that would be important to store at this point?

ato commented 5 years ago

I've updated the proposal replacing WARC-Conversion-Options with the more specific WARC-Conversion-Command field.

Unfortunately, ffmpeg does not print things out in a clean format like that

Yeah, that's one of several reasons why recording the raw output of --version is not a particularly good solution to the problem I'm trying to solve and I'd rather record something more structured:

If someone did want to store --version then I agree with you that a common record that can be referenced might be better. But I actually don't want to record --version at all. ;-)

Also note: I'm not proposing that every tool that writes conversion records has to record down to the detail of libopus. It's just what I'd like to do with our archive and if other people want to do that too it'd be good to have a standard format for recording it.
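
For tools that do want a structured identifier without maintaining it by hand, one possible approach is to derive it from the first line of the version output. This is a sketch only: the function name `product_from_version_output` is hypothetical, and the "<name> version <version>" pattern is an assumption based on the ffmpeg output quoted earlier in this thread that will not hold for every tool.

```python
import re

# Matches a first line such as "ffmpeg version 2.8.15-0ubuntu0.16.04.1 Copyright ...".
FIRST_LINE = re.compile(r"^(\S+) version (\S+)")


def product_from_version_output(output):
    """Return "name/version" from the first line of version output.

    Returns None when the first line does not look like
    "<name> version <version>".
    """
    lines = output.splitlines()
    first_line = lines[0] if lines else ""
    match = FIRST_LINE.match(first_line)
    if match is None:
        return None
    return f"{match.group(1)}/{match.group(2)}"


sample = (
    "ffmpeg version 2.8.15-0ubuntu0.16.04.1 Copyright (c) 2000-2018 the FFmpeg developers\n"
    "  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.10) 20160609\n"
)
print(product_from_version_output(sample))
# prints: ffmpeg/2.8.15-0ubuntu0.16.04.1
```

Subcomponent versions such as libopus would still need to come from elsewhere, since they do not appear on that first line.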

In the case of a generic tool like warcit that doesn't know anything in particular about the external conversion command, it would probably be best to make it optional and let the user supply it if they want to.

The WARC-Conversion-Command should always be present

While it would be commonly present, I don't think we should make it mandatory in case the conversion was created by something that's not a command-line tool or is not a simple case of one input, one output.

Are there other fields besides software version that would be important to store at this point?

Trying to think of metadata people might want to record about a conversion:

I don't have an immediate need for any of that though.

Although that does indicate that WARC-Conversion-Software may not be the best name for the version field. Maybe WARC-Converter-Version. That would leave open things like WARC-Converter-URI to reference an actual tarball of the conversion software.