ucsdlib / damsrepo

DAMS Repository

R&D: Replace Jhove with FITS and other tools to support more file types #67

Closed lsitu closed 6 years ago

lsitu commented 6 years ago

Descriptive summary

Propose a solution to replace Jhove with FITS and other tools to support more file types in general.

Rationale

We have used Jhove in DAMS to extract technical metadata for many years, but Jhove supports only a limited set of file formats. This ticket explores replacing Jhove with FITS and/or other tools to support file types more generally.

lsitu commented 6 years ago

@gamontoya I found that FITS can extract some quality-related metadata from video and audio files, so we may not need to run FFMPEG in those cases. Is it possible to assemble the quality value from the following fields of a .mov file:

  <metadata>
    <video>
      <location toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">/Users/lsitu/Documents/git/damsrepo/src/sample/files/video.mp4</location>
      <mimeType toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">video/quicktime</mimeType>
      <format toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">Quicktime</format>
      <formatProfile toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">Base Media</formatProfile>
      <duration toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">5039</duration>
      <bitRate toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">1249972</bitRate>
      <dateCreated toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">UTC 1904-01-01 00:00:00</dateCreated>
      <dateModified toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">UTC 2015-04-10 18:33:58</dateModified>
      <track type="video" id="1" toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">
        <videoDataEncoding>avc1</videoDataEncoding>
        <codecId>avc1</codecId>
        <codecCC>avc1</codecCC>
        <codecVersion>High@L3</codecVersion>
        <codecName>AVC</codecName>
        <codecFamily>H.264</codecFamily>
        <codecInfo>Advanced Video Codec</codecInfo>
        <compression>Unknown</compression>
        <byteOrder>Unknown</byteOrder>
        <bitDepth>8 bits</bitDepth>
        <bitRate>1114069</bitRate>
        <duration>5039</duration>
        <trackSize>701637</trackSize>
        <width>720 pixels</width>
        <height>480 pixels</height>
        <frameRate>29.970</frameRate>
        <frameRateMode>Constant</frameRateMode>
        <frameCount>151</frameCount>
        <aspectRatio>4:3</aspectRatio>
        <scanningFormat>Progressive</scanningFormat>
        <chromaSubsampling>4:2:0</chromaSubsampling>
        <colorspace>YUV</colorspace>
        <broadcastStandard>NTSC</broadcastStandard>
      </track>
      <track type="audio" id="2" toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">
        <audioDataEncoding>AAC</audioDataEncoding>
        <codecId>40</codecId>
        <codecFamily>AAC</codecFamily>
        <compression>Lossy</compression>
        <bitRate>128000</bitRate>
        <bitRateMode>Variable</bitRateMode>
        <duration>5035</duration>
        <trackSize>80011</trackSize>
        <soundField>Front: L R</soundField>
        <samplingRate>48000</samplingRate>
        <numSamples>241680</numSamples>
        <channels>2</channels>
      </track>
    </video>
  </metadata>
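
A minimal Python sketch of assembling a quality value from the FITS fields above. The output shape (`WIDTHxHEIGHT @ FPS fps`) and the helper names `local`, `find_text`, and `video_quality` are assumptions for illustration; the thread does not define what dams:quality should look like, and the damsrepo implementation may differ.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace such as '{uri}track' down to 'track'."""
    return tag.rsplit("}", 1)[-1]


def find_text(parent, name):
    """Return the stripped text of the first direct child with this local name, or None."""
    for child in parent:
        if local(child.tag) == name and child.text:
            return child.text.strip()
    return None


def video_quality(fits_xml):
    """Build a hypothetical quality string from the first video track, or return None."""
    root = ET.fromstring(fits_xml)
    for el in root.iter():
        if local(el.tag) == "track" and el.get("type") == "video":
            width = find_text(el, "width")
            height = find_text(el, "height")
            fps = find_text(el, "frameRate")
            if not (width and height):
                return None
            # FITS reports dimensions as e.g. "720 pixels"; keep only the number.
            quality = f"{width.split()[0]}x{height.split()[0]}"
            if fps:
                quality += f" @ {fps} fps"
            return quality
    return None


# Trimmed copy of the video track shown above.
SAMPLE = """<metadata><video>
  <track type="video" id="1">
    <width>720 pixels</width>
    <height>480 pixels</height>
    <frameRate>29.970</frameRate>
  </track>
</video></metadata>"""

print(video_quality(SAMPLE))  # 720x480 @ 29.970 fps
```

Returning `None` when there is no video track or no dimensions leaves room for a caller to fall back to FFMPEG.
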
lsitu commented 6 years ago

@gamontoya / @mcritchlow I see that FITS may report conflicting results from different tools. So far I have seen this in the <identification> and <fileinfo> sections. For example, we get the following <identification> result for an mp4 video:

<identification status="CONFLICT">
    <identity format="Quicktime" mimetype="video/quicktime" toolname="FITS" toolversion="build.version=1.3.0">
      <tool toolname="MediaInfo" toolversion="0.7.75" />
    </identity>
    <identity format="MPEG-4 Media File" mimetype="application/mp4, video/mp4" toolname="FITS" toolversion="build.version=1.3.0">
      <tool toolname="Droid" toolversion="6.3" />
      <externalIdentifier toolname="Droid" toolversion="6.3" type="puid">fmt/199</externalIdentifier>
    </identity>
    <identity format="ISO Media, MPEG v4 system, version 1" mimetype="video/mp4" toolname="FITS" toolversion="build.version=1.3.0">
      <tool toolname="file utility" toolversion="5.04" />
    </identity>
</identification>

We have the option to turn off conflict output and allow only a single result, but I see that doing so may lose output from other tools in sections like <metadata>. So I am thinking about allowing the conflict output but choosing the result we want by tool priority. For example, if there is a result from Jhove, we use it first. If there is no Jhove output, we choose from ExifTool, then file utility, in that order. If none of the outputs come from Jhove, ExifTool, or file utility, we just use one of the conflicting results. In the case above, the result from file utility would be used, which is mimetype="video/mp4" and format="ISO Media, MPEG v4 system, version 1". Do you have any preferences or thoughts on handling the conflicting results? Does the strategy above sound good to start with?
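
The priority rule described above could be sketched roughly as follows in Python. This is an illustration only, not the damsrepo code: `TOOL_PRIORITY`, `local`, and `choose_identity` are hypothetical names, tool names are compared case-insensitively because the exact spelling in the `toolname` attribute is assumed, and real FITS output may carry an XML namespace (handled here by matching local names).

```python
import xml.etree.ElementTree as ET

# Preferred tools, highest priority first (assumed spellings, compared case-insensitively).
TOOL_PRIORITY = ["Jhove", "Exiftool", "file utility"]


def local(tag):
    """Strip an XML namespace such as '{uri}identity' down to 'identity'."""
    return tag.rsplit("}", 1)[-1]


def choose_identity(identification_xml, priority=TOOL_PRIORITY):
    """Return (format, mimetype) from the highest-priority tool, else from the first identity."""
    root = ET.fromstring(identification_xml)
    identities = [el for el in root.iter() if local(el.tag) == "identity"]
    if not identities:
        return None

    def tools(identity):
        # Lower-cased names of the tools that reported this identity.
        return {t.get("toolname", "").lower() for t in identity if local(t.tag) == "tool"}

    for name in priority:
        for identity in identities:
            if name.lower() in tools(identity):
                return identity.get("format"), identity.get("mimetype")

    # No preferred tool reported anything: fall back to the first result.
    first = identities[0]
    return first.get("format"), first.get("mimetype")
```

Applied to the `<identification>` block above, this would select the file utility entry, matching the outcome described in the comment.
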

gamontoya commented 6 years ago

@lsitu I'm not sure we would want to turn off CONFLICT, and I like what you're proposing. However, let me ask @arwenhutt for her opinion.

arwenhutt commented 6 years ago

@gamontoya @lsitu I don't really understand what a "conflict output" is or what the implications are. I think Tori Manches (she might need to be invited to GitHub) and @stelnabli know more about the tools and technical info - so would probably understand better.

lsitu commented 6 years ago

@arwenhutt My understanding of conflict output here is that the tools FITS uses to extract technical metadata give different results for a field like mimetype. In the example above, we got three mimetype results for the mp4 video tested:

  - MediaInfo: video/quicktime
  - Droid: application/mp4, video/mp4
  - file utility: video/mp4

So I think we may need to decide which result we prefer to use in dams4 when there are multiple results extracted by FITS.
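
To make a conflict like this visible before deciding which value to keep, one could list the mimetype each tool reported. A small Python sketch, assuming the `<identification>` structure shown earlier in this thread; `local` and `conflict_report` are hypothetical helper names, not part of FITS or damsrepo.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace such as '{uri}tool' down to 'tool'."""
    return tag.rsplit("}", 1)[-1]


def conflict_report(fits_xml):
    """Return (is_conflict, {toolname: mimetype}) for the first <identification> element."""
    root = ET.fromstring(fits_xml)
    for el in root.iter():
        if local(el.tag) != "identification":
            continue
        values = {}
        for identity in el:
            if local(identity.tag) != "identity":
                continue
            for tool in identity:
                # Only <tool> children name the reporting tool; skip <externalIdentifier>.
                if local(tool.tag) == "tool":
                    values[tool.get("toolname")] = identity.get("mimetype")
        return el.get("status") == "CONFLICT", values
    return False, {}
```

A report like this could be logged during test ingests to see how often the tools actually disagree for each file format.
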

stelnabli commented 6 years ago

@arwenhutt @gamontoya @lsitu Arwen, that's correct. The tools and the format tree can be customized in the config file to avoid conflict reports: https://projects.iq.harvard.edu/fits/fits-configuration-files

Related FITS/techMD research: https://github.com/ucsdlib/dams-metadata/issues/69#issuecomment-419588278 & https://goo.gl/iuJGD3

lsitu commented 6 years ago

Thanks @stelnabli. I've tested the single-result setting and even tried reordering the conflict outputs. I see that configuring FITS to use a single result may lose information that other tools could extract. I think the question goes back to my original comment https://github.com/ucsdlib/damsrepo/issues/67#issuecomment-426018077. Do we want a single result that may come from any tool in the set, to avoid the conflict? Or do we want to selectively choose the result from the tool set when there are conflicting results?

stelnabli commented 6 years ago

@lsitu Like you, I'm cautious about losing information that other tools can extract, since one tool may be better than another for a particular file format. So I think caution supports selectively choosing the result rather than having a single result that may lead to loss of information. I don't know of practical examples from others' experience using FITS; it would be good to find out the pros and cons of each approach, since our DAMS will conceivably get all manner of formats (especially as born-digital preservation ramps up).

arwenhutt commented 6 years ago

happily defers to those who know more about technical metadata

tmaches commented 6 years ago

I like the idea of selectively choosing the result, too, since, like @stelnabli said, one tool might work better than another depending on the format. I'd also like to know about others' experiences with FITS, since more of my experience has been in trying to implement it, and less in how things work long-term after implementation.

lsitu commented 6 years ago

@vmaches Thank you very much for your input.

lsitu commented 6 years ago

@gamontoya / @mcritchlow I've implemented the solution that replaces Jhove with FITS and other tools like FFMPEG. It supports many more file types than Jhove, with the following questions still open:

  1. How are we going to handle conflicting results from FITS? Currently the implementation picks the result from Jhove, ExifTool, or File Utility, in that order, if one exists. Otherwise, the first value in the output is used.
  2. For videos, do we want to use the FITS output in the metadata/video section above to assemble the value for dams:quality? This would be more efficient, and we can still run FFMPEG to extract the quality value if FITS returns nothing.
  3. The format name is more specific in FITS than in Jhove. As the example above shows, it can be something like "Quicktime", "MPEG-4 Media File", or "ISO Media, MPEG v4 system, version 1". Do we want to convert it to a standard format name?
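
Questions 2 and 3 could look roughly like this in Python. Everything here is a hedged sketch: the `STANDARD_FORMAT` table, its "MPEG-4" target name, and the functions `standard_format` and `quality_with_fallback` are hypothetical, and the FFMPEG step is passed in as a callable rather than invoked directly.

```python
# Hypothetical lookup table: FITS format names (lower-cased) -> one standard name.
# The entries and the target name are illustrative only; the real mapping is undecided.
STANDARD_FORMAT = {
    "quicktime": "MPEG-4",
    "mpeg-4 media file": "MPEG-4",
    "iso media, mpeg v4 system, version 1": "MPEG-4",
}


def standard_format(fits_format):
    """Map a FITS format name to a standard name, leaving unknown names unchanged."""
    return STANDARD_FORMAT.get(fits_format.strip().lower(), fits_format)


def quality_with_fallback(fits_quality, run_ffmpeg):
    """Prefer the value assembled from FITS; call run_ffmpeg() only when FITS gave nothing."""
    if fits_quality:
        return fits_quality
    return run_ffmpeg()
```

Passing the FFMPEG step as a callable means the slower external process only runs when FITS returns no usable value, which is the efficiency gain question 2 refers to.
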

With the PR that integrates the solution into damsrepo, I think we can deploy it to QA/Staging and use damsmanager to ingest files in different formats to see how it goes. We can then decide on the questions above based on the test results. What do you think?

lsitu commented 6 years ago

@gamontoya / @mcritchlow I think the approach of replacing Jhove with FITS and other tools is ready for review and evaluation; see PR https://github.com/ucsdlib/damsrepo/pull/68.