lsitu closed this issue 6 years ago
@gamontoya I found that FITS can extract some quality-related metadata from video and audio files, in which case we may not need to run FFMPEG. Could we assemble the quality value from the following fields of a .mov file?
<metadata>
<video>
<location toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">/Users/lsitu/Documents/git/damsrepo/src/sample/files/video.mp4</location>
<mimeType toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">video/quicktime</mimeType>
<format toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">Quicktime</format>
<formatProfile toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">Base Media</formatProfile>
<duration toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">5039</duration>
<bitRate toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">1249972</bitRate>
<dateCreated toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">UTC 1904-01-01 00:00:00</dateCreated>
<dateModified toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">UTC 2015-04-10 18:33:58</dateModified>
<track type="video" id="1" toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">
<videoDataEncoding>avc1</videoDataEncoding>
<codecId>avc1</codecId>
<codecCC>avc1</codecCC>
<codecVersion>High@L3</codecVersion>
<codecName>AVC</codecName>
<codecFamily>H.264</codecFamily>
<codecInfo>Advanced Video Codec</codecInfo>
<compression>Unknown</compression>
<byteOrder>Unknown</byteOrder>
<bitDepth>8 bits</bitDepth>
<bitRate>1114069</bitRate>
<duration>5039</duration>
<trackSize>701637</trackSize>
<width>720 pixels</width>
<height>480 pixels</height>
<frameRate>29.970</frameRate>
<frameRateMode>Constant</frameRateMode>
<frameCount>151</frameCount>
<aspectRatio>4:3</aspectRatio>
<scanningFormat>Progressive</scanningFormat>
<chromaSubsampling>4:2:0</chromaSubsampling>
<colorspace>YUV</colorspace>
<broadcastStandard>NTSC</broadcastStandard>
</track>
<track type="audio" id="2" toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">
<audioDataEncoding>AAC</audioDataEncoding>
<codecId>40</codecId>
<codecFamily>AAC</codecFamily>
<compression>Lossy</compression>
<bitRate>128000</bitRate>
<bitRateMode>Variable</bitRateMode>
<duration>5035</duration>
<trackSize>80011</trackSize>
<soundField>Front: L R</soundField>
<samplingRate>48000</samplingRate>
<numSamples>241680</numSamples>
<channels>2</channels>
</track>
</video>
</metadata>
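A minimal sketch of how a quality value could be assembled from the fields above, using only Python's standard library. The output format (`WxH, fps, video bps, audio Hz`) and the function names `assemble_quality` and `track_fields` are assumptions for illustration; the thread does not define what dams:quality should look like. The parser ignores XML namespaces, so it works on the excerpt as pasted here and should also tolerate output that declares a default namespace.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the FITS <metadata> excerpt above (only the fields used here).
FITS_METADATA = """\
<metadata>
  <video>
    <bitRate toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">1249972</bitRate>
    <track type="video" id="1" toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">
      <bitRate>1114069</bitRate>
      <width>720 pixels</width>
      <height>480 pixels</height>
      <frameRate>29.970</frameRate>
    </track>
    <track type="audio" id="2" toolname="MediaInfo" toolversion="0.7.75" status="SINGLE_RESULT">
      <bitRate>128000</bitRate>
      <samplingRate>48000</samplingRate>
      <channels>2</channels>
    </track>
  </video>
</metadata>
"""


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def track_fields(root, track_type):
    """Return the child fields of the first <track> with the given type."""
    for elem in root.iter():
        if local_name(elem.tag) == "track" and elem.get("type") == track_type:
            return {local_name(c.tag): (c.text or "").strip() for c in elem}
    return {}


def assemble_quality(fits_xml):
    """Build a quality string (hypothetical format), or None if FITS had nothing."""
    root = ET.fromstring(fits_xml)
    video = track_fields(root, "video")
    audio = track_fields(root, "audio")

    parts = []
    if video.get("width") and video.get("height"):
        # FITS reports e.g. "720 pixels"; keep only the number.
        width = video["width"].split()[0]
        height = video["height"].split()[0]
        parts.append(f"{width}x{height}")
    if video.get("frameRate"):
        parts.append(f"{video['frameRate']} fps")
    if video.get("bitRate"):
        parts.append(f"video {video['bitRate']} bps")
    if audio.get("samplingRate"):
        parts.append(f"audio {audio['samplingRate']} Hz")

    # None signals that the caller should fall back to another tool.
    return ", ".join(parts) or None


print(assemble_quality(FITS_METADATA))
# 720x480, 29.970 fps, video 1114069 bps, audio 48000 Hz
```

Returning `None` when no usable track is found leaves room for a fallback to FFMPEG, which is not sketched here.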
@gamontoya / @mcritchlow I see that FITS may report conflicting results from different tools. So far I have seen this in the <identification> and <fileinfo> sections. For example, we could get the following result for <identification> with an mp4 video:
<identification status="CONFLICT">
<identity format="Quicktime" mimetype="video/quicktime" toolname="FITS" toolversion="build.version=1.3.0">
<tool toolname="MediaInfo" toolversion="0.7.75" />
</identity>
<identity format="MPEG-4 Media File" mimetype="application/mp4, video/mp4" toolname="FITS" toolversion="build.version=1.3.0">
<tool toolname="Droid" toolversion="6.3" />
<externalIdentifier toolname="Droid" toolversion="6.3" type="puid">fmt/199</externalIdentifier>
</identity>
<identity format="ISO Media, MPEG v4 system, version 1" mimetype="video/mp4" toolname="FITS" toolversion="build.version=1.3.0">
<tool toolname="file utility" toolversion="5.04" />
</identity>
</identification>
We have the option to turn off conflict output and allow only a single result, but I see that doing so may lose output from other tools in sections like <metadata>. So I am thinking about allowing the conflict output but choosing the result we want by tool priority. For example, if there is a result from Jhove, we use it first. If there is no Jhove output, we choose from ExifTool and then file utility, in that order. If none of the results come from Jhove, ExifTool, or file utility, we can just use one of the conflicting results. In the case above, the result from file utility will be used, which is mimetype="video/mp4" and format="ISO Media, MPEG v4 system, version 1". Do you have any preferences/thoughts on handling the conflict results? Does the strategy above sound good to start with?
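The priority rule described above can be sketched as follows, in Python with the standard library only. The function name `choose_identity` and the `TOOL_PRIORITY` list are hypothetical, and the exact `toolname` strings FITS emits for Jhove and ExifTool are assumptions (only MediaInfo, Droid, and file utility appear in the sample), so the match is case-insensitive. This is an illustration of the selection logic, not the damsrepo implementation.

```python
import xml.etree.ElementTree as ET

# The conflicting <identification> block from the example above.
IDENTIFICATION = """\
<identification status="CONFLICT">
  <identity format="Quicktime" mimetype="video/quicktime" toolname="FITS" toolversion="build.version=1.3.0">
    <tool toolname="MediaInfo" toolversion="0.7.75" />
  </identity>
  <identity format="MPEG-4 Media File" mimetype="application/mp4, video/mp4" toolname="FITS" toolversion="build.version=1.3.0">
    <tool toolname="Droid" toolversion="6.3" />
    <externalIdentifier toolname="Droid" toolversion="6.3" type="puid">fmt/199</externalIdentifier>
  </identity>
  <identity format="ISO Media, MPEG v4 system, version 1" mimetype="video/mp4" toolname="FITS" toolversion="build.version=1.3.0">
    <tool toolname="file utility" toolversion="5.04" />
  </identity>
</identification>
"""

# Preferred tools, highest priority first (names assumed, matched case-insensitively).
TOOL_PRIORITY = ["Jhove", "ExifTool", "file utility"]


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def choose_identity(identification_xml, priority=TOOL_PRIORITY):
    """Return (format, mimetype) from the highest-priority tool, else the first identity."""
    root = ET.fromstring(identification_xml)
    identities = [e for e in root.iter() if local_name(e.tag) == "identity"]

    # Map each reporting tool (lower-cased) to the first identity it backs.
    by_tool = {}
    for identity in identities:
        for child in identity:
            if local_name(child.tag) == "tool":
                by_tool.setdefault(child.get("toolname", "").lower(), identity)

    for tool in priority:
        chosen = by_tool.get(tool.lower())
        if chosen is not None:
            return chosen.get("format"), chosen.get("mimetype")

    # No preferred tool reported anything: fall back to the first result.
    if identities:
        return identities[0].get("format"), identities[0].get("mimetype")
    return None


print(choose_identity(IDENTIFICATION))
# ('ISO Media, MPEG v4 system, version 1', 'video/mp4')
```

Keeping the priority as a plain list means the order can be changed per deployment, or per file format if one tool turns out to be more reliable for certain types.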
@lsitu I'm not sure we would want to turn off CONFLICT, and I like what you're proposing. However, let me ask @arwenhutt for her opinion.
@gamontoya @lsitu I don't really understand what a "conflict output" is or what the implications are. I think Tori Manches (she might need to be invited to GitHub) and @stelnabli know more about the tools and technical info - so would probably understand better.
@arwenhutt My understanding of conflict output here is that the tools FITS uses to extract technical metadata give different results for a field like mimetype. In the above example, we got three results for mimetype with the mp4 video tested:
MediaInfo: video/quicktime
Droid: application/mp4, video/mp4
file utility: video/mp4
So I think we may need to decide which result we prefer to use in dams4 when there are multiple results extracted by FITS.
@arwenhutt @gamontoya @lsitu Arwen, that's correct. The tools and the format tree can be customized in the config file to avoid conflict reports: https://projects.iq.harvard.edu/fits/fits-configuration-files
Related FITS/techMD research: https://github.com/ucsdlib/dams-metadata/issues/69#issuecomment-419588278 & https://goo.gl/iuJGD3
Thanks @stelnabli. I've tested the single-result setting and even tried re-ordering the conflict outputs. I see that it may lose information that other tools could extract if I configure it to use the single result. I think the question goes back to my original comment https://github.com/ucsdlib/damsrepo/issues/67#issuecomment-426018077 now. Do we want a single result, which may come from any tool in the set, to avoid the conflict? Or do we want to selectively choose the result from the tool set when there are conflicting results?
@lsitu I'm cautious, as you've mentioned, about losing information that can be extracted by other tools when one tool may be better than another for a particular file format. So I think caution would support selectively choosing the result rather than having a single result that may lead to loss of info. I do not know practical examples from others' experience using FITS; it would be good to find out the pros and cons of each approach, since our DAMS conceivably will get all manner of formats (especially as born-digital preservation ramps up).
happily defers to those who know more about technical metadata
I like the idea of selectively choosing the result, too, since like @stelnabli said one tool might work better than another depending on the format. I'd also like to know about others' experiences with FITS, since more of my experience has been in trying to implement it, and less how things are long-term after implementation.
@vmaches Thank you very much for your input.
@gamontoya / @mcritchlow I've implemented the solution to replace Jhove with FITS and other tools like FFMPEG. I see it as an improvement that supports many more file types than Jhove, with the following questions open:
Do we want to use the metadata/video section above to assemble the value for dams:quality or not? This will be more efficient, and we can still run FFMPEG to extract the quality value if nothing is returned from FITS.
The format name can be reported as Quicktime, MPEG-4 Media File, ISO Media, MPEG v4 system, version 1, etc. Do we want to convert it to a standard format name or not?
With the PR that integrates the solution with damsrepo, I think we can deploy it to QA/Staging and test it with damsmanager by ingesting different file formats/files to see how it goes. We can then decide on the issues above based on the test results. What do you think?
@gamontoya / @mcritchlow I think the approach to replace Jhove with FITS and other tools is ready for review and evaluation; see PR https://github.com/ucsdlib/damsrepo/pull/68.
Descriptive summary
Propose a solution to replace Jhove with FITS and other tools to support more file types in general.
Rationale
We've used Jhove in DAMS to extract technical metadata for many years, but the file formats supported by Jhove are very limited. This ticket is to explore replacing Jhove with FITS and/or other tools for more general file type support.