richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Quicktime identification very slow with recent version #99

Open orzel opened 7 years ago

orzel commented 7 years ago

Hello,

When using siegfried over a quicktime (x-fmt/384) file of typical size ~200G, the identification time exploded from very few seconds (siegfried 1.5) to dozens of minutes (siegfried 1.7).

First test (extract from the json output): "siegfried":"1.5.0" "signature":"default.sig" "details":"DROID_SignatureFile_V84.xml; container-signature-20160121.xml" "basis":"extension match mov; byte match at 0, 12 (signature 8/11)"

Second test : "siegfried":"1.7.0" "signature":"default.sig" "details":"DROID_SignatureFile_V88.xml; container-signature-20160927.xml" "basis":"extension match mov; byte match at [[[4 8]] [[2046628976 12]]] (signature 4/8)"

Our current guess is the following: according to the output, the signature has changed to now also use one (or some) bytes at the very end of the file. And for some unknown reason, siegfried reads the whole file to check those bytes (instead of seeking there as I would expect).

Do you have any more information about this ? Can you confirm this analysis ? Is there any reason blocking the use of 'seek' to reach the end of the file ?

richardlehane commented 7 years ago

Hi Thomas thanks very much for this report.

The immediate cause of this (massive) slowdown seems to be a change to PRONOM's signatures for x-fmt/384. Looking at the PRONOM release notes (http://www.nationalarchives.gov.uk/aboutapps/pronom/release-notes.xml) there was a change in v86 (with the note "simplified signature"). We can isolate this change by looking at historic versions of the PRONOM database in Ross Spencer's PRONOM archive on github (https://github.com/exponential-decay/pronom-archive-and-skeleton-test-suite/releases).

From the basis field in your results, the first match was against the eighth of eleven signatures for the v84 release, which was this one:

[image: v84]

The second match was against the fourth of eight signatures (so three likely removed in that "simplification" of the signatures):

[image: v88]

You can see that the first signature can be satisfied very quickly, just by looking at the first few bytes of the file. The second signature will take potentially much longer as that wildcard (the "*" in the signature) means that the fragment beginning "moov" can occur anywhere in the file after that first fragment (and is defined at an offset from the beginning of the file, not the end of the file). In your example file this second fragment occurs right near the end of the file which means > 200 GB of reading and scanning to discover it.
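To make the difference between the two kinds of fragment concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions, not how sf itself matches: the data is held in memory, the helper names `match_anchored` and `match_wildcard` are hypothetical, and the sample bytes are a made-up stand-in for a QuickTime file.

```python
def match_anchored(data: bytes, pattern: bytes, offset: int) -> bool:
    """Check a fragment at a fixed offset from the beginning of the data.

    Only len(pattern) bytes are inspected, however large the data is.
    """
    return data[offset:offset + len(pattern)] == pattern


def match_wildcard(data: bytes, pattern: bytes, start: int) -> int:
    """Search for a fragment anywhere at or after `start`.

    Returns the offset of the first hit, or -1. In the worst case every
    byte after `start` has to be examined.
    """
    return data.find(pattern, start)


# Made-up sample: 4-byte size field, 'ftypqt  ' at offset 4, padding, then 'moov'.
sample = b"\x00\x00\x00\x14" + b"ftypqt  " + b"\x00" * 1000 + b"moov" + b"\x00" * 8

anchored_hit = match_anchored(sample, b"ftypqt  ", 4)
moov_offset = match_wildcard(sample, b"moov", 12)
print(anchored_hit, moov_offset)
```

The anchored check costs the same for a 1 KB file and a 200 GB file, while the cost of the wildcard search grows with the distance to the first "moov".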

In terms of your question about sf using seek: this depends on the type of input source. For most files, sf uses memory mapping for file access. For files of the size you're dealing with, this won't work, & in these cases sf falls back to a "bigfile" mode. In this mode, sf does indeed seek when scanning end of file sequences. But this doesn't help in this case because the pattern we are looking for ("moov") is defined as a wildcard from the beginning of the file.

One possible solution for you is to customise your signature file using the roy tool.

sf is by default very literal in the way it applies PRONOM signatures. This means if a signature has a wildcard then sf will scan the whole file if necessary to find it, even if that file is 200GB in size. You can override this behaviour by modifying your signature file, e.g. to add limits to the number of bytes from the beginning and end of a file that will be scanned. So roy build -bof 500000 -eof 140000 will force sf to only scan the first 500k and last 140k of a file. Additionally, if you provide values for both -bof and -eof then roy will apply a feature called signature mirroring: this takes wildcard beginning of file sequences (such as the "*moov" pattern in x-fmt/384) and creates additional end of file sequences (e.g. "moov*"), ensuring that maximum use is made of the end of file buffer.
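The idea of limiting a scan to a window at each end of the file can be sketched in Python as follows. This is an illustration of the general technique only: the function names are hypothetical, it does not model signature mirroring, and it makes no claim about how roy or sf implement the -bof and -eof limits.

```python
import io


def read_windows(f, bof: int, eof: int):
    """Return (head, tail): the first `bof` bytes and the last `eof` bytes.

    Uses seek, so the middle of the file is never read.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    f.seek(0)
    head = f.read(min(bof, size))
    tail_start = max(size - eof, 0)
    f.seek(tail_start)
    tail = f.read()
    return head, tail


def find_in_windows(f, pattern: bytes, bof: int, eof: int) -> bool:
    """True if `pattern` lies entirely inside the head or the tail window."""
    head, tail = read_windows(f, bof, eof)
    return pattern in head or pattern in tail


# Made-up stream: 'moov' starts 104 bytes before the end.
demo = io.BytesIO(b"\x00\x00\x00\x14ftypqt  " + b"\x00" * 10_000 + b"moov" + b"\x00" * 100)

print(find_in_windows(demo, b"moov", bof=500, eof=200))  # tail window covers it
print(find_in_windows(demo, b"moov", bof=500, eof=50))   # tail window too small
```

The trade-off is visible in the second call: a fragment that sits outside both windows is simply never seen.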

I'm going to label this as PRONOM as I think the issue predominantly lies there. But I will also have a bit of a think to see if any improvements can be made to siegfried to better deal with your use case without having to mod signatures.

Apologies for length of this response, hope it makes sense, cheers Richard

Dclipsham commented 7 years ago

Thanks Richard, Hi Thomas.

An explanation for the change can be found here: https://groups.google.com/forum/#!topic/pronom/Q2mXbNmNTbU

Basically the previous set of Quicktime signatures seemed to have built up over time based primarily on observation each time we hit a variant QT file that wouldn't ID. We were still hitting false negatives in our own collections, and receiving reports of the same from elsewhere, so I took the decision to restart based upon a stricter interpretation of the QT specification, which requires a moov atom.

The moov atom contains the movie metadata and current advised practice (vital for streaming) is to place it at the beginning of the file, but as we've seen it can appear anywhere in the file, not simply either at the start or the end.

For identification I favour accuracy over speed, so I think I prefer to leave the signatures as they are now (but I'm happy to hear arguments against). That said, in this specific scenario, with the lead brand being described from offset 4 as 'ftypqt ' - could this be the one variant where it makes sense to drop the necessity for subsequently finding the moov atom?

Perhaps there are also byte seek optimisations that could assist? Note that MP4, plus all of the Broadcast Wave variants (among others) also contain a full wildcard pattern so could also be subject to lengthy ID times (see also Ross Spencer's investigation into the WAV variants: https://github.com/exponential-decay/digital-preservation-stage-boss-one/blob/master/final-report/digital-preservation-stage-boss-one.pdf , and full list of wildcards from v86 DROID signature file: https://github.com/exponential-decay/digital-preservation-stage-boss-one/blob/master/wildcard-signature-information/PRONOM-wildcard-signatures-v86.csv).

David

orzel commented 7 years ago

Guys, I've been away for some days and I only now discover your answers. Thanks a lot for such informative and complete answers, both of you !

I had noticed that I could try to use roy to get a default.sig that better suits my needs, but I had failed to do so. Richard explained clearly my misunderstanding : it's not a sig "at the end" but a wildcard that can happen anywhere.

I did a quick test with roy build -bof 500000 -eof 140000 and the resulting default.sig fails to identify the file, but I'll investigate some more. I have yet to read all the links provided. The one listing all 'wildcards' is especially useful, most of the files we deal with are very big (we do preservation for the cinema industry).

richardlehane commented 7 years ago

Hi Thomas sorry I noticed afterwards that I misread your report: the "moov" fragment in your file is reported at 2046628976 offset from the beginning of the file... I read this as being near the end of your ~200GB file but actually that figure translated to gigabytes is ~2GB, so nowhere near the end (and so a roy build -bof 500000 -eof 140000 solution won't work as those buffers aren't nearly large enough to capture this match). You have some really big QT files!
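The arithmetic behind this correction is easy to check. The offset below is taken from the "basis" field of the 1.7.0 output earlier in the thread, and the window size from the suggested roy build command; treating the -bof value as a byte count is an assumption.

```python
moov_offset = 2_046_628_976   # from the basis field: [[2046628976 12]]
bof_window = 500_000          # from: roy build -bof 500000 -eof 140000

gb = moov_offset / 10**9      # decimal gigabytes
gib = moov_offset / 2**30     # binary gibibytes

print(f"{gb:.2f} GB / {gib:.2f} GiB from the beginning of the file")
print("inside the -bof window:", moov_offset < bof_window)
```

The fragment sits roughly 4,000 times further into the file than the 500,000-byte window reaches.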

This makes it quite a hard problem as we can't just do a simple modification to the buffer sizes to speed this up. In order to satisfy the signature as it is now written, we really have to do a wild card search of at least the first 2GB of your file, which is always going to take a long time.

Ultimately the best solution would be a PRONOM change along the lines @Dclipsham suggests: add a new signature that would just match on the "ftypqt" brand, and drop the requirement to also find a "moov" fragment.

Rather than wait for a PRONOM update, you can use the roy build -extend command to add a custom signature along these lines. Custom signatures are normally used to add new, org-specific formats to PRONOM but you can also use them to add new byte signatures to existing PRONOM formats (in this case x-fmt/384).

You can use @ross-spencer 's Signature Development Utility http://exponentialdecay.co.uk/sd/index.htm to build custom signatures. They should go in a /custom folder within your ~/siegfried directory.

Creating the signature looks like this: [image: cap]

Note I deliberately didn't put an extension in as the extensions for this format are already defined in the main DROID file and we don't want to override those.

Running this command and then testing looks like this: [image: cap1]

For your convenience, here is the extension file that I used: quicktime-ext.xml.zip Just copy the quicktime-ext.xml file into your ~/siegfried/custom folder and do roy build -extend quicktime-ext.xml.

I hope this works for your use case. cheers Richard

orzel commented 7 years ago

Hi, Indeed 2G is only at the "beginning" of our big file. We could deal with bof/eof of ~4G, but roy will fail:

roy build -bof 4294967296 -eof 4294967296
2017/04/25 08:30:38 int overflows uint32

I tested adding your quicktime-ext.xml, and the resulting default.sig only takes 15 s to identify a QT file of ~148G. This looks promising, thanks a lot !

I understand perfectly and fully agree with your statement of "accuracy over speed". I don't understand (yet) much of the signature xml format, but I was wondering if it would make sense to use both the previous QT signatures (fast/empirical ?) and the new, stricter ones, in such a way that the faster ones are tried first ? The only 'danger' I can see is if some non-QT file matches the fast signatures. I'm not sure what 'false negatives' are in your description (QT not identified as QT, or non-QT identified as such).

If I understood well, you have some non-regression tests with lots of samples, so this would mitigate this risk, wouldn't it ?

richardlehane commented 7 years ago

That roy failure is because I'm storing the ints as unsigned 32 bit integers and never expected anyone to try to add a bigger buffer than 4,294,967,295! Could fix this but to be honest it wouldn't be a big help for you as it would still be very very slow to run.
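The "int overflows uint32" message is consistent with the value that was requested: 4294967296 is exactly 2^32, one more than the largest value an unsigned 32-bit integer can hold. A small Python check, with a hypothetical helper name, shows the boundary:

```python
UINT32_MAX = 2**32 - 1   # 4,294,967,295: largest unsigned 32-bit value
INT32_MAX = 2**31 - 1    # 2,147,483,647: largest signed 32-bit value, for comparison


def fits_uint32(n: int) -> bool:
    """True if n can be stored in an unsigned 32-bit integer."""
    return 0 <= n <= UINT32_MAX


requested = 4_294_967_296  # the value passed to -bof and -eof in the failing command

print(fits_uint32(requested))       # the failing value
print(fits_uint32(requested - 1))   # the largest value that would fit
```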

Re. adding all the old signatures into the mix, you could certainly do that & in fact that's pretty much what I did by using -extend flag above, except I only added in one of the old signatures (the one that matches your example file). All the new signatures are in there too, sf stops earlier now because it finds that old signature first and then determines it doesn't need to go any further (as no stronger matches are possible).

The -extend flag can take multiple custom signatures (separated by commas) or you can put multiple custom signatures into a single custom file. Ross's Signature Development Utility doesn't allow you to do multiple signatures in one file but you can hand create (or in this case just copy paste relevant bits from the v84 DROID xml file).

Here's a revised extension file for you that has all the v84 QT sigs: quicktime-ext.xml.zip The install process is the same: copy the quicktime-ext.xml file into your ~/siegfried/custom folder and do roy build -extend quicktime-ext.xml.

By including the old QT signatures alongside the new ones, you certainly do re-introduce the risk of some non-QT files matching as QT, if those old signatures are too permissive. But I think given the size of the files you are dealing with it is definitely reasonable to customise your signatures to fit your use case.

orzel commented 7 years ago

Works perfectly, thanks again. Do you expect to make any change related to this for the next release ? I now know how to keep going, but of course I'd prefer using a vanilla sig file :)

richardlehane commented 7 years ago

I'm working on a couple of performance improvements but they're pretty subtle & to be honest it is very hard to speed up this kind of case where forced to scan > 2G of bytes to find a wildcard pattern that might not even exist (& I/O being the real time killer). I.e. I may be able to shave some secs, but not minutes.

I think the real fix to this issue would be a PRONOM update to include non-wildcard based signatures for this format - @Dclipsham seemed partly open to this & I'd suggest continuing the discussion with him and the TNA team.

For the default.sig signature file, I think it is important that this represent the PRONOM database exactly, so I don't propose altering it by default with the "quicktime-ext.xml" file. But I will include that custom signature file with the set of custom signatures included with sf releases (you may have seen some archivematica extension signatures in the custom folder if you install with a package manager). This would mean you would still need to do a roy build -extend quicktime-ext.xml post-upgrade but you wouldn't have any other setup to do (preparing the custom folder etc.). This may help e.g. if setting up on new machines.

Suggest leaving this ticket open until it is fixed at PRONOM end & sorry I can't do more to assist

Dclipsham commented 7 years ago

Yes, I'll adjust the PRONOM signature entry for this specific scenario (applies to the signature 'QuickTime variant4'): where the first atom at offset 4 is FTYP and the major brand is 'qt ' then we won't seek the MOOV atom. This represents a byte sequence of 0x6674797071742020 found from offset 4 only.
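For readers who want to see what that hex sequence spells out, it can be decoded and checked against a header in a few lines of Python. The function name `matches_variant4` and the sample header bytes are hypothetical; only the hex string and the offset come from the comment above.

```python
# Hex sequence quoted above, decoded to raw bytes.
SIG = bytes.fromhex("6674797071742020")
OFFSET = 4


def matches_variant4(header: bytes) -> bool:
    """True if the 8-byte sequence is present at offset 4 of the header."""
    return header[OFFSET:OFFSET + len(SIG)] == SIG


print(SIG)  # the ASCII text 'ftypqt' followed by two spaces

# Made-up header: 4-byte atom size, then 'ftyp' and the 'qt  ' major brand.
sample_header = b"\x00\x00\x00\x14" + b"ftypqt  " + b"\x00" * 8
print(matches_variant4(sample_header))
```

Because the check is anchored at a fixed offset, it needs only the first 12 bytes of the file.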

Where the first atom at offset 4 is any from MDAT, CMOV, PNOT, SKIP, FREE, WIDE, then we will continue to seek the MOOV atom within the file.

There'll likely be a release in late-May/early-June.

Thanks for raising this. I hope this solution is satisfactory for all.

David

richardlehane commented 6 years ago

this change has now been made in the PRONOM v93 release so will close this issue. Thanks @Dclipsham!

obruchez commented 4 years ago

Hi everybody,

If I still have this problem in 2020, with the latest version of Archivematica, should I open a new ticket?

I have a 170 GB MOV file. It takes almost 2 hours to be identified as a x-fmt/384 / Quicktime file.

siegfried   : 1.8.0
scandate    : 2020-04-21T11:49:15+02:00
signature   : archivematica.sig
created     : 2020-01-21T23:33:10+01:00
identifiers :

richardlehane commented 4 years ago

arrgh, ghosts of the past! Are you able to post the output showing the id of the file (it would be helpful to see what the "basis" field says)? Thanks for reporting this Richard

obruchez commented 4 years ago

Here's the full output:

---
siegfried   : 1.8.0
scandate    : 2020-04-21T15:59:05+02:00
signature   : archivematica.sig
created     : 2020-01-21T23:33:10+01:00
identifiers :
  - name    : 'archivematica'
    details : 'fddXML.zip (DROID_SignatureFile_V96.xml, container-signature-20200121.xml); extensions: archivematica-fmt2.xml, archivematica-fmt3.xml, archivematica-fmt4.xml, archivematica-fmt5.xml'
---
filename : '486.mov'
filesize : 181985046943
modified : 2020-03-05T17:41:01+01:00
errors   :
matches  :
  - ns      : 'archivematica'
    id      : 'x-fmt/384'
    format  : 'Quicktime'
    version :
    mime    : 'video/quicktime'
    basis   : 'extension match mov; byte match at [[4 4] [181982707224 12]] (signature 2/8)'
    warning :

richardlehane commented 4 years ago

that makes sense, it's matching against this pattern: mdat*moov{0-4096}(mvhd|cmov|rmra) for x-fmt/384 It has to scan nearly the whole file to find that moov segment after the wildcard which is why it is so slow.
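As a rough illustration of why this pattern is expensive, it can be approximated with a Python regular expression over bytes. This is a sketch under assumptions, not siegfried's matching engine: the translation reads "*" as any number of bytes and "{0-4096}" as a gap of 0 to 4096 bytes, anchors "mdat" at offset 4 as the basis field reports, and uses a made-up sample and a hypothetical function name.

```python
import re

# Approximate translation of: mdat*moov{0-4096}(mvhd|cmov|rmra)
PATTERN = re.compile(rb"mdat.*?(moov.{0,4096}?(?:mvhd|cmov|rmra))", re.DOTALL)


def find_moov_fragment(data: bytes):
    """Return (offset, length) of the 'moov...' fragment, or None.

    The first fragment ('mdat') must sit at offset 4; the second may be
    anywhere after it, so the search can run to the end of the data.
    """
    m = PATTERN.match(data, 4)
    if m is None:
        return None
    return m.start(1), m.end(1) - m.start(1)


# Made-up sample: 'mdat' at offset 4, 5000 bytes of payload, then
# 'moov', a 4-byte size field and 'mvhd'.
sample = (b"\x00\x00\x00\x08mdat" + b"\x00" * 5000
          + b"moov" + b"\x00\x00\x00\x6c" + b"mvhd" + b"\x00" * 16)

print(find_moov_fragment(sample))
```

In this sample the second fragment is 12 bytes long ("moov", a 4-byte gap, "mvhd"), the same length the basis field reports for the real file.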

@Dclipsham fixed this issue last time by adjusting one of the patterns for this format. Perhaps a similar solution is possible this time round?

Another possible solution is a tailored signature file along the lines suggested above e.g. if you do roy build -bof 5000000 -eof 300000 you should get a speed-up.
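Whether fixed windows of a given size would still cover the match can be estimated from the numbers in the output posted earlier. The sketch below assumes that the -bof and -eof values are byte counts measured from each end of the file and that the whole fragment must fall inside a window; both are assumptions about roy's behaviour, not statements from its documentation.

```python
filesize = 181_985_046_943      # 'filesize' field in the sf output
moov_offset = 181_982_707_224   # second fragment in the basis field
match_len = 12                  # its reported length

bof_window = 5_000_000          # from: roy build -bof 5000000 -eof 300000
eof_window = 300_000

distance_from_end = filesize - moov_offset
inside_bof = moov_offset + match_len <= bof_window
inside_eof = moov_offset >= filesize - eof_window

print("bytes between fragment start and end of file:", distance_from_end)
print("covered by -bof window:", inside_bof)
print("covered by -eof window:", inside_eof)
```

On these figures the fragment starts about 2.3 MB before the end of the file, so under the stated assumptions the -eof value would need to exceed that distance for this particular file to be matched through the end-of-file window.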

I do have some ideas for performance improvements but realistically any scenario in which you need to scan all ~180GB of this file for a match will be slow.

obruchez commented 4 years ago

I understand. The largest MOV file we have has a size of 1.5 TB, so it could be even worse... :-)

Just as a reference point (I haven't looked at the code or anything), the whole transfer/ingest phases in Archivematica took about 3 hours:

So my feeling is that even if Siegfried has to look at the whole file, it could be way faster (e.g. the hash computation phase also had to read the whole file and do something with it). But to be honest I don't know the history/complexity of the project, so sorry if I'm way off.

richardlehane commented 4 years ago

Thanks for those numbers, they're an interesting comparison and something to aim for!

I have some optimisations in mind that I hope will improve performance for cases like this but format ID is an expensive task (consider that PRONOM contains 1000s of regex like patterns that all need to be searched) that will always be relatively slow if you need to do full file scans.

If you are routinely dealing with big Quicktime files like these, and if you need a PRONOM-based identification, I'd suggest:

Dclipsham commented 4 years ago

Tricky one. The moov atom is vital for ID of Quicktime - things won't play without it. We try to anchor it near the beginning of the file with one of the other atoms (mdat, cmov, free etc) so we'll only search the rest of the file if we find one of those atoms first, but unfortunately the moov can appear anywhere in the file (although I understand best practice is near the beginning, particularly for streaming).

A while back we were only seeking certain atoms (not necessarily including moov) but that gave us false positives so I'm not sure what we could do to make the signature any more efficient...

obruchez commented 4 years ago

Thanks. I'll try roy. For now, I'm using the "File Extension" tool in Archivematica as a workaround. Fido seems to misidentify our files (as Quicktime + Apple ProRes, which they are not...). I don't know why.

richardlehane commented 4 years ago

If modifying your signatures with roy, there's a few different approaches you could take (such as setting fixed -bof or -eof or limiting your scan to a set of fmts), but the cleanest may be just to extend your signature file with the old PRONOM sigs, as described here: https://github.com/richardlehane/siegfried/issues/99#issuecomment-296950848

If you're running within archivematica, your siegfried home path won't be ~/siegfried it will be something else, which you should be able to find by doing sf -v