richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Siegfried seems to skip certain files without error or warning #115

Closed MSavels closed 5 years ago

MSavels commented 6 years ago

Hi,

I'm currently comparing the results from DROID and Siegfried (through Brunnhilde). In a dataset containing 216420 files, there are only 2537 discrepancies between the two (roughly 1%), which imho is not bad. However, in my test at least 50% of these discrepancies are due to Siegfried apparently skipping a file. A comparison of the outputs by roy reports these files as "missing" from the Siegfried CSV (confirmed by manually checking the Siegfried CSV: they aren't there, so no mistake by roy). I redid the Brunnhilde analysis several times and each time the same files were skipped.

I analysed a few of these files (TIFFs in this case) with other programs (JHOVE, DPF Manager) and there seemed to be nothing wrong with them. I also checked whether it might be due to long paths/filenames, non-standard characters in the filename, too many files in a directory or extremely large files, but none of these seemed to be the problem. This was confirmed by analysing each file individually with Siegfried: the files were identified correctly. But when I analysed the directory directly with Siegfried, the same files were skipped again. I have no idea why, but I can provide you with the files and the different analyses if you need them.

Kind regards,

Maarten
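
One way to confirm which files are absent from a Siegfried run, independently of roy, is to compare the files on disk with the paths listed in the CSV report. The following is only a minimal sketch, assuming the report was produced with sf -csv and has a "filename" column holding each scanned path. The function names are hypothetical, and the listing comes from Python's own directory walk rather than from Siegfried.

```python
import csv
import io
import os


def paths_in_csv(csv_text):
    """Return the set of paths in the 'filename' column of a CSV report.

    Assumption: the report has a header row with a 'filename' column.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["filename"] for row in reader}


def paths_on_disk(root):
    """Return the set of non-directory paths found below root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.join(dirpath, name))
    return found


def missing_from_report(root, csv_text):
    """Return paths present on disk but absent from the report, sorted."""
    return sorted(paths_on_disk(root) - paths_in_csv(csv_text))
```

The comparison is a plain string match on paths, so root should be written exactly as it was passed to sf (same relative or absolute form); otherwise every file will appear to be missing.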

richardlehane commented 6 years ago

Hi Maarten, thanks for reporting this. It is a strange issue and, I confess, I'm a bit stumped!

Could you advise what OS you're on and what version of siegfried (sf -version)? Do the files have any access restrictions different to other files in the directory (I'd still expect an error but possibly worth checking)?

Getting the files from you likely won't help if they can be identified individually; the problem seems to have more to do with their place in the file system. But if you could narrow down the issue and provide a zipped minimal directory with selected files that triggers it, that would be a great help. Happy for you to send things to richard@itforarchivists.com

cheers Richard

MSavels commented 6 years ago

Hi Richard,

The OS is CentOS Linux 7.4.1708.
The Siegfried version is 1.7.8, with signature V93 and containers sig 20171130.
The files themselves are on a different server, mounted on the CentOS server.

I checked the rights too, no anomalies there: all files have the same permissions, regardless of whether they were skipped or analysed.

I'll shortly be sending you a package containing 236 files. 4 of them were consistently skipped during additional tests. The other ones are all the files in one directory that was skipped entirely.

However (the plot thickens) I redid the same tests on a back-up I have of these files (the files are totally identical: they have the same SHA-256 hash values). Here the previously skipped files were analysed as normal, but different files were skipped. So I doubt it has anything to do with the files themselves; it seems to have more to do with the way the list of files is built.

Kind regards,

Maarten

richardlehane commented 6 years ago

Thanks Maarten, I'm downloading the files now.

If you're scanning files over a network connection, it might be worth trying the -throttle flag to see if it assists. E.g. sf -throttle 50ms DIR. This may help narrow the issue down.

MSavels commented 6 years ago

Tried it both with -throttle 50ms and 100ms. The same files were skipped.

richardlehane commented 6 years ago

The files all scanned correctly on my Windows laptop (i.e. 236 files in the zip, and 236 files in the results file). This does seem to be related to the way sf is walking your file system, rather than relating to the file contents.

richardlehane commented 6 years ago

OK this golang bug seems like a possible cause: https://github.com/golang/go/issues/24015

Unfortunately, if this is the bug, it may be necessary to wait for a Red Hat update to fix it. In later versions of the Linux kernel (> 3.10) this problem seems to have been fixed.

richardlehane commented 6 years ago

If this is a kernel bug, a workaround pending a fix may be to use another tool like ls or find to manage the directory walk and pipe the list of files to sf for scanning.

Like:

find DIR -type f | sf -f -
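
The same idea can be scripted when the file list needs filtering or logging before the scan. This is a rough sketch rather than part of siegfried itself: it assumes sf is on the PATH and accepts the -f - form shown above, and the function names list_files and scan_with_sf are hypothetical. Note that it relies on Python's directory walk, which is a different code path from sf's own walk but is not guaranteed to be immune to a file-system level problem.

```python
import os
import subprocess


def list_files(root):
    """Walk root and return regular files only (no symlinks), sorted.

    This roughly mirrors `find root -type f`.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                paths.append(path)
    return sorted(paths)


def scan_with_sf(root, sf_binary="sf"):
    """Feed the file list to `sf -f -` on stdin and return its output."""
    paths = list_files(root)
    if not paths:
        return ""  # nothing to scan
    result = subprocess.run(
        [sf_binary, "-f", "-"],        # assumption: same flags as the find example
        input="\n".join(paths) + "\n",  # one path per line, as find prints them
        capture_output=True,
        text=True,
        check=True,                     # raise if sf exits with an error
    )
    return result.stdout
```

Comparing len(list_files(DIR)) with the number of entries in the sf output gives a quick check that nothing was dropped between the listing and the scan.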

MSavels commented 6 years ago

The golang bug workaround (enforcing CIFS version 1.0 on mount) didn't work. The same files were skipped. Piping the list in from find, however, did work. No files were skipped then. So for me, that solved it. Thanks for the help.

Kind regards,

Maarten

richardlehane commented 6 years ago

The recent golang 1.11 release has introduced a fix for this issue. I'm hopeful that a siegfried binary built with 1.11 will resolve this.

Unfortunately v1.7.9 binaries are still built with 1.10 as that is the current release supported by travis/appveyor. So will leave it open until the release binaries are built with 1.11

richardlehane commented 5 years ago

Hi Maarten, I released v1.7.10 with new binaries built with golang 1.11. This should, I believe, finally fix this issue. Will close now, but please reopen if you can still reproduce.

cheers Richard