richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

SF writing null bytes in -csv mode #95

Closed: tw4l closed this issue 7 years ago

tw4l commented 7 years ago

Have run across this several times now. Have isolated an example and will send along by email.

The issue appears to relate to file names and character encoding in some way, at least as one possible cause.

tw4l commented 7 years ago

Other useful details:

In the most recent instance, this happened with siegfried 1.6.7 (default.sig 2016-11-22T20:59:52+11:00; identifiers: pronom: DROID_SignatureFile_V88.xml; container-signature-20160927.xml) on Ubuntu 16.04 LTS (BitCurator).

About 6 months ago, I ran into the same issue with files from a different archive, using an older version of Siegfried (not sure exactly which version) on OS X 10.9 Mavericks on a 2015 MacBook Pro.

richardlehane commented 7 years ago

Thanks for this report, Tim. The underlying issue here seems to be zip file name encoding. See this blog post for background as to why zip really sucks in this respect: https://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/ !

The concrete issue for you is that your example file has a really weird UTF-8 name, which SF is incorrectly detecting as Latin/IBM437 (https://en.wikipedia.org/wiki/Code_page_437) and trying to decode that way. That decoding process introduces the NUL value. The culprit code is: https://github.com/richardlehane/characterize/blob/master/zipname.go

I'll tidy up this function to improve encoding detection and possibly also introduce a printable character check as fail safe.

This fix will be in the next release, which I was hoping would be this month but may be next :)

richardlehane commented 7 years ago

Hi Tim, this should now be fixed in sf 1.7.0.