richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Segfault on corrupt jar files #181

Closed diamondap closed 2 years ago

diamondap commented 2 years ago

We're running format identification on files in Amazon's S3 object storage. This works well until we run into certain corrupt files like the jar file at s3://aptrust.public.download/bug/alov_applet.jar.

When Siegfried tries to read this file, we get a segfault that crashes our format identifier. If you download that jar file and run sf alov_applet.jar, you'll see the following:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x126dc49]

goroutine 8 [running]:
github.com/richardlehane/siegfried/internal/bytematcher.(*Matcher).identify(0x0, 0x0, 0xc000032c60, 0x0, {0x0, 0x0, 0x0})
    /Users/diamond/go/pkg/mod/github.com/richardlehane/siegfried@v1.9.2/internal/bytematcher/identify.go:28 +0x49
created by github.com/richardlehane/siegfried/internal/bytematcher.(*Matcher).Identify
    /Users/diamond/go/pkg/mod/github.com/richardlehane/siegfried@v1.9.2/internal/bytematcher/bytematcher.go:173 +0x11d

If you open the jar file with emacs or unzip, you'll see it contains invalid header entries. The segfault likely comes from Siegfried trying to read entries at invalid offsets. For this particular file, unzip spits out a lot of messages like this:

file #2: bad zipfile offset (local header sig): 61
file #3: bad zipfile offset (local header sig): 231

Unzip somehow handles the bad offsets without crashing.

In our use case, we're dealing with millions of files, and Siegfried is one step in a long processing pipeline. We've run across a few files in testing that trigger this crash, and we expect to see more, given the volume we process. We can handle graceful shutdowns from SIGTERM, but when we get a hard crash from a segfault, the whole pipeline stops.

In case you're curious about how we call Siegfried, the relevant code is at https://github.com/APTrust/preservation-services/blob/master/ingest/format_identifier.go#L86

Thanks for any help you can offer.

richardlehane commented 2 years ago

thanks for reporting this, I'll take a look