We're running format identification on files in Amazon's S3 object storage. This works well until we run into certain corrupt files like the jar file at s3://aptrust.public.download/bug/alov_applet.jar.
When Siegfried tries to read this file, we get a segfault that crashes our format identifier. If you download that jar file and run sf alov_applet.jar, you'll see the following:
If you open the jar file with emacs or unzip, you'll see it contains invalid header entries. The segfault likely comes from Siegfried trying to read entries at invalid offsets. For this particular file, unzip spits out a lot of messages like this:
file #2: bad zipfile offset (local header sig): 61
file #3: bad zipfile offset (local header sig): 231
Unzip somehow handles the bad offsets without crashing.
In our use case, we're dealing with millions of files, and Siegfried is one step in a long processing pipeline. We've run across a few files in testing that trigger this crash, and we expect to see more, given the volume we process. We can handle graceful shutdowns from SIGTERM, but when we get a hard crash from a segfault, the whole pipeline stops.
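As a stopgap on our side, something like the sketch below might keep one bad file from stopping the pipeline. It rests on an assumption we have not confirmed: that the crash surfaces as a Go runtime panic (for example a nil pointer dereference or an out-of-range index) on the same goroutine that makes the identification call. The names identifyFunc and safeIdentify are hypothetical and are not part of Siegfried; identifyFunc stands in for whatever function actually calls Siegfried.

```go
package main

import "fmt"

// identifyFunc is a hypothetical stand-in for the call that performs
// format identification (for example, a wrapper around Siegfried).
type identifyFunc func(path string) (string, error)

// safeIdentify calls identify and converts a runtime panic raised on this
// goroutine into an ordinary error. It cannot catch panics raised on other
// goroutines, or faults that occur outside the Go runtime.
func safeIdentify(identify identifyFunc, path string) (format string, err error) {
	defer func() {
		if r := recover(); r != nil {
			format = ""
			err = fmt.Errorf("format identification panicked on %s: %v", path, r)
		}
	}()
	return identify(path)
}

func main() {
	// broken simulates a parser that reads an entry at an invalid offset.
	broken := func(path string) (string, error) {
		var entries []int
		_ = entries[3] // out-of-range index triggers a runtime panic
		return "unknown", nil
	}

	// The panic is reported as an error instead of killing the process.
	if _, err := safeIdentify(broken, "alov_applet.jar"); err != nil {
		fmt.Println("skipping file:", err)
	}
}
```

With a wrapper like this, the pipeline could log the failing file and move on. It would not help if the fault happens on a goroutine started inside the library, so a fix in Siegfried itself would still be the better outcome.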
In case you're curious about how we call Siegfried, the relevant code is at https://github.com/APTrust/preservation-services/blob/master/ingest/format_identifier.go#L86
Thanks for any help you can offer.