richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Possible to determine EOF from BOF from basis? #194

Closed ross-spencer closed 2 years ago

ross-spencer commented 2 years ago

Given this example from a TGA: 'extension match tga; byte match at 4261283, 18', is Siegfried saying it read 4 MB to reach the last 18 bytes of the file, or is it seeking 18 bytes from the end of the file? TGA is identified using TRUEVISION-XFILE.<null> in the last 18 bytes of the file, so only 18 bytes are needed.

Does it matter? I imagined the max offsets creating a small window at either end of the file payload, e.g. 1000 bytes max from BOF or 500 bytes max from EOF. Does establishing that through SF output alone require knowing which values are BOF offsets and which are EOF offsets?

filename : 'MARBLES.TGA'
filesize : 4261301
modified : 2022-06-02T13:14:16+02:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/402'
    format  : 'Truevision TGA Bitmap'
    version : '2.0'
    mime    :
    basis   : 'extension match tga; byte match at 4261283, 18'
    warning :

richardlehane commented 2 years ago

It didn't read 4 MB to match. This pattern would have been found during an EOF scan; it is just that in the basis field all offsets are reported as BOF offsets. The second value (18) is the length of the pattern match. Sometimes a signature requires multiple patterns to match, e.g. a BOF and an EOF. In those cases you'll get a list of offset, length pairs, e.g. byte match at [[0 14] [1822 2]]
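For anyone consuming the basis field in a script, here is a minimal sketch of pulling out the offset, length pairs. It assumes only the two shapes quoted in this thread (`byte match at 4261283, 18` and `byte match at [[0 14] [1822 2]]`); the function name `parse_byte_matches` is hypothetical and not part of Siegfried, and other basis formats may exist that this does not handle.

```python
import re


def parse_byte_matches(basis: str) -> list[tuple[int, int]]:
    """Return (bof_offset, length) pairs for each 'byte match at' clause.

    Assumption: clauses in the basis string are separated by semicolons,
    as in the example output quoted in this thread.
    """
    pairs = []
    for clause in basis.split(";"):
        clause = clause.strip()
        if not clause.startswith("byte match at"):
            continue  # e.g. 'extension match tga'
        # Both shapes reduce to a flat run of integers:
        # "4261283, 18" or "[[0 14] [1822 2]]".
        numbers = [int(n) for n in re.findall(r"\d+", clause)]
        # Pair them up as (offset, length).
        pairs.extend(zip(numbers[0::2], numbers[1::2]))
    return pairs


single = parse_byte_matches("extension match tga; byte match at 4261283, 18")
multiple = parse_byte_matches("byte match at [[0 14] [1822 2]]")
```

With the two strings from this thread, `single` holds one pair and `multiple` holds two, each reported as a BOF offset followed by the match length.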

You can of course convert it to an EOF offset by deducting it from the file size. If you want to establish a max window of BOF/EOF offsets, perhaps you could convert to EOF and assume it is an EOF offset if lower?
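The conversion and the "assume EOF if lower" heuristic suggested above could be sketched roughly as follows. This is only an illustration: the function name `classify_offset` is hypothetical, and treating a tie as BOF is an arbitrary assumption rather than anything Siegfried does.

```python
def classify_offset(bof_offset: int, filesize: int) -> tuple[str, int]:
    """Guess whether a reported BOF offset is better read as an EOF offset.

    The EOF offset is the file size minus the reported BOF offset, i.e. the
    distance from the start of the match to the end of the file.
    Assumption: whichever of the two distances is smaller wins; ties go to BOF.
    """
    eof_offset = filesize - bof_offset
    if eof_offset < bof_offset:
        return ("EOF", eof_offset)
    return ("BOF", bof_offset)


# Values from the MARBLES.TGA output in this thread.
tga = classify_offset(4261283, 4261301)
```

For the TGA above, 4261301 - 4261283 = 18, so the match is classified as an EOF match starting 18 bytes from the end of the file, which agrees with the 18-byte pattern length. A max-window check (e.g. 500 bytes from EOF) could then be applied to the returned value.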

ross-spencer commented 2 years ago

> You can of course convert it to an EOF offset by deducting it from the file size. If you want to establish a max window of BOF/EOF offsets, perhaps you could convert to EOF and assume it is an EOF offset if lower?

And now I'm having déjà vu! There must be something in previous emails or demystify issues talking about this.

I'll need to improve the subroutine here. This single EOF sequence is missing from that.

Ideally, I think I'd prefer the indicator to be provided by a tool like Siegfried; it might be something for DROID to decide on too (related to: https://github.com/digital-preservation/droid/issues/773). But conceptually, I suspect what I'd like is wrong. I'm essentially thinking about this in terms of signature development, where there's an explicit instruction to Siegfried/DROID that something is a BOF or an EOF sequence, and maybe those should be made clear so they can be optimized too. As a consumer of that information, though, are Siegfried's concerns, and what it displays back to the user, different? (But because we can put a heuristic into a tool like demystify or A. N. Other tool to work it out, it doesn't matter!)

Okay thanks Richard. Thinking out loud here, and caught between two different issues. Will close this and keep thinking about it.