richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Consider enabling modifiable BOF/EOF for signatures #258

Open ross-spencer opened 1 month ago

ross-spencer commented 1 month ago

Some files may not be identifiable because of some amount of padding at the beginning or end of the file. An experimental feature for Roy/Siegfried could extend the BOF offsets of all signatures in a signature file, e.g. always allow a BOF sequence to start up to n bytes later than the signature specifies. The same could apply to EOF. This might enable the identification of less uniform data than PRONOM-like signatures normally target.
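To make the idea concrete, here is a minimal sketch of offset matching with an extra allowance. The function names and the `slack` parameter are hypothetical and are not part of the roy or siegfried code; the sketch also assumes an EOF offset counts the bytes between the end of the sequence and the end of the file.

```python
def match_bof(data: bytes, seq: bytes, min_off: int, max_off: int, slack: int = 0) -> bool:
    """True if seq starts at an offset in [min_off, max_off + slack] from the start.

    slack is the hypothetical extra allowance added to every signature.
    """
    hi = max_off + slack
    # bytes.find bounds the end of the match, so the window is widened by len(seq).
    return data.find(seq, min_off, hi + len(seq)) != -1


def match_eof(data: bytes, seq: bytes, min_off: int, max_off: int, slack: int = 0) -> bool:
    """True if seq ends between min_off and max_off + slack bytes before the end."""
    hi = max_off + slack
    end = len(data) - min_off
    if end < len(seq):
        return False  # not enough room for the sequence
    start = max(0, len(data) - hi - len(seq))
    return data.find(seq, start, end) != -1


# Eight bytes of padding before the header, three newlines after the trailer.
padded = b"\x00" * 8 + b"%PDF-1.4 body %%EOF" + b"\n\n\n"
print(match_bof(padded, b"%PDF", 0, 0))            # strict signature
print(match_bof(padded, b"%PDF", 0, 0, slack=8))   # with 8 bytes of allowance
print(match_eof(padded, b"%%EOF", 0, 0, slack=3))  # with 3 bytes of allowance
```

With `slack=0` the behaviour is the same as the unmodified signature, so the allowance could be an opt-in setting when building a signature file.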

I suppose the BOF/EOF limitation could also be removed entirely, which would make it possible to match all sequences anywhere in a given stream of data. That might be a separate experimental feature? It would be interesting to understand the performance.
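A rough sketch of that unanchored variant, again with hypothetical names and not taken from siegfried: every sequence is searched across the whole stream and every hit offset is reported. This naive version rescans the data once per sequence, which is where the performance question comes from; a real matcher would likely use a multi-pattern search instead.

```python
from typing import Dict, List


def find_all(data: bytes, sequences: Dict[str, bytes]) -> Dict[str, List[int]]:
    """Return every offset at which each named sequence occurs in data."""
    hits: Dict[str, List[int]] = {}
    for name, seq in sequences.items():
        if not seq:
            continue  # an empty sequence would match everywhere
        offsets = []
        pos = data.find(seq)
        while pos != -1:
            offsets.append(pos)
            pos = data.find(seq, pos + 1)  # step by one to keep overlapping hits
        if offsets:
            hits[name] = offsets
    return hits


stream = b"xxPK\x03\x04yy%PDFzzPK\x03\x04"
print(find_all(stream, {"zip": b"PK\x03\x04", "pdf": b"%PDF", "gif": b"GIF8"}))
```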

NB. connected to this discussion here: https://twitter.com/beet_keeper/status/1814961844479574089

richardlehane commented 1 month ago

sounds interesting and doable as a feature for roy