richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Roy issue #187

Closed gleporeNARA closed 1 year ago

gleporeNARA commented 2 years ago

When attempting to use a PRONOM signature file that contains (58464952|52494658) as the signature (created with Ross' tool) I get a roy crash:

sudo roy build -extend Generic-RIFX-Container-1.0-signature-file.xml rifxgen.sig
2022/05/10 12:00:15 parse error dev/1: empty sequence

The above signature would match either RIFX or XFIR at the beginning of a file.

Not sure if this is an issue with roy or with ffdev.info, but having the ability to match multiple start sequences would be useful, especially for formats with both big and little endianness.

Signature attached.

RIFX-big-and-little-1.0-signature-file.zip

richardlehane commented 2 years ago

Hi Greg, I suspect roy is right to reject this signature. I checked the DROID signatures file and there are no signatures with empty sequence elements. Did you try loading this extension file in DROID? Does it work there?

The good news is there are other ways to achieve the outcome you want. I think the most idiomatic approach is just to define two "Internal Signatures" for the format. There are many of these in PRONOM, e.g. fmt/1458.

I'm not sure if Ross's tool allows you to make a signature like this, but hand-crafted it would look like:

[image: screenshot of the hand-crafted signature]
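The screenshot itself is not reproduced in this text. As a rough, untested sketch of the two-signature approach described above (not a copy of the original image), a hand-written extension file might look something like the following. The IDs, the format name, the version and the date are placeholders. The PUID `dev/1` is taken from the error message, and the shift values are an assumption that follows the same "length + 1" pattern seen in the generated output elsewhere in this thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: IDs, names, dates and shift values are assumptions,
     not taken from the attached signature file. -->
<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile"
                 Version="1" DateCreated="2022-05-10T00:00:00">
  <InternalSignatureCollection>
    <!-- Signature 1: bytes 52 49 46 58 ("RIFX") at the beginning of the file -->
    <InternalSignature ID="1" Specificity="Specific">
      <ByteSequence Reference="BOFoffset">
        <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
          <Sequence>52494658</Sequence>
          <DefaultShift>5</DefaultShift>
          <Shift Byte="52">4</Shift>
          <Shift Byte="49">3</Shift>
          <Shift Byte="46">2</Shift>
          <Shift Byte="58">1</Shift>
        </SubSequence>
      </ByteSequence>
    </InternalSignature>
    <!-- Signature 2: bytes 58 46 49 52 ("XFIR") at the beginning of the file -->
    <InternalSignature ID="2" Specificity="Specific">
      <ByteSequence Reference="BOFoffset">
        <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
          <Sequence>58464952</Sequence>
          <DefaultShift>5</DefaultShift>
          <Shift Byte="58">4</Shift>
          <Shift Byte="46">3</Shift>
          <Shift Byte="49">2</Shift>
          <Shift Byte="52">1</Shift>
        </SubSequence>
      </ByteSequence>
    </InternalSignature>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <!-- One format pointing at both signatures: either one matching identifies it -->
    <FileFormat ID="1" Name="Generic RIFX Container" PUID="dev/1" Version="1.0">
      <InternalSignatureID>1</InternalSignatureID>
      <InternalSignatureID>2</InternalSignatureID>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>
```

Each signature has a non-empty `Sequence`, so the either/or lives at the format level (two `InternalSignatureID` entries) rather than inside a single byte sequence.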

gleporeNARA commented 2 years ago

Hey! There's a corresponding issue for Ross' tool (https://github.com/exponential-decay/signature-development-utility/issues/26) and I agree the manual method works. I don't quite understand the error, however. Can you explain the "empty sequence elements" error? It seems a simple either/or construction to me.

ross-spencer commented 2 years ago

On my part, Greg, the translation from an input signature to an output signature is a pretty naive translation into code of the algorithm defined by PRONOM back in the day. We don't do any validation. Here, it's just creating the sequences for an option sub-sequence and outputting them. In reality though, you need an anchor for both those options to branch out from, e.g. ANCHORBYTES, shown as a placeholder below.

  <ByteSequence Reference="BOFoffset">
    <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
      <Sequence>ANCHORBYTES</Sequence>
      <DefaultShift>2</DefaultShift>
      <Shift Byte="00">1</Shift>
      <RightFragment MaxOffset="0" MinOffset="0" Position="1">58464952</RightFragment>
      <RightFragment MaxOffset="0" MinOffset="0" Position="1">52494658</RightFragment>
    </SubSequence>
  </ByteSequence>

vs.

  <ByteSequence Reference="BOFoffset">
    <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
      <Sequence/>    <!-- this is your empty sequence I believe -->
      <DefaultShift>1</DefaultShift>
      <RightFragment MaxOffset="0" MinOffset="0" Position="1">58464952</RightFragment>
      <RightFragment MaxOffset="0" MinOffset="0" Position="1">52494658</RightFragment>
    </SubSequence>
  </ByteSequence>

So we end up with <Sequence/> in the faulty one vs. <Sequence>ANCHORBYTES</Sequence>. So yeah, without something at the true beginning of file, it starts to look like two signatures.

I hope I can make that easier for you to define soon. I mentioned chatting to David, but I'm also thinking about workarounds.

I think it's really cool you're thinking of signatures like this btw. It's an intuitive idea to me.

richardlehane commented 1 year ago

Closing this one for now as it appears to be a syntax/PRONOM issue.