neilharvey / FileSignatures

A small library for detecting the type of a file based on header signature (also known as magic number).
MIT License

Multiple Signatures and Offsets #28

Closed: justin-dx closed this 4 years ago

justin-dx commented 4 years ago

I've been adding my own file extensions, mainly for raw camera images (cr2, nef, orf, dng etc). Some of these raw file types use magic numbers at multiple offsets, for example 08 00 00 00 at offset 4 and 2D 00 FE 00 at offset 8. Both sets of bytes must match at the specified offsets for the file to count as that type. Is it possible to handle multiple signatures for a single stream, or is this handled by the package already?

tiesont commented 4 years ago

I handled this for the AVI video type by overriding the IsMatch(Stream) method - if you know where the "fixed" magic bytes are, you can just check those byte ranges. I don't know that what I did was the most efficient way (I assume it's not), but it seems to work. If it helps:

using System.IO;
using FileSignatures;

public abstract class Video : FileFormat
{
    protected Video(byte[] signature, string mediaType, string extension, int offset = 0) : base(signature, mediaType, extension, offset)
    {
    }
}

public class AviVideo : Video
{
    // bytes 4 to 7 are variable --> we're using a placeholder byte (00) for those values
    private static readonly byte[] expected = new byte[] { 0x52, 0x49, 0x46, 0x46, 00, 00, 00, 00, 0x41, 0x56, 0x49, 0x20 };
    private const string mediaType = "video/x-msvideo";
    private const string extension = ".avi";

    public AviVideo()
        : base(expected, mediaType, extension)
    {
    }

    public override bool IsMatch(Stream stream)
    {
        byte[] header;

        if (stream != null)
        {
            using (var ms = new MemoryStream())
            {
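                // Rewind first: CopyTo starts at the current position, which an earlier check may have moved
                if (stream.CanSeek)
                {
                    stream.Position = 0;
                }

                // Note: this buffers the whole stream, although only the first 12 bytes are inspected
                stream.CopyTo(ms);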
                header = ms.ToArray();

                // We need at least 12 bytes of data to check for an AVI header
                if (header.Length > 11)
                {
                    // bytes 0 - 3 must match expected values
                    for (int i = 0; i < 4; i++)
                    {
                        if (expected[i] != header[i])
                        {
                            return false;
                        }
                    }

                    // bytes 8 - 11 must match expected values
                    for (int i = 8; i < 12; i++)
                    {
                        if (expected[i] != header[i])
                        {
                            return false;
                        }
                    }

                    return true;
                }
            }
        }

        return false;
    }
}

You don't really need the base class - I created that because I needed to add multiple video formats and wanted to be able to do something like

var isVideo = format is Video;

neilharvey commented 4 years ago

Overriding IsMatch like @tiesont described is the way to do it.

Multiple signatures within a file extension usually indicate that a format is a specialisation of something else. For example, CR2 is apparently based on TIFF, so if I were doing it in the library I would inherit from the base class and override IsMatch. If it's just for your own use, do whatever works :)

(While looking at this I realised that I'm including the TIFF byte order marks in the file signature so I should go and fix that)
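
For anyone landing here with the same question, the idea in this thread (every fixed byte range must match at its own offset) can be sketched as a small standalone helper. This is only an illustration: `SegmentMatcher` and `MatchesAll` are hypothetical names, not part of FileSignatures, and the sketch assumes a seekable stream. It reads only the bytes it needs rather than copying the whole stream.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of the FileSignatures library.
public static class SegmentMatcher
{
    // Returns true only if every (offset, bytes) segment is present at its offset.
    // Assumes a seekable stream; non-seekable or null streams are reported as no match.
    public static bool MatchesAll(Stream stream, params (int Offset, byte[] Bytes)[] segments)
    {
        if (stream == null || !stream.CanSeek)
        {
            return false;
        }

        foreach (var (offset, bytes) in segments)
        {
            // A segment that would run past the end of the stream cannot match.
            if (offset + bytes.Length > stream.Length)
            {
                return false;
            }

            stream.Position = offset;
            foreach (var expected in bytes)
            {
                // ReadByte returns -1 at end of stream, which never equals a byte value.
                if (stream.ReadByte() != expected)
                {
                    return false;
                }
            }
        }

        return true;
    }
}
```

Inside an `IsMatch(Stream)` override like the AVI one above, the two loops could then collapse to a single call such as `SegmentMatcher.MatchesAll(stream, (0, riff), (8, avi))`, where `riff` and `avi` are byte arrays holding `52 49 46 46` and `41 56 49 20`. The raw-image case from the original question would pass its own offsets (4 and 8) the same way.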