sindresorhus / file-type

Detect the file type of a file, stream, or data
MIT License
3.63k stars 345 forks source link

Custom detectors running after default detectors #628

Open farski opened 6 months ago

farski commented 6 months ago

We've run into a situation where a lot of MP3 files that we're getting from a particular source include a junk byte that is throwing off file type detection. (They seem to include an extra, unnecessary byte after the ID3 headers, but before FF E0, so the MPEG detector is missing them.)

We're looking to add a slightly more permissive, low priority check that can deal with this junk byte, but not at the risk of breaking any existing detectors, and therefore we'd only want it to run if no other default or custom detectors were successful.

Is that an enhancement that the project would consider?

Borewit commented 2 months ago

How do know you have to read your way through the junk, byte by byte, if you are even unsure at that point what you are reading?

If you know the file is an MP3 and want a kind of check if it actually is, you maybe be better of using music-metadata. It will read though the junk, but only if it can derive the file type from the filename or content-type.

farski commented 2 months ago

@Borewit At the point we are running file-type, we don't know if the file is an MP3. It could be any audio file type that we accept.

These banged up MP3 files that we've run into don't get detected as any file type at the moment.

That's why the functionality I'm looking for is to be able to place additional, custom checks after all other checks, and only if none matched. In the case of these malformed MP3 files, all the native checkers would fail. Then, this custom checker would run, and it would know specifically how to detect these malformed MP3 files. The check would look nearly identical to the existing MPEG/MP3 detector in file-type, but would be slightly more permissive, and thus would catch files even with the junk byte. Doing it in this order means no well-formed files that are matched by native detectors would be false positives. There is perhaps some vanishingly small chances that the custom detector does result in some false positives for us for non-MP3 files that also don't match any normal detectors, but that's something we'd be willing to live with, and given the workflow we'd be using this functionality in would essentially be a non factor.

Borewit commented 2 months ago

It is a chicken and egg issue.

MP3 files prefixed by junk do appear in the wild, and due to the way they are encoded, MP3 decoders keep reading the junk until they the find a synchronization frame. How on earth folks manage to get this junk prefixing the MP3 is a mystery to me, as this is totally unnescacary. And none of the well respected encoders does generate this junk.

There are MP3 with, and without a prefixing ID3 headers.

Let's give an MP3 file with no ID3 header, and 66kb of prefixed junk to file-type. For file-type this could be any file. How for should we go to detect it a (close to corrupted) MP3 file, when it could be any file. How much junk should we read, and how much CPU time should we waste to this, knowing this will be done to files which are not MP3s?

If we detect a file with an ID3 header, it becomes likely it is an MP3, but there are exceptions. as other audio formats also put ID3 headers in front of them. In line with the standars or not, it is being done.

When we do a "deep read" in file-type, which is sometimes done to determine the sub type, there is no point of return. We start to read from the stream, that data is getting consumed. In case of ID3 header, that is done as well, but resolved with a recursion. The ID3 data is consumed, read from the stream, but after the ID3 header, we restart the file detection.

You could indeed add a deep MP3 file checker at the very end. But there can be one and only one such "deep" checker at the very end, so maybe reasonable to make this a custom resolver.

farski commented 2 months ago

I agree that there are types of corruption (with MP3s and other file types) that lead to the situation you're describing. Handling the specific situation we were running into would not have required a deep read; it only required a very small modification to the existing MPEG check. I'm not saying that this type of MP3 detection would be useful to anyone else, but for whatever reason, we continue to see a lot of MP3 files coming from one particular competitor with an identical 1-byte malformation, and when customers move from that platform to ours we get an influx of undetectable files due to this bug. (Which we are currently handling with a workaround.)

If there were a way to inject low-priority checkers, we could very easily solve this issue, without introducing any risk of short-circuiting the existing, correct, highly trustworthy detectors that ship with file-type.

We're not trying to solve for all possible ways that MP3 files can be banged up. For our business case, being able to handle this very specific 1 byte case would have been sufficient, but only if it could be done in a low priority way. That is where the request for having an option for running some custom detectors after all other detectors came from.

Borewit commented 1 month ago

Can you provide a sample of such a file @farski ?

farski commented 1 month ago

@Borewit I could provide one to you privately; it's publicly-available content, but we don't hold the rights to so I can't post it here.

Borewit commented 1 month ago

I believe I have not received any sample yet @farski , please include a reference to this issue, if you do send it.

Got it.

Borewit commented 1 month ago

This custom detector does the job:

import {NodeFileTypeParser} from 'file-type';
import path from 'node:path';
import { fileURLToPath } from 'node:url';

const filename = fileURLToPath(import.meta.url);
const dirname = path.dirname(filename);

function check(buffer, headers, options) {
  options = {
    offset: 0,
    ...options,
  };

  for (const [index, header] of headers.entries()) {
    // If a bitmask is set
    if (options.mask) {
      // If header doesn't equal `buf` with bits masked off
      if (header !== (options.mask[index] & buffer[index + options.offset])) {
        return false;
      }
    } else if (header !== buffer[index + options.offset]) {
      return false;
    }
  }

  return true;
}

function scanMp3(sampleBuffer) {
  // Check MPEG 1 or 2 Layer 3 header, or 'layer 0' for ADTS (MPEG sync-word 0xFFE)
  if (sampleBuffer.length >= 2 && check(sampleBuffer, [0xFF, 0xE0], {offset: 0, mask: [0xFF, 0xE0]})) {
    if (check(sampleBuffer, [0x10], {offset: 1, mask: [0x16]})) {
      // Check for (ADTS) MPEG-2
      if (this.check([0x08], {offset: 1, mask: [0x08]})) {
        return {
          ext: 'aac',
          mime: 'audio/aac',
        };
      }

      // Must be (ADTS) MPEG-4
      return {
        ext: 'aac',
        mime: 'audio/aac',
      };
    }

    // MPEG 1 or 2 Layer 3 header
    // Check for MPEG layer 3
    if (check(sampleBuffer, [0x02], {offset: 1, mask: [0x06]})) {
      return {
        ext: 'mp3',
        mime: 'audio/mpeg',
      };
    }

    // Check for MPEG layer 2
    if (check(sampleBuffer, [0x04], {offset: 1, mask: [0x06]})) {
      return {
        ext: 'mp2',
        mime: 'audio/mpeg',
      };
    }

    // Check for MPEG layer 1
    if (check(sampleBuffer, [0x06], {offset: 1, mask: [0x06]})) {
      return {
        ext: 'mp1',
        mime: 'audio/mpeg',
      };
    }
  }
}

async function deepMp3Detector(tokenizer) {
  const maxDepth = 3;
  const buffer = new Uint8Array(2 + maxDepth);
  await tokenizer.peekBuffer(buffer);
  for(let depth = 0; depth < maxDepth; ++depth) {
    const type = scanMp3(buffer.subarray(depth));
    if (type)
      return type;
  }
}

async function parseFromFile(path) {
  const customDetectors = [deepMp3Detector];
  const parser = new NodeFileTypeParser({customDetectors: customDetectors});
  return await parser.fromFile(path);
}

const result = await parseFromFile(path.join(dirname, 'sample', 'CNE7088657618.mp3'), {});
console.log(JSON.stringify(result));