sindresorhus / file-type

Detect the file type of a file, stream, or data
MIT License
3.64k stars 345 forks

Extend file type (updated) #603

Closed FredrikSchaefer closed 10 months ago

FredrikSchaefer commented 1 year ago

Hey There :)

I came across this issue #340 and found that it's just what we need. As the initial PR is way outdated, I decided to start over from scratch.

Resolves #340.

FredrikSchaefer commented 1 year ago

Gentle nudge towards @sindresorhus. We are really looking forward to using this feature in the next official version.

sindresorhus commented 1 year ago

I'm on vacation, so I won't be able to review this for another couple of weeks.

FredrikSchaefer commented 1 year ago

> I'm on vacation, so I won't be able to review this for another couple of weeks.

Thanks a lot for squeezing in time to send change requests, even on vacation! We'll make sure not to rely on quick action from your side in the meantime. Enjoy your time off!

Borewit commented 1 year ago

I am on holiday as well, this time without a laptop. Back in 2 weeks.

FredrikSchaefer commented 1 year ago

Another gentle nudge towards @sindresorhus and @Borewit, so this PR does not get lost in the vortex of time.

FredrikSchaefer commented 12 months ago

... And yet another gentle nudge to @sindresorhus. I'd really appreciate it if you found the time for a review :-)

FredrikSchaefer commented 11 months ago

Is there anything I can do to support the review of this PR?

Borewit commented 11 months ago

I will try to find some time as well to review this PR, @FredrikSchaefer. Please give me 2 weeks.

FredrikSchaefer commented 10 months ago

> I don't think we should throw an error when the scanner has moved the position and returned null, as that is not necessarily a bad implementation of the custom scanner.
>
> We just have to return undefined (and not execute any of the following scanners).

Wouldn't that be a bit inconsistent with the wording here?

It currently says:

> If the detector returns undefined, it is not allowed to read from the tokenizer (the tokenizer.position must remain 0) otherwise following scanners will read from the wrong file offset.
>
> If the detector returns file_type_undetectable, the detector is certain the file type cannot be determined, even by other scanners. The FileTypeParser interrupts the parsing and immediately returns undefined.

This indicates to me that, in the first scenario, the logic behaves differently from the second scenario; also, "is not allowed" and "must" tell me that doing so is really undesired. I think it would help debugging if an error resulted from that.

If you don't agree, should we adjust the wording?

Borewit commented 10 months ago

> If you don't agree, should we adjust the wording?

What I am saying, and what I have been saying, is that when a detector decides to do a deep scan (start reading / consuming the stream / tokenizer), it may not necessarily be able to return a file type, and may return undefined instead. In that case file-type should stop trying other detectors, as the stream has been consumed, and return undefined as the final conclusion.

For example, when a detector detects an ID3 header, which can be 2 MB, you need to read from the stream to get to the actual audio part. There is no guarantee the detector will be able to detect the file type coming after that. So let's assume that is the case: there is just white noise after the ID3 header. We cannot fall back on any more generic file type, as the ID3 header is insufficient to determine the file type; the most probable type is audio/mpeg, but that would be speculation. The detector should return undefined here, and since the position has been moved, we do not proceed with any other detectors; we should return undefined to the user. There is no reason to raise an error.
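To make the scenario concrete, here is a rough sketch of such a deep-scanning detector. The tokenizer mock (`makeTokenizer` with `peekBuffer`/`ignore`/`position`) is a simplified stand-in for the strtok3 tokenizer file-type actually uses, and the detector itself is hypothetical, not code from this PR:

```javascript
// Mock tokenizer: just enough surface to demonstrate the contract.
function makeTokenizer(bytes) {
	return {
		position: 0,
		peekBuffer(length) {
			return bytes.slice(this.position, this.position + length);
		},
		ignore(length) {
			this.position += length; // Consuming the stream moves the position.
		},
	};
}

// Hypothetical detector: if an ID3v2 tag is present, skip past it, then look
// for an MPEG frame sync (0xFF 0xEx) in whatever follows the tag.
function id3Detector(tokenizer) {
	const header = tokenizer.peekBuffer(10);
	if (!(header[0] === 0x49 && header[1] === 0x44 && header[2] === 0x33)) {
		// Not "ID3": position is untouched, so other detectors may still run.
		return undefined;
	}

	// ID3v2 tag size is a 28-bit sync-safe integer in bytes 6..9.
	const tagSize =
		(header[6] << 21) | (header[7] << 14) | (header[8] << 7) | header[9];
	tokenizer.ignore(10 + tagSize); // Deep scan: the stream is now consumed.

	const audio = tokenizer.peekBuffer(2);
	if (audio[0] === 0xFF && (audio[1] & 0xE0) === 0xE0) {
		return {ext: 'mp3', mime: 'audio/mpeg'};
	}

	// "White noise" after the tag: per the contract discussed above, the
	// position has moved, so the overall result must be a final undefined.
	return undefined;
}
```

The key point of the sketch is the last branch: returning undefined after the position has moved signals "no further detectors, final answer undefined" rather than an error.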

Borewit commented 10 months ago

Before someone else points this out: why, then, are we guessing the file type in the existing code?

https://github.com/sindresorhus/file-type/blob/ca7a03c26761444e024764a38d16597455281cfd/core.js#L206-L210

Good question.

FredrikSchaefer commented 10 months ago

> If you don't agree, should we adjust the wording?

> What I am saying, and what I have been saying, is that when a detector decides to do a deep scan (start reading / consuming the stream / tokenizer), it may not necessarily be able to return a file type, and may return undefined instead. In that case file-type should stop trying other detectors, as the stream has been consumed, and return undefined as the final conclusion.
>
> For example, when a detector detects an ID3 header, which can be 2 MB, you need to read from the stream to get to the actual audio part. There is no guarantee the detector will be able to detect the file type coming after that. So let's assume that is the case: there is just white noise after the ID3 header. We cannot fall back on any more generic file type, as the ID3 header is insufficient to determine the file type; the most probable type is audio/mpeg, but that would be speculation. The detector should return undefined here, and since the position has been moved, we do not proceed with any other detectors; we should return undefined to the user. There is no reason to raise an error.

Thank you very much for the detailed explanation. It helps me understand your reasoning. In that scenario, doesn't it make more sense to use recursion? Example:

```js
let parser;

const recursiveDetector = async tokenizer => {
	// Some logic reading from the tokenizer

	return parser.fromTokenizer(tokenizer);
};

const customDetectors = [recursiveDetector];
parser = new FileTypeParser({customDetectors});

const result = await parser.fromTokenizer(someTokenizer);
```

All in all, the current behavior seems a bit clearer and less error-prone, while still being flexible enough not to block any valid use. In case you insist on that change, I'd really appreciate it if you kindly provided an updated version of the docstring here. Honestly, I would like to finalize this soon, so it would be great if we could cut down the number of review iterations a bit...

Borewit commented 10 months ago

> I would like to finalize this soon

Sure, don't throw the error. Yes, we do use recursion in that scenario, and that is exactly why we need to return a final undefined instead of an error. So in the recursion all detectors are put into play, but if they do not manage to resolve the file type, the ID3 detector needs to return a final undefined.

> I'd really appreciate if you kindly provided an updated version of the docstring

I propose something like this:

Custom detectors can be used to add new FileTypeResults or to modify the return behaviour of existing FileTypeResult detections. If the detector returns undefined, there are two possible scenarios:

  1. The detector has not read from the tokenizer; file-type will proceed with the next available detector.
  2. The detector has read from the tokenizer (tokenizer.position has been increased). In that case no further detectors will be executed and the final conclusion is that file-type returns undefined. Note that this is an exceptional scenario, as the detector takes away the opportunity for any other detector to determine the file type.
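A minimal sketch of the dispatch logic these two scenarios imply might look as follows. The function name and the shapes of the detector and tokenizer objects are simplified stand-ins for illustration, not file-type's actual internals:

```javascript
// Hypothetical dispatch loop over custom detectors, implementing the
// two-scenario contract described above.
async function detectWithCustomDetectors(tokenizer, detectors) {
	for (const detector of detectors) {
		const result = await detector(tokenizer);
		if (result !== undefined) {
			return result; // Detector identified the file type.
		}

		if (tokenizer.position > 0) {
			// Scenario 2: the detector consumed the stream without a match.
			// No later detector can read from offset 0 anymore, so the
			// final conclusion is undefined; no error is thrown.
			return undefined;
		}
		// Scenario 1: nothing was read; fall through to the next detector.
	}

	// No custom detector matched without consuming the stream; file-type's
	// built-in detectors would normally take over at this point.
	return undefined;
}
```

The only state the loop inspects is `tokenizer.position`, which is what makes the contract cheap to enforce without an explicit "detection impossible" sentinel.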

FredrikSchaefer commented 10 months ago

I'm sorry, I missed that part in explaining my reasoning:

In the case of recursion (following my argumentation), the logic should look like this:

```js
let parser;

const recursiveDetector = async tokenizer => {
	// Some logic reading from the tokenizer

	return await parser.fromTokenizer(tokenizer) || FileTypeParser.detectionImpossible;
};

const customDetectors = [recursiveDetector];
parser = new FileTypeParser({customDetectors});

const result = await parser.fromTokenizer(someTokenizer);
```

I'm a bit concerned that users will unintentionally read from the tokenizer and then be surprised to get an undefined file type result. In that scenario, an error could be helpful. That's why I am a bit resistant here.

If that still doesn't convince you, I'll not throw the error.

Borewit commented 10 months ago

We essentially agreed to replace the explicit "detection-impossible" with a mechanism that detects that the tokenizer has been consumed (result == undefined & the tokenizer position has been changed). So that's the same thing, but easier for the user to implement.

Yes, there is a risk (limitation) that users will read from the tokenizer, be unable to detect the type, therefore return undefined, misunderstand the mechanism, and expect the remaining detectors to proceed. But if we throw an error, we also make perfectly valid cases impossible to implement.

That same risk (limitation) is present in the internal detectors. If we restrict the power of the custom detectors, they are pretty much useless, as there is mostly a need to cover complex detection (requiring deep reading) more than simple detection (the first few bytes), since most of the simple ones are probably already implemented by us.

If I lend you my car, you can crash it.

FredrikSchaefer commented 10 months ago

Okay then. I adjusted the logic. The docstring you suggested is also included; I just polished it a bit.

Borewit commented 10 months ago

Forgive me, wrong account, same person.

Borewit commented 10 months ago

@sindresorhus, kind reminder: please proceed with merging if you are happy with this one.