rebeccapowell closed this issue 3 years ago
The file should only be read in its entirety in certain circumstances. The IFileFormatReader interface is used by a couple of formats, specifically OfficeOpenXml and CompoundFileBinary.
OfficeOpenXml (Office) files are based on the Zip format, and detection is done by looking through the list of entries in the zip file for a particular item which we use to identify the type. Unfortunately this requires that we read the entire zip into memory; if I recall correctly, the content table is stored at the tail of the zip, so it is unavoidable. For CompoundFileBinary (legacy Office files) we pass the stream off to another library, which I believe needs to read it to the end, but I would need to check.
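To illustrate the entry-based detection described above, here is a minimal sketch in Python rather than the library's own C#. The marker entry names and the `identify_office_zip` and `build_sample` helpers are assumptions made for this example and are not taken from the FileSignatures source. Python's `zipfile` module needs a seekable stream because it locates the entry table from the end of the archive, which matches the point about the table being stored at the tail.

```python
import io
import zipfile

# Hypothetical marker entries, chosen for illustration only.
MARKER_ENTRIES = {
    "word/document.xml": "Word",
    "xl/workbook.xml": "Excel",
    "ppt/presentation.xml": "PowerPoint",
}


def identify_office_zip(stream):
    """Return a label for the first marker entry found in the archive, else None."""
    try:
        with zipfile.ZipFile(stream) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        # Not a zip archive at all, so it cannot be a zip-based Office file.
        return None
    for entry, label in MARKER_ENTRIES.items():
        if entry in names:
            return label
    return None


def build_sample(entry_name):
    """Build a tiny in-memory zip containing a single entry (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(entry_name, "<xml/>")
    buffer.seek(0)
    return buffer
```

Calling `identify_office_zip(build_sample("xl/workbook.xml"))` exercises the lookup without needing a real Office document on disk.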
For other formats it shouldn't read the entire file; it should iterate through all the formats and discard the ones which do not match (many of these will be discarded after reading the first byte or two) until it finds a match.
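The elimination idea can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the library's implementation: the `SIGNATURES` table is a tiny hypothetical sample, and `detect_by_elimination` is a name invented for this example. It also simplifies by returning the first signature that is fully matched, so a shorter signature wins over a longer one that shares its prefix.

```python
import io

# Hypothetical signature table for illustration; a real table is much larger.
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "pdf": b"%PDF",
    "gif": b"GIF8",
}


def detect_by_elimination(stream):
    """Read one byte at a time, dropping formats whose signature stops matching.

    Returns a tuple of (format name or None, number of bytes read).
    """
    candidates = dict(SIGNATURES)
    position = 0
    while candidates:
        byte = stream.read(1)
        if not byte:
            # Stream ended before any signature was fully matched.
            return None, position
        offset = position
        position += 1
        # Keep only the formats whose signature agrees at this offset.
        candidates = {
            name: sig
            for name, sig in candidates.items()
            if sig[offset] == byte[0]
        }
        # A candidate matches once its whole signature has been consumed.
        for name, sig in candidates.items():
            if len(sig) == position:
                return name, position
    return None, position
```

With this table, an unrecognised file is rejected after a single byte, and a PDF is identified after four, regardless of how large the file is.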
Closing - answer as per above.
Based on the code, it looks like the entire file is read into memory, rather than just the file header.
https://github.com/neilharvey/FileSignatures/blob/be7656778addb83042a585fec05e8e6254b9bc72/src/FileSignatures/FileFormatInspector.cs#L89
Is that correct?
Since you know the max length of all headers:
https://github.com/neilharvey/FileSignatures/blob/be7656778addb83042a585fec05e8e6254b9bc72/src/FileSignatures/FileFormatInspector.cs#L71
Would it be possible to read only the part of the file that is required for identification?
For really large files, reading the entire file when you only need the first few bytes is a bit painful.
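A rough sketch of what is being asked for, written in Python for illustration only: read at most the longest known header length and match against that buffer. The `SIGNATURES` table, `read_header`, and `detect_from_header` are hypothetical names for this example, and the sketch assumes every signature starts at offset 0. Container formats that are identified by their contents rather than a header would still need a separate path.

```python
import io

# Hypothetical signature table for illustration only.
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "pdf": b"%PDF",
    "gif": b"GIF8",
}

# The longest signature bounds how many bytes ever need to be read.
MAX_HEADER_LENGTH = max(len(sig) for sig in SIGNATURES.values())


def read_header(stream, length=MAX_HEADER_LENGTH):
    """Read up to `length` bytes, looping because read() may return fewer."""
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            break  # End of stream reached before `length` bytes were read.
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def detect_from_header(stream):
    """Match the bounded header buffer against each known signature."""
    header = read_header(stream)
    for name, sig in SIGNATURES.items():
        if header.startswith(sig):
            return name
    return None
```

For a one-megabyte stream that begins with `%PDF`, `detect_from_header` consumes only `MAX_HEADER_LENGTH` bytes (8 with this table) before returning.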