mmalecot / file-format

Crate for determining the file format of a given file or stream
Apache License 2.0
94 stars, 15 forks

Support for multi-format files / check against singular format #29

Closed yael333 closed 1 year ago

yael333 commented 1 year ago

Hi 👋 I love this project, such an amazing selection of file formats~ I'm not sure whether I've fully understood the macro sorcery, but is there a way to check whether a file contains multiple types (such as the various esoteric file polyglots)?

If not, just being able to check a file against every single format individually would be great. Thank you so much for the help <3

mmalecot commented 1 year ago

Hi,

Thanks for the compliment! I hope you'll manage to understand how these macros work. I'll add more comments to the code in the next version to make everything as clear as possible :).

I didn't know about polyglot files, that's very interesting, thanks for sharing!

Currently, file-format, via FileFormat::from_file, returns only one file format: for polyglot files, the first one.

On the other hand, with FileFormat::from_reader or FileFormat::from_bytes, it should be possible to identify all the formats contained in a polyglot file, if we can determine the beginning of each of them.
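As a rough illustration of that idea, the sketch below runs a detector on the slice starting at each offset of a buffer and keeps the offsets where a format is recognized. The scan_offsets helper and the toy detect function are hypothetical and not part of file-format; with the crate, the closure would presumably wrap FileFormat::from_bytes and treat the generic FileFormat::ArbitraryBinaryData result as "nothing recognized".

```rust
/// Hypothetical helper: runs `detect` on the slice starting at every offset
/// and collects `(offset, format)` wherever a format is recognized.
fn scan_offsets<T, F>(bytes: &[u8], detect: F) -> Vec<(usize, T)>
where
    F: Fn(&[u8]) -> Option<T>,
{
    (0..bytes.len())
        .filter_map(|offset| detect(&bytes[offset..]).map(|format| (offset, format)))
        .collect()
}

/// Toy detector used only for this example: recognizes two well-known
/// signatures at the start of the slice it is given.
fn detect(slice: &[u8]) -> Option<&'static str> {
    if slice.starts_with(b"GIF89a") {
        Some("GIF")
    } else if slice.starts_with(b"PK\x03\x04") {
        Some("ZIP")
    } else {
        None
    }
}

fn main() {
    // A GIF signature, some filler bytes, then a ZIP local file header signature.
    let mut data = Vec::new();
    data.extend_from_slice(b"GIF89a");
    data.extend_from_slice(&[0u8; 10]);
    data.extend_from_slice(b"PK\x03\x04");

    for (offset, format) in scan_offsets(&data, detect) {
        println!("{} signature at offset {}", format, offset);
    }
}
```

Trying every offset is the brute-force version: it costs one detection call per byte, and short signatures will produce false positives, so a real implementation would need some way to narrow down the candidate offsets.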

Thanks for asking!

mmalecot commented 1 year ago

In fact, it might be necessary to extend FileFormat::from_file so that it does not return a single format for polyglot files (it could return an error, or the generic FileFormat::ArbitraryBinaryData format). Otherwise, the crate could be fooled.

If you think it's possible and useful, we could also imagine a polyglot feature that enables a FileFormat::from_polyglot_file method, which would return several file formats.
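To make the two options a bit more concrete, here is a hypothetical sketch of what the results could look like. None of these names exist in file-format: Detection, classify and strict are invented for illustration, and the code is generic over the format type so that it does not depend on the crate.

```rust
/// Hypothetical outcome of a polyglot-aware detection pass.
#[derive(Debug, PartialEq)]
enum Detection<T> {
    /// No known format matched.
    Unknown,
    /// Exactly one format matched.
    Single(T),
    /// Several formats matched: the input looks like a polyglot.
    Polyglot(Vec<T>),
}

/// Groups a list of matched formats into a `Detection`.
fn classify<T>(mut matches: Vec<T>) -> Detection<T> {
    match matches.len() {
        0 => Detection::Unknown,
        1 => Detection::Single(matches.remove(0)),
        _ => Detection::Polyglot(matches),
    }
}

/// Strict variant: refuses to name a single format for a polyglot and
/// returns the competing matches as the error instead.
fn strict<T>(detection: Detection<T>) -> Result<Option<T>, Vec<T>> {
    match detection {
        Detection::Unknown => Ok(None),
        Detection::Single(format) => Ok(Some(format)),
        Detection::Polyglot(formats) => Err(formats),
    }
}

fn main() {
    println!("{:?}", strict(classify(vec!["PDF", "ZIP"])));
    println!("{:?}", strict(classify(vec!["PNG"])));
}
```

The strict function corresponds to the "return an error" option, while returning the Polyglot variant directly is closer to what a from_polyglot_file method might hand back.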

In any case, I don't think it's easy to delimit sub-files.

If you have any ideas, I'd love to hear them!

yael333 commented 1 year ago

Thank you so much for the quick and thorough response~ Detecting and working with polyglot files is still quite arcane and esoteric, which is why I started this Rust project~

These files usually have overlapping sections, since they're not regular archive files: if you take the same slice of a file and check it for signatures, it will pass for multiple formats. The definition of these files is also vague, and usually depends on validation by an external parser or program (for example, most PDF polyglots don't follow the official standard but still open fine in most PDF readers).

Whether you'd wish to support parsing these files depends on the scope of your program, but if needed I can contribute as well. I'll keep you updated on how integrating this awesome module into my project goes <3
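A small sketch of that overlap, where the same buffer is run through several independent checks and more than one passes. The checks are simplified assumptions for illustration, not how file-format or any particular reader works: the PDF check accepts a %PDF- marker anywhere in the first 1024 bytes, the ZIP check looks for an end-of-central-directory signature anywhere in the buffer, and the GIF check requires its signature at offset 0. All function names are hypothetical.

```rust
/// Returns true if `needle` (non-empty) occurs anywhere in `haystack`.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|window| window == needle)
}

/// Lenient PDF check: `%PDF-` somewhere in the first 1024 bytes.
fn looks_like_pdf(bytes: &[u8]) -> bool {
    let head = &bytes[..bytes.len().min(1024)];
    contains(head, b"%PDF-")
}

/// ZIP check: end-of-central-directory signature anywhere in the buffer
/// (a simplification; real readers search backwards from the end).
fn looks_like_zip(bytes: &[u8]) -> bool {
    contains(bytes, b"PK\x05\x06")
}

/// GIF check: the signature must sit at offset 0.
fn looks_like_gif(bytes: &[u8]) -> bool {
    bytes.starts_with(b"GIF87a") || bytes.starts_with(b"GIF89a")
}

/// Runs every check on the same bytes and returns the names that pass.
fn matching_formats(bytes: &[u8]) -> Vec<&'static str> {
    let checks: [(&'static str, fn(&[u8]) -> bool); 3] = [
        ("GIF", looks_like_gif),
        ("PDF", looks_like_pdf),
        ("ZIP", looks_like_zip),
    ];
    checks
        .iter()
        .filter(|(_, check)| check(bytes))
        .map(|(name, _)| *name)
        .collect()
}

fn main() {
    // One buffer that satisfies all three simplified checks at once.
    let mut data = Vec::new();
    data.extend_from_slice(b"GIF89a");
    data.extend_from_slice(b"...%PDF-1.7...");
    data.extend_from_slice(b"PK\x05\x06");

    println!("{:?}", matching_formats(&data));
}
```

Because each check looks at a different region and tolerates junk elsewhere, none of them can tell that the others also match, which is why a single "first match wins" answer hides the polyglot.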

mmalecot commented 1 year ago

Yes, please keep me posted! Feel free to open a PR, I'll follow your project!