mmalecot / file-format

Crate for determining the file format of a given file or stream
Apache License 2.0

Media to kind #26

Closed ahmed-masud closed 1 year ago

ahmed-masud commented 1 year ago

This patch provides the ability to map a media-type to a file kind. The use of this is to take external media-type reporters (e.g. libmagic) and map them to the internal Kind ... It will help in further expanding file-format's FileType definitions, and lays a path to add all of the magic types into file-format.

mmalecot commented 1 year ago

Thanks for your PR! I'll review it ASAP.

mmalecot commented 1 year ago

Why map a media type to Kinds and not to FileFormats? What's the point of returning a Vec<Kind>? Shouldn't we return the generic Application kind if a media type is shared across several kinds? Is caching via static necessary? Thanks.

ahmed-masud commented 1 year ago

I'll share the use-case that I am using this for, and also some thoughts I have for a few future PRs to file-format.

Because Application can mean either "unrecognized" or "shared across several kinds", returning it would be ambiguous; hence the vector. I added the caching of the vector for purely selfish reasons :-) ... In my use case, it's in a very hot code path and gets called from 100k to 1M times a second.
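For illustration, here is a minimal sketch of a cached media-type lookup that returns every matching kind. It is not the code from the PR: the `Kind` enum is a reduced stand-in for the crate's own, the `kinds_for` and `table` names are hypothetical, and the table entries are made up. It assumes `std::sync::OnceLock`, which was stabilized in Rust 1.70.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical stand-in for file-format's `Kind`; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Audio,
    Video,
}

/// Builds the lookup table on first use and caches it for the rest of the process.
fn table() -> &'static HashMap<&'static str, Vec<Kind>> {
    static TABLE: OnceLock<HashMap<&'static str, Vec<Kind>>> = OnceLock::new();
    TABLE.get_or_init(|| {
        let mut map: HashMap<&'static str, Vec<Kind>> = HashMap::new();
        // Illustrative entries only; not taken from the crate's database.
        map.insert("audio/mpeg", vec![Kind::Audio]);
        map.insert("application/ogg", vec![Kind::Audio, Kind::Video]);
        map
    })
}

/// Returns every kind associated with `media_type`.
/// An unknown media type yields an empty slice, so "unrecognized" and
/// "shared across several kinds" stay distinguishable.
pub fn kinds_for(media_type: &str) -> &'static [Kind] {
    table().get(media_type).map(Vec::as_slice).unwrap_or(&[])
}
```

With this shape, a hot path pays for building the table once; later calls are a hash lookup that returns a borrowed slice without allocating.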

More generally though, I'm extracting media-type classifications from multiple sources, and I'd like the file-type Kind categorization to be the canonical source.

Future thoughts: I'll be writing a parser for magic types so that they become native to file-format, making file-format a drop-in replacement for rust-magic.

The third step is to write a parser that can take in PRONOM DROID signature files (see https://www.nationalarchives.gov.uk/aboutapps/pronom/droid-signature-files.htm) and have file-format use that to determine file-types.

Anyhow, that was the primary reason behind the PR. If you think it's not useful, I am okay with you closing it. I can maintain a fork. :-)

mmalecot commented 1 year ago

Thanks, I understand better now! I'll think about the future file-format roadmap, but I'll take any ideas.

For now, I'm going to leave the PR open and not merge it because moving to a 1.70 MSRV is a bit too early. Keep working on your fork, I'll be keeping a close eye on it!

Thanks for sharing the PRONOM DROID database, it's very interesting because it's quite close to this crate's philosophy. On the other hand, the advantage of file-format is that it has its own database of file formats and signatures, which moreover doesn't need to be parsed because it's statically declared via Rust macros. The big difference lies in the media type and extension: file-format always guarantees an extension (generally the most widely used) and a media type (different from application/octet-stream).

One last point, file-format has been designed to have no default features, to have the bare minimum, with no dependencies. I think it would be a good idea to keep this principle for your feature.

If ever there are other things to propose, feel free to make other PRs!

Many thanks, please keep me posted!

ahmed-masud commented 1 year ago

> For now, I'm going to leave the PR open and not merge it because moving to a 1.70 MSRV is a bit too early. Keep working on your fork, I'll be keeping a close eye on it!

I agree that moving the MSRV to 1.70 is too early. I'll think about coming up with a no_std approach to caching for any future updates.
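One possible shape for a lookup that needs neither a runtime cache nor allocation is a statically declared table, sorted by media type and searched with binary search. The sketch below is an assumption about how that could look, not a design from this thread: `Kind` is a reduced stand-in, and `MEDIA_KINDS`, `kinds_for`, and the entries are hypothetical. It only uses slice and string operations that also exist in `core`.

```rust
/// Hypothetical stand-in for file-format's `Kind`; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Audio,
    Video,
}

/// Illustrative table; entries must stay sorted by media type,
/// otherwise the binary search below is not valid.
static MEDIA_KINDS: &[(&str, &[Kind])] = &[
    ("application/ogg", &[Kind::Audio, Kind::Video]),
    ("audio/mpeg", &[Kind::Audio]),
    ("video/mp4", &[Kind::Video]),
];

/// Looks up `media_type` without allocating or initializing anything at runtime.
/// An unknown media type yields an empty slice.
pub fn kinds_for(media_type: &str) -> &'static [Kind] {
    match MEDIA_KINDS.binary_search_by(|(key, _)| (*key).cmp(media_type)) {
        Ok(index) => MEDIA_KINDS[index].1,
        Err(_) => &[],
    }
}
```

The trade-off is that the sort order has to be maintained by hand or by a generator, in exchange for no dependency on newer standard-library types.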

> Thanks for sharing the PRONOM DROID database, it's very interesting because it's quite close to this crate's philosophy. On the other hand, the advantage of file-format is that it has its own database of file formats and signatures, which moreover doesn't need to be parsed because it's statically declared via Rust macros. The big difference lies in the media type and extension: file-format always guarantees an extension (generally the most widely used) and a media type (different from application/octet-stream).

That is what I really liked about file-format :-) which is why I incorporated it into my project.

> One last point, file-format has been designed to have no default features, to have the bare minimum, with no dependencies. I think it would be a good idea to keep this principle for your feature.

The way I was thinking about bringing in DROID was behind a feature-gate that would invoke a build.rs and directly generate the appropriate static file-format code before compiling.
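As a rough sketch of the code-generation half of that idea, the function below renders a list of signature entries as Rust source declaring a static table. Everything here is hypothetical: the `Signature` struct is a heavily simplified stand-in for a parsed DROID record, `render_table` and `SIGNATURES` are made-up names, and parsing the signature file itself is left out.

```rust
/// Hypothetical, simplified signature entry; real DROID records carry much more detail.
pub struct Signature {
    pub name: &'static str,
    pub media_type: &'static str,
    pub magic: &'static [u8],
}

/// Renders the entries as Rust source that declares a static table.
/// `{:?}` prints strings as quoted, escaped literals and bytes as a
/// decimal list, so the output can be compiled as-is.
pub fn render_table(signatures: &[Signature]) -> String {
    let mut out = String::from("pub static SIGNATURES: &[(&str, &str, &[u8])] = &[\n");
    for sig in signatures {
        out.push_str(&format!(
            "    ({:?}, {:?}, &{:?}),\n",
            sig.name, sig.media_type, sig.magic
        ));
    }
    out.push_str("];\n");
    out
}

// In an actual build.rs, `main` would run only when the (hypothetical)
// feature is enabled, which Cargo signals to build scripts through a
// `CARGO_FEATURE_<NAME>` environment variable. It would parse the signature
// file, call `render_table`, and write the result into the directory named
// by the `OUT_DIR` environment variable, for the crate to pull in with
// `include!`.
```

Keeping the generator behind a feature would leave the default build free of the extra step, in line with the crate's "no default features, no dependencies" principle.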

IMHO the DROID format seems well thought out and well designed. Also, since it's a funded project, it keeps up to date with modern file formats. I think it should really replace libmagic on *nix, or there should be a cross-compiler from DROID to the magic database.

> If ever there are other things to propose, feel free to make other PRs!
>
> Many thanks, please keep me posted!

Thank you :-) happy to share!

Warmest regards,