[Question] Extending mime-type detection.

mscdex / mmmagic

An async libmagic binding for node.js for detecting content types by data inspection

MIT License


[Question] Extending mime-type detection. #127

Closed jamespedid closed 6 years ago

jamespedid commented 6 years ago

Hello again,

I'm looking to use this library as part of an AWS Lambda job to verify that uploaded files are of the correct type.

I've succeeded in deploying a simple Lambda job that detects the MIME type of a file uploaded to it in some cases, but in other cases I receive more generic results, such as 'application/octet-stream' and 'application/zip', for various legacy and current MS Office formats. When I run these files through mmmagic locally, the current MS Office formats come back correctly. (The legacy ones still have trouble.)

From what I can tell, it's possible to customize libmagic to use custom detectors as part of a /etc/magic file, but I'm unsure how that works or if this library is capable of utilizing this. Does the library support this, and if so is it possible to customize these in a more lambda-friendly way?

mscdex commented 6 years ago

You can use your own magic file by passing the path to the constructor as shown in the documentation.
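As a rough sketch of what that looks like, the snippet below wraps the `Magic` constructor and the callback-style `detect()` call in a small helper. `createDetector` is a hypothetical name, the mmmagic module is passed in as an argument purely to keep the sketch self-contained, and the magic file path in the usage comment is only an assumption about where the file ends up in the deployment package.

```javascript
// Hypothetical helper (not part of mmmagic): builds a Promise-returning
// detector from a custom magic file.
//   mmm       - the module returned by require('mmmagic')
//   magicPath - path to a magic file (plain or compiled .mgc)
function createDetector(mmm, magicPath) {
  // First argument is the magic source, second is the flags;
  // MAGIC_MIME_TYPE asks for results such as 'application/pdf'.
  const magic = new mmm.Magic(magicPath, mmm.MAGIC_MIME_TYPE);

  // Wrap the callback-style detect(buffer, cb) in a Promise.
  return (buffer) =>
    new Promise((resolve, reject) => {
      magic.detect(buffer, (err, result) => (err ? reject(err) : resolve(result)));
    });
}

// Usage (assumes mmmagic is installed and the path below exists):
//   const detect = createDetector(require('mmmagic'), '/var/task/magic.mgc');
//   detect(fileBuffer).then((mime) => console.log(mime));
```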

jamespedid commented 6 years ago

OK, I've looked at the documentation and tried using the magic.mgc file from my local machine in my AWS Lambda function: I copied /usr/share/misc/magic.mgc into my Lambda zip distribution and referenced it from the Lambda function, but it didn't work.

So a few questions:

1.) Should detecting the various Microsoft formats work out of the box with the bundled magic.mgc file? If so, why am I seeing different results on Lambda? (Not necessarily asking you to investigate, but maybe just provide a guess?)

2.) If not, then why can my local setup use the library to detect the various MS Office formats correctly, while the results get mucked up once I move it to Lambda?

One last note: passing false for the magic file parameter leads to even worse results, where PDFs aren't detected at all and the files that came back as application/zip now come back as application/octet-stream. Is there anything you know of that I could look for that might interfere with this library?

mscdex commented 6 years ago

If you want to use your own magic file, the one requirement is that it be generated using the same version of libmagic/file (especially for compiled magic files). As of this writing, that means libmagic/file 5.32.

As to why you are getting different results, perhaps different versions of mmmagic are being used?

jamespedid commented 6 years ago

Do you know of any way I could tell whether the same version of libmagic is being used? The npm package is the same, so I would expect it to use the node bindings that are part of the mmmagic library. Shouldn't that always be the same version?

I'm perfectly fine using the magic file bundled with the library, but it doesn't appear to be working correctly on Lambda, and I'm unsure why.

EDIT: I'll note that the file reaches Lambda through API Gateway, which effectively delivers the file as an HTTP request. This doesn't seem like it should matter, given that mmmagic looks at the bytes of a file, but I'm mentioning it anyway.

jamespedid commented 6 years ago

OK, I figured out the issue. Lambda was receiving the files as text and not binary, and some additional configuration was needed to force the file to be processed correctly.

For those who are interested, I had to add a binary media type so that the file is delivered to Lambda as a base64-encoded string. Then, in node, I had to decode the request body from that base64 string into a Buffer before passing it to the magic library.
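A minimal sketch of that decoding step, assuming the Lambda proxy event shape in which `event.body` holds the payload and `event.isBase64Encoded` says how it was encoded. `bodyToBuffer` is a hypothetical helper name. The demonstration at the end shows why the text path breaks detection: bytes that are not valid UTF-8 are replaced during text decoding, so the buffer no longer matches the uploaded file.

```javascript
// Hypothetical helper: recover the raw upload bytes from the request event.
// Assumes event.body is a string and event.isBase64Encoded is a boolean.
function bodyToBuffer(event) {
  const encoding = event.isBase64Encoded ? 'base64' : 'utf8';
  return Buffer.from(event.body || '', encoding);
}

// A zip-style signature followed by two bytes that are invalid UTF-8.
const original = Buffer.from([0x50, 0x4b, 0x03, 0x04, 0xff, 0xfe]);

// Body delivered as base64: the bytes survive intact.
const viaBase64 = bodyToBuffer({
  body: original.toString('base64'),
  isBase64Encoded: true,
});

// Body delivered as text: invalid bytes were replaced along the way.
const viaText = bodyToBuffer({
  body: original.toString('utf8'),
  isBase64Encoded: false,
});

console.log(viaBase64.equals(original)); // true
console.log(viaText.equals(original));   // false
```

The buffer returned by `bodyToBuffer` is what would then be handed to the magic library's `detect()` call.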