Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1MB and enables precise file identification within milliseconds, even when running on a single CPU.
In an evaluation with over 1M files and over 100 content types (covering both binary and textual file formats), Magika achieves 99%+ precision and recall. Magika is used at scale to help improve Google users’ safety by routing Gmail, Drive, and Safe Browsing files to the proper security and content policy scanners. Read more in our research paper!
You can try Magika without installing anything by using our web demo, which runs locally in your browser!
Here is an example of what Magika command line output looks like:
For more context, you can read our initial announcement post on Google's OSS blog.
The command line client supports -r for recursively scanning a directory.
Three prediction modes are supported: high-confidence, medium-confidence, and best-guess.
For more details, see the documentation for the python package and for the js package (dev docs).
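One way to picture the three prediction modes is as different confidence cut-offs applied to the model's raw prediction before a final label is reported. The sketch below is illustrative only: the threshold values, the fallback label, and the function name choose_output_label are assumptions made up for this example, not Magika's actual logic or numbers.

```python
# Illustrative sketch only: every constant below is a made-up assumption,
# not a value taken from Magika.
MEDIUM_THRESHOLD = 0.5             # hypothetical single cut-off
HIGH_THRESHOLDS = {"python": 0.9}  # hypothetical per-type cut-offs
DEFAULT_HIGH_THRESHOLD = 0.95      # hypothetical cut-off for all other types
FALLBACK_LABEL = "unknown"         # hypothetical generic label


def choose_output_label(dl_label: str, score: float, mode: str) -> str:
    """Return the label to report for a raw prediction under a given mode."""
    if mode == "best-guess":
        # Always keep the top prediction, whatever its score.
        return dl_label
    if mode == "medium-confidence":
        threshold = MEDIUM_THRESHOLD
    elif mode == "high-confidence":
        threshold = HIGH_THRESHOLDS.get(dl_label, DEFAULT_HIGH_THRESHOLD)
    else:
        raise ValueError(f"unknown prediction mode: {mode!r}")
    # Below the cut-off, report a generic label instead of the raw guess.
    return dl_label if score >= threshold else FALLBACK_LABEL
```

With these made-up numbers, a "python" prediction scored 0.6 is kept in best-guess and medium-confidence modes and replaced by the fallback label in high-confidence mode.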
Magika is available as magika on PyPI:
$ pip install magika
To run Magika in Docker instead:
$ git clone https://github.com/google/magika
$ cd magika/
$ docker build -t magika .
$ docker run -it --rm -v "$(pwd)":/magika magika -r /magika/tests_data
Examples:
$ magika -r tests_data/
tests_data/README.md: Markdown document (text)
tests_data/basic/code.asm: Assembly (code)
tests_data/basic/code.c: C source (code)
tests_data/basic/code.css: CSS source (code)
tests_data/basic/code.js: JavaScript source (code)
tests_data/basic/code.py: Python source (code)
tests_data/basic/code.rs: Rust source (code)
...
tests_data/mitra/7-zip.7z: 7-zip archive data (archive)
tests_data/mitra/bmp.bmp: BMP image data (image)
tests_data/mitra/bzip2.bz2: bzip2 compressed data (archive)
tests_data/mitra/cab.cab: Microsoft Cabinet archive data (archive)
tests_data/mitra/elf.elf: ELF executable (executable)
tests_data/mitra/flac.flac: FLAC audio bitstream data (audio)
...
$ magika code.py --json
[
    {
        "path": "code.py",
        "dl": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        },
        "output": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
        }
    }
]
$ cat doc.ini | magika -
-: INI configuration file (text)
$ magika -h
Usage: magika [OPTIONS] [FILE]...
  Magika - Determine type of FILEs with deep-learning.
Options:
  -r, --recursive                 When passing this option, magika scans every
                                  file within directories, instead of
                                  outputting "directory"
  --json                          Output in JSON format.
  --jsonl                         Output in JSONL format.
  -i, --mime-type                 Output the MIME type instead of a verbose
                                  content type description.
  -l, --label                     Output a simple label instead of a verbose
                                  content type description. Use
                                  --list-output-content-types for the list of
                                  supported output.
  -c, --compatibility-mode        Compatibility mode: output is as close as
                                  possible to `file` and colors are disabled.
  -s, --output-score              Output the prediction score in addition to
                                  the content type.
  -m, --prediction-mode [best-guess|medium-confidence|high-confidence]
  --batch-size INTEGER            How many files to process in one batch.
  --no-dereference                This option causes symlinks not to be
                                  followed. By default, symlinks are
                                  dereferenced.
  --colors / --no-colors          Enable/disable use of colors.
  -v, --verbose                   Enable more verbose output.
  -vv, --debug                    Enable debug logging.
  --generate-report               Generate report useful when reporting
                                  feedback.
  --version                       Print the version and exit.
  --list-output-content-types     Show a list of supported content types.
  --model-dir DIRECTORY           Use a custom model.
  -h, --help                      Show this message and exit.
  Magika version: "0.5.0"
  Default model: "standard_v1"
  Send any feedback to magika-dev@google.com or via GitHub issues.
See the Python documentation for details.
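The --json output shown above is convenient to consume from scripts. The following is a minimal sketch, assuming the JSON is written to standard output and has the shape shown in the example. The function names and the trimmed SAMPLE string are made up for illustration, and run_magika_json needs the magika command on PATH.

```python
import json
import subprocess
from typing import Dict, Sequence, Tuple

# Trimmed copy of the --json example above (only the fields used here).
SAMPLE = """
[
    {
        "path": "code.py",
        "dl": {"ct_label": "python", "score": 0.9940916895866394},
        "output": {"ct_label": "python", "score": 0.9940916895866394}
    }
]
"""


def parse_magika_json(text: str) -> Dict[str, Tuple[str, float]]:
    """Map each path to the (ct_label, score) pair of its "output" entry."""
    entries = json.loads(text)
    return {
        entry["path"]: (entry["output"]["ct_label"], entry["output"]["score"])
        for entry in entries
    }


def run_magika_json(paths: Sequence[str]) -> Dict[str, Tuple[str, float]]:
    """Run `magika --json` on the given paths and parse what it prints.

    Assumption: the JSON document is printed to standard output.
    """
    completed = subprocess.run(
        ["magika", "--json", *paths],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_magika_json(completed.stdout)
```

parse_magika_json(SAMPLE) returns a dictionary keyed by path, and run_magika_json(["code.py"]) would do the same for real files.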
Examples:
>>> from magika import Magika
>>> m = Magika()
>>> res = m.identify_bytes(b"# Example\nThis is an example of markdown!")
>>> print(res.output.ct_label)
markdown
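Building on the snippet above, a small helper can bucket several in-memory samples by detected content type. This sketch relies only on identify_bytes and res.output.ct_label, as used above. The names group_by_label and demo are made up for this example, and demo() needs the magika package installed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_label(identifier, samples: Iterable[bytes]) -> Dict[str, List[bytes]]:
    """Group byte samples by the ct_label that `identifier` reports for them.

    `identifier` is any object with an identify_bytes(data) method whose
    result exposes .output.ct_label, such as a Magika instance.
    """
    groups: Dict[str, List[bytes]] = defaultdict(list)
    for data in samples:
        result = identifier.identify_bytes(data)
        groups[result.output.ct_label].append(data)
    return dict(groups)


def demo() -> None:
    """Run the helper against a real Magika instance."""
    # Imported lazily so the helper itself has no hard dependency on magika.
    from magika import Magika

    samples = [
        b"# Example\nThis is an example of markdown!",
        b"[section]\nkey = value\n",
    ]
    for label, items in group_by_label(Magika(), samples).items():
        print(f"{label}: {len(items)} sample(s)")
```

Passing the identifier in as an argument keeps a single Magika instance reusable across calls and makes the helper easy to exercise with a stub in tests.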
See the Python documentation for details.
We also provide Magika as an experimental package for people interested in using it in a web app. Note that the Magika JS implementation is significantly slower: expect to spend 100ms+ per file.
See the JS documentation for details.
We use poetry for development and packaging:
$ git clone https://github.com/google/magika
$ cd magika/python
$ poetry shell
$ poetry install
$ magika -r ../tests_data
To run the tests:
$ cd magika/python
$ poetry shell
$ pytest tests/
Magika significantly improves over the state of the art, but there's always room for improvement! More work can be done to increase detection accuracy, support additional content types, add bindings for more languages, etc.
This initial release is not targeting polyglot detection, and we're looking forward to seeing adversarial examples from the community. We would also love to hear from the community about problems encountered, misdetections, feature requests, the need for support for additional content types, etc.
Check our open GitHub issues to see what is on our roadmap and please report misdetections or feature requests by either opening GitHub issues (preferred) or by emailing us at magika-dev@google.com.
When reporting misdetections, you may want to use $ magika --generate-report <path> to generate a report with debug information, which you can include in your GitHub issue.
NOTE: Do NOT send reports about files that may contain PII; the report contains a small part of the file content!
See CONTRIBUTING.md for details.
We have collected a number of FAQs here.
We describe how we developed Magika and the choices we made in our research paper.
If you use this software for your research, please cite it as:
@misc{magika,
    title={{Magika: AI-Powered Content-Type Detection}},
    author={{Fratantonio, Yanick and Invernizzi, Luca and Farah, Loua and Kurt, Thomas and Zhang, Marina and Albertini, Ange and Galilee, Francois and Metitieri, Giancarlo and Cretin, Julien and Petit-Bianco, Alexandre and Tao, David and Bursztein, Elie}},
    year={2024},
    eprint={2409.13768},
    archivePrefix={arXiv},
    primaryClass={cs.CR},
    url={https://arxiv.org/abs/2409.13768},
}
Please contact us directly at magika-dev@google.com
Apache 2.0; see LICENSE for details.
This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.