google / magika

Detect file content types with deep learning
https://google.github.io/magika/
Apache License 2.0
7.69k stars, 402 forks

Hello from the CCCS! 🍁 #56

Open cccs-kevin opened 6 months ago

cccs-kevin commented 6 months ago

We at the Assemblyline project perform our own file identification to ensure files are routed correctly to the corresponding file analysis modules. That is why the magika project is very interesting to us.

We have a set of files used for unit testing whose file types we are confident* in. We ran that set against the magika tool and found some discrepancies: see the attached CSV.
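A comparison like the one described above could be sketched roughly as follows. This is only an illustration, not the script that produced the attachment: the column names `sha256`, `expected_type`, and `detected_type` are assumptions (the layout of the attached CSV is not shown in this thread), and the sample rows are made-up placeholders rather than real hashes or results.

```python
import csv
import io
from collections import Counter


def find_discrepancies(csv_text):
    """Return the rows whose expected and detected types differ.

    Assumes hypothetical columns: sha256, expected_type, detected_type.
    The comparison ignores case and surrounding whitespace.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if row["expected_type"].strip().lower()
        != row["detected_type"].strip().lower()
    ]


def summarize(mismatches):
    """Count how often each (expected, detected) pair occurs."""
    return Counter(
        (row["expected_type"], row["detected_type"]) for row in mismatches
    )


# Placeholder data for illustration only; not taken from the attached CSV.
SAMPLE = """sha256,expected_type,detected_type
aaaa,pdf,pdf
bbbb,javascript,txt
cccc,javascript,txt
dddd,powershell,batch
"""

if __name__ == "__main__":
    mismatches = find_discrepancies(SAMPLE)
    for (expected, detected), count in summarize(mismatches).most_common():
        print(f"expected={expected} detected={detected} count={count}")
```

Grouping mismatches by (expected, detected) pair makes it easier to see whether the disagreements cluster around a few content types or are spread evenly.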

All of the SHA256 hashes can be found on VirusTotal, and we would love to collaborate (join our Discord!) to improve magika to the point where we can integrate it into Assemblyline :)

AL_MAGIKA_COMP_revised.csv

Cheers, 🇨🇦

invernizzi commented 6 months ago

Much appreciated! We can add these to our golden dataset and track improvements to the model. Can I assume these are MIT licensed like the rest of Assemblyline?

Btw - I love Assemblyline :)

reyammer commented 6 months ago

Indeed, thank you for taking the time! This is extremely useful. We need to settle a bunch of things after this initial release, and we'll then definitely follow up :-) Thanks again!

cccs-rs commented 6 months ago

For sure! If there's anything we can do to help improve the project, feel free to let us know!

cccs-kevin commented 6 months ago

> Much appreciated! We can add these to our golden dataset and track improvements to the model. Can I assume these are MIT licensed like the rest of Assemblyline?
>
> Btw - I love Assemblyline :)

Assemblyline itself is MIT-licensed, but the files behind the hashes in that list should not be assumed to be MIT-licensed. They are not owned by Assemblyline; they are just files found on VirusTotal :)