Add more "basic" tests samples to cover supported content types

reyammer commented 2 weeks ago

The new model "standard_v2_0" supports 200+ content types: https://github.com/google/magika/tree/main/assets/models/standard_v2_0/README.md

Ideally, we would have at least one "basic sample" for each of the supported content types (see /tests_data/basic/*).

This issue acts as a call to action: external help is very welcome!

Important aspects to keep in mind:

Content types for which we have no samples yet should be prioritized. Among these, prioritize more common content types rather than niche ones.
The "basic" test samples (in tests_data/basic/<content_type>/*) are supposed to be "easy to recognize". In other words, the goal for these samples is to check that the model does a reasonable job with clear-cut samples, rather than corner cases.
It's OK to group a bunch of test cases in a single PR.
The PR should state the origin of each sample.
The samples should NOT be taken from existing projects / online resources (in these settings, it would be very challenging to properly document the origin of these files); they should be manually written/created by the PR author.
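
To see which content types still have no basic sample, something like the sketch below could help. It assumes the tests_data/basic/<content_type>/* layout described above. The function name `missing_content_types` and the list of labels passed to it are hypothetical; the real list of supported content types would have to be taken from the model README.

```python
from pathlib import Path


def missing_content_types(supported, basic_dir):
    """Return the supported content types that have no sample file
    under basic_dir/<content_type>/ (hypothetical helper, not part of magika)."""
    basic_dir = Path(basic_dir)
    covered = set()
    if basic_dir.is_dir():
        for sub in basic_dir.iterdir():
            # A content type counts as covered only if its directory
            # contains at least one regular file.
            if sub.is_dir() and any(p.is_file() for p in sub.iterdir()):
                covered.add(sub.name)
    return sorted(set(supported) - covered)
```

For example, `missing_content_types(["pickle", "gif"], "tests_data/basic")` returns the labels from that list that have no sample file yet, sorted alphabetically.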

mamamia96 commented 2 weeks ago

I'd like to add a handful of basic tests for:

pickle
powershell
ttf
gif

reyammer commented 2 weeks ago

These would be very welcome! As indicated in the issue, please include a description of how these files were created (especially for the binary ones, such as pickle). As an example of how we created some of the test cases: we created a new Google Doc, then used "export as" to produce various formats. Thanks!
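
For a binary sample such as pickle, one way to keep the origin easy to document is to generate the file from a short script and paste that script into the PR description. The following is only a sketch: the function name `write_pickle_sample`, the payload, and the choice of protocol 4 are illustrative assumptions, not project requirements.

```python
import pickle
from pathlib import Path


def write_pickle_sample(path):
    """Write a small, hand-written object to `path` as a pickle file
    and return the object (hypothetical helper for creating a sample)."""
    # Payload written by hand for this sample; not taken from any existing project.
    data = {"name": "example", "values": [1, 2, 3], "nested": {"ok": True}}
    with open(Path(path), "wb") as f:
        # The protocol is pinned so that rerunning the script uses the same
        # pickle format regardless of the interpreter's default protocol.
        pickle.dump(data, f, protocol=4)
    return data
```

Calling `write_pickle_sample("sample.pickle")` (the file name is an assumption) writes the file; the PR description can then state that the sample was produced by this script, together with the Python version used.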

mamamia96 commented 2 weeks ago

Where should I include my description of how I created the files?

mamamia96 commented 2 weeks ago

Where should I include my description of how I created the files?

Sorry, I reread the issue and now see that it should be included in the PR.

Add more "basic" tests samples to cover supported content types #662