google / magika

Detect file content types with deep learning
https://google.github.io/magika/
Apache License 2.0

Define a more manageable workflow to keep track of requested new content types #152

Open reyammer opened 4 months ago

reyammer commented 4 months ago

There are many content types that Magika does not support, and GitHub issues about new requests are piling up (thank you @MichaelHinrichs & others!). While these issues are very valuable, especially when they contain information about the file format itself and pointers on where we could find samples, they are making it quite difficult for newcomers to find open issues.

To address this, here is what I have in mind: 1) keep asking the community to open GitHub issues as a first step, so that we can have an initial discussion; 2) maintain a low-overhead, sort-of knowledge base (KB) as part of the Magika docs to keep track of the currently supported and unsupported content types. This would give us a clean overview of what's going on, with the relevant information properly systematized.

This is what the KB could look like: 1) a main "supported content types" markdown file with tables of content types: first we list the currently supported ones, then we have a table with the currently unsupported ones. 2) For each of the currently unsupported content types, we have: a) tags to indicate metadata, e.g., their importance (supporting Dockerfile is much more important than supporting some very niche configuration file of some old game), b) optionally, a link to an additional markdown file for that specific content type (which would contain all the useful info provided in the GitHub issue).
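
To make the proposed layout concrete, here is a minimal sketch of how such a KB page could be generated from a list of entries. Everything in it is an assumption for illustration, not an existing Magika format: the `ContentTypeEntry` fields, the `importance:high` tag syntax, the `content_types/...` file paths, and the placeholder entry names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentTypeEntry:
    """One row of the hypothetical KB (field names are illustrative)."""
    name: str
    supported: bool
    tags: List[str] = field(default_factory=list)  # e.g. importance metadata
    details_file: Optional[str] = None  # optional per-type markdown file


def render_table(entries: List[ContentTypeEntry]) -> str:
    """Render entries as a markdown table, sorted by name."""
    lines = ["| Content type | Tags | Details |", "| --- | --- | --- |"]
    for e in sorted(entries, key=lambda e: e.name.lower()):
        tags = ", ".join(e.tags) if e.tags else "-"
        link = f"[notes]({e.details_file})" if e.details_file else "-"
        lines.append(f"| {e.name} | {tags} | {link} |")
    return "\n".join(lines)


def render_kb(entries: List[ContentTypeEntry]) -> str:
    """Supported types first, then the currently unsupported ones."""
    supported = [e for e in entries if e.supported]
    unsupported = [e for e in entries if not e.supported]
    return "\n\n".join([
        "# Supported content types",
        render_table(supported),
        "# Currently unsupported content types",
        render_table(unsupported),
    ])


# Placeholder data only; not a statement of what Magika supports today.
EXAMPLE_ENTRIES = [
    ContentTypeEntry("ExampleSupportedType", supported=True),
    ContentTypeEntry(
        "Dockerfile",
        supported=False,
        tags=["importance:high"],
        details_file="content_types/dockerfile.md",
    ),
    ContentTypeEntry(
        "OldGameConfig",
        supported=False,
        tags=["importance:low", "niche"],
    ),
]

if __name__ == "__main__":
    print(render_kb(EXAMPLE_ENTRIES))
```

Keeping the entries as data and rendering the markdown from them would keep the overhead low: importing a closed request into the KB means adding one entry plus, optionally, one notes file.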

For the moment, I'll close the new content type requests for which we have all the info and add a new label, "content type to import in KB", so that we can keep track of them. Once we define the format of the KB (or find an alternative solution), we will go through them and import everything.

Thoughts?

@invernizzi @MichaelHinrichs?

MichaelHinrichs commented 4 months ago

Sorry, if my issues were overwhelming yesterday. Autism makes it hard for me to know when to stop. Also I don't have a job, and have nothing better to do all day.

MichaelHinrichs commented 4 months ago

I would be happy to implement my issues myself, if I had some instruction on how to do it, with all the necessary file changes and additions.

reyammer commented 4 months ago

No worries at all, having the list and context is super helpful :-) I just wanted to provide more context: the fact that I'm closing them does not mean they are not helpful! I'll keep tagging them as before and I'll get to them in the next iterations. Thanks again!

dbohdan commented 4 months ago

My thought when I read this issue is that, besides the KB, you could create a separate GitHub repository like "magika-content-types" just to use its issue tracker for content type requests. Users could file requests there, and you could discuss and track them there. You would then not have to close new-content-type-request issues quickly to keep the overview of other issues clear. (Which I agree is important!)

MichaelHinrichs commented 4 months ago

How about opening a GitHub project?

reyammer commented 4 months ago

Hey thanks for chipping in.

Some thoughts:

I'd still wait for a couple of days to see how things settle in.

BTW, thanks @MichaelHinrichs for reporting so many types :) I've looked at them more closely, and it seems that many of them are quite niche or very fine-grained. While I don't want to rule out that one day we'll find a way to support them, it's unlikely in the short term. At the moment, what we are looking for is mostly content types that could be useful for large-scale production systems (as in "popular content types"), rather than niche file formats, e.g., "quake saved game file format". Another aspect that is a bit out of scope at the moment is tracking very small variations of the same overall content type. E.g., let's say there are "quake saved game v1", v2 and v3... we'd likely have just "quake saved game".

So, for now we are mostly interested in very high-level content types. With this clarification, I would expect the number of new issues created on this topic to be significantly lower, but let's see...

MichaelHinrichs commented 4 months ago

I made an issue for save files in RPG Maker, and a few versions of BSP maps from Quake and Valve games, but never a Quake save.

MichaelHinrichs commented 4 months ago

The modding/mapping communities for Doom, Quake, and the Source engine are all huge and dedicated.

dbohdan commented 4 months ago

I think Discussions plus Projects could work as a sort of parallel issue tracker for new content types. As a bonus, the voting feature on Discussions could help estimate the demand for a given content type. (Discussions are not enabled for this repository right now.)

MichaelHinrichs commented 2 months ago

.

reyammer commented 1 month ago

Quick update on this: Discussions would likely be a better workflow, but I want to signal that, although we may not seem super responsive and the GitHub issues may look chaotic, the current system is working quite well: we are working on a new major release, and many of the new content types we are planning to support come from the many GitHub issues opened by @MichaelHinrichs! Our plan is to wrap up this new release and improve the "submit new content type request" workflow after that, and yes, opening up the Discussions tab and documenting what we look for is likely a good idea.