kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

An issue to coordinate file formats descriptions creation #134

Open KOLANICH opened 7 years ago

KOLANICH commented 7 years ago

Converters needed:

GreyCat commented 7 years ago

As far as I understand, .doc, .xls and .msi are all variants of Compound File Binary Format, and I had a preliminary format description of it uploaded. I'm not sure whether there's work for KS in digging deeper into them, or whether one should just build higher-level accessors in particular programming languages (akin to XML, JSON, protobuf and similar formats)?

KOLANICH commented 7 years ago

Don't know, haven't touched them yet. When I touch, I'll decide.

sem-geologist commented 6 years ago

Hello, I think the Edax spc format is already partly implemented in Python, and that could be helpful for a full RE. See the io_plugin in the hyperspy project: https://github.com/hyperspy/hyperspy/blob/RELEASE_next_minor/hyperspy/io_plugins/edax.py

KOLANICH commented 6 years ago

@sem-geologist, I'm not working on edax spc (or any other ksy descriptions) right now (I'm a bit busy and will be busy during the New Year holidays), but I have a very unfinished, non-working and as yet unpublished ksy file. Do you need it?

sem-geologist commented 6 years ago

I just wanted to point out that there is a known specification of these edax spc files which, if you wish, could be reused (as it is open source) in defining a ksy for that file type. Personally I don't work with edax, as I have no hardware. I did RE the Bruker format (which is also present in the hyperspy project). It took me nearly a year using classical hex editors: I had to RE the single file system (the base of that format) and then the internal binary format with crazy Delphi array packing (LE, switching nibbles)... In the end my Python + Cython reader is able to read the file a few times, or even a few orders of magnitude (when compressed), faster than the original software. I wonder whether Cython would be a good candidate as another target language for kaitai_struct.

estan commented 6 years ago

@KOLANICH I'm afraid "probably working on it" is much too strong. I was merely floating the idea to myself, since I was in need of a way to dump the structure of HDF5 files that is better than h5debug from HDF Group, and came across Kaitai. No guarantee that I'll have time to make the tool, nor that I will use Kaitai. I'm OK with being listed here, but "probably working" should be changed to "might work".

estan commented 6 years ago

@KOLANICH I should also mention that we do not need this dumping tool to be complete, only to support certain structures in the HDF5 file. So even if I do get around to doing it, it would probably be incomplete (e.g. support only Version 0 superblock, Version 1 B-trees, only certain Object Header Messages and so on).

KOLANICH commented 6 years ago

So even if I do get around to doing it, it would probably be incomplete (e.g. support only Version 0 superblock, Version 1 B-trees, only certain Object Header Messages and so on).

Completely OK. Sharing even incomplete work may save time for those who need more.

estan commented 6 years ago

Completely OK. Sharing even incomplete work may save time for those who need more.

Absolutely, even if this is something I'd do for work, I'm sure my boss wouldn't mind us sharing it under a permissive license. He's encouraging us to do more of that, and an HDF5 parser is one of those things which we absolutely could share. I'll let people know on this issue if I get started on it.

GreyCat commented 6 years ago

@KOLANICH @koczkatamas I wonder if we can raise visibility of this list of WIP formats.

My suggestion is that we create a special page on formats.kaitai.io that lists all these work-in-progress formats in a concise table. To generate that table, we could start a YAML file in the formats repo, something like this:

microsoft_cntk:
  title: Microsoft CNTK
  author: KOLANICH
  doc: https://github.com/Microsoft/CNTK  
  git: https://github.com/KOLANICH/kaitai_struct_formats/tree/CNTK
  blocked-by: https://github.com/kaitai-io/kaitai_struct/issues/12345

Minimally, only title and author are needed to let everyone know that a certain person is working on some project. All the fields (except for title) would allow either a single string value or an array of strings, so one can have multiple authors, URLs, git repos, etc.
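
A minimal Python sketch of the "single string or string array" rule described above. It assumes the YAML file has already been parsed into plain dicts (for example by a YAML library); the function name `normalize_entry` and the error messages are hypothetical and not part of any existing Kaitai tooling.

```python
from typing import Any, Dict, List

# The two fields named as the minimum for an entry.
REQUIRED_FIELDS = ("title", "author")


def normalize_entry(key: str, entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one WIP-format entry in which every field
    except `title` is a list of strings."""
    for field in REQUIRED_FIELDS:
        if field not in entry:
            raise ValueError(f"{key}: missing required field '{field}'")
    if not isinstance(entry["title"], str):
        raise ValueError(f"{key}: 'title' must be a single string")

    normalized: Dict[str, Any] = {"title": entry["title"]}
    for field, value in entry.items():
        if field == "title":
            continue
        # A bare string is treated as a one-element list.
        values: List[Any] = value if isinstance(value, list) else [value]
        if not all(isinstance(v, str) for v in values):
            raise ValueError(
                f"{key}: '{field}' must be a string or a list of strings"
            )
        normalized[field] = values
    return normalized
```

Normalizing up front means later steps (table generation, search) only ever deal with lists, whichever form the author of the entry chose.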

KOLANICH commented 6 years ago

IMHO, for WiP formats https://github.com/kaitai-io/kaitai_struct_formats/network should work fine. I use this page mainly as a wishlist for the things I have not started. I put links to sources of info there. Sometimes I go to this list, pick some format done by no one and start working on it. Once I have started working on it, I mark the format in the list as WiP rather than deleting the item. But the main list of WiP formats is the branches of my repo.

So

So to create the list there should be a script

It can also scan GitHub repos and Gists to find formats outside the ksf repo.

koczkatamas commented 6 years ago

I am creating .ksy files in the Web IDE without cloning the formats repo, so the workflow described by @KOLANICH won't fully meet my needs.

The YAML file looks okay to me though.

KOLANICH commented 6 years ago

I am creating .ksy files in the Web IDE without cloning the formats repo.

Does the Web IDE have the capability to upload and save files on the server side?

koczkatamas commented 6 years ago

Does the Web IDE have the capability to upload and save files on the server side?

The current one (v1) does not, but I planned GitHub integration for v2. Sadly that won't happen in the near future :(

GreyCat commented 6 years ago

@KOLANICH This is actually a good idea, i.e. I hadn't realized that we had so many WiP formats going on until you pointed it out. I want to keep things simple and compatible with all possible development efforts, i.e. not only git, not only GitHub, etc. How about we do it in 2 steps:

  1. YAML-to-HTML generation script
  2. Script to fetch GitHub network info and add that information into YAML
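
Step 1 could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the entries are taken as already-parsed dicts (loading the YAML itself would need a YAML library), the column list mirrors the fields of the example entry earlier in the thread, and the names `render_table`, `render_cell` and `as_list` are made up for this sketch.

```python
from html import escape
from typing import Dict, List, Union

Entry = Dict[str, Union[str, List[str]]]

# Columns taken from the example entry in this thread.
COLUMNS = ("title", "author", "doc", "git", "blocked-by")


def as_list(value: Union[str, List[str], None]) -> List[str]:
    """Accept a missing value, a single string, or a list of strings."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def render_cell(values: List[str]) -> str:
    """Render one table cell; URLs become links, everything is escaped."""
    parts = []
    for v in values:
        text = escape(v)
        if v.startswith(("http://", "https://")):
            parts.append(f'<a href="{text}">{text}</a>')
        else:
            parts.append(text)
    return "<br>".join(parts)


def render_table(entries: Dict[str, Entry]) -> str:
    """Render all entries as one HTML table, sorted by entry key."""
    header = "".join(f"<th>{escape(c)}</th>" for c in COLUMNS)
    rows = [f"<tr>{header}</tr>"]
    for key in sorted(entries):
        entry = entries[key]
        cells = "".join(
            f"<td>{render_cell(as_list(entry.get(c)))}</td>" for c in COLUMNS
        )
        rows.append(f"<tr>{cells}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Step 2 would then only have to produce more entries in the same shape; the rendering does not need to know where an entry came from.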

KOLANICH commented 6 years ago

YAML-to-HTML generation script

OK, YAML is only for testing purposes, I guess, because no one will maintain such a list in YAML format. As an internal cache, SQLite may be more suitable. BTW, are you merging my PRs (except gettext_mo, I realized I have forgotten to upload the new version of it)?
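
To make the SQLite idea concrete, here is a small sketch using Python's standard `sqlite3` module. The table name `wip_formats`, its columns and the helper names are assumptions chosen for illustration; nothing like this exists in the Kaitai repos as far as this thread shows.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple

# Hypothetical schema: one row per WIP format, keyed like the YAML entries.
SCHEMA = """
CREATE TABLE IF NOT EXISTS wip_formats (
    key    TEXT PRIMARY KEY,
    title  TEXT NOT NULL,
    author TEXT NOT NULL,
    git    TEXT
)
"""

Row = Tuple[str, str, str, Optional[str]]


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def upsert_formats(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Insert rows, replacing any existing row with the same key."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO wip_formats (key, title, author, git) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


def formats_by_author(conn: sqlite3.Connection, author: str) -> List[str]:
    """Return the titles of all cached formats by one author."""
    cur = conn.execute(
        "SELECT title FROM wip_formats WHERE author = ? ORDER BY title",
        (author,),
    )
    return [row[0] for row in cur]
```

Because the key is the primary key, re-running a scan simply overwrites stale rows instead of duplicating them.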

GreyCat commented 6 years ago

BTW, are you merging my PRs (except gettext_mo, I realized I have forgotten to upload the new version of it)?

I will.

dgelessus commented 5 years ago

This is a very useful list! Until it gets its own place on the website, maybe this issue should be linked in the README of kaitai_struct_formats? Or even better in a CONTRIBUTING.md, so GitHub notifies users about it when making a PR.

ildar commented 5 years ago

@KOLANICH, I'm working on the Bluetooth part of network/pcap. Has anyone done that, or is anyone doing it?

KOLANICH commented 5 years ago

@ildar, I have added you to the list. Could you post here some info, such as a link to your WiP repo?

You also may find https://gitlab.com/KOLANICH/USBPcapOdinDumper useful. It can be used as a boilerplate (it is built upon my framework for writing stacked pipelines) for reverse engineering different protocols for different devices.

GreyCat commented 5 years ago

@KOLANICH Your impressive work list in this issue recently received a lot of attention. I wonder if we could move it somewhere, so people could update it more easily?

Any ideas? GH wiki? Git repo + markdown file? Markdown + subpage at formats.kaitai.io?

KOLANICH commented 5 years ago

@GreyCat #129 ?

dgelessus commented 5 years ago

I think a GitHub wiki page on the kaitai_struct_formats repo would be the best option. That way others can easily add links to their implementations or additional reference material for the listed formats, without having to go through the process of making a PR for every small addition.

GreyCat commented 5 years ago

Both are kind of tempting. The issues proposal might indeed bring more value in the long term, as we'll have some place to keep history on every format, not just a big pile of text.

@KOLANICH, are you up to converting your list here to issues? It would probably be a lot of manual work?..

dgelessus commented 5 years ago

Hm, interesting. On one hand individual issues would be better organized, and you have a separate discussion space for each format (which also lets people watch/mute individual issues about the formats they care/don't care about). On the other hand, a wiki page gives a better overview and is more flexible to reorganize (sometimes you have related formats where it's not clear right away if they should count as one format or separate ones...), and anyone can edit it freely (as opposed to issues, where anyone can comment, but the top post can only be edited by the author or repo admins).

All of the suggested forms would keep an edit history. The difference is that a wiki page or Markdown file has a more classical file/diff history, whereas issues have both a comment history and an edit history for the individual comments (but it's likely only relevant for the top comment).

For the file/wiki approach it might be a good idea to split the list up into separate files/pages by topic, that way it's organized better and the history is cleaner. On the other hand, categorizing formats is an issue of its own (#572)...

KOLANICH commented 5 years ago

are you up to converting your list here to issues?

Yes I am. Though don't expect it to happen immediately. And in fact it doesn't look like too much work.

The good side of issues is not only the space for discussion, but also that

  1. it is possible to assign an implementer to them
  2. they can be promoted to PRs and/or closed by PRs.

On the other hand, in the list everything is visible on a single page. Issues would be spread over multiple pages with large GUI overhead. So, if we move the list to issues, we need a script aggregating the list in real time, using the issues webhook.
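
A rough sketch of such an aggregating script, in Python. It only covers the bookkeeping: it assumes some web server already receives GitHub `issues` webhook deliveries and hands the decoded JSON payload to `apply_issue_event`. The function names and the one-line output format are invented for this sketch; the payload fields used (`action`, `issue.number`, `issue.title`, `issue.state`, `issue.html_url`, `issue.labels[].name`) are the ones GitHub sends for issue events.

```python
from typing import Any, Dict

# In-memory index: issue number -> the few fields needed for the overview.
Index = Dict[int, Dict[str, Any]]


def apply_issue_event(index: Index, payload: Dict[str, Any]) -> None:
    """Update the index from one decoded `issues` webhook payload."""
    issue = payload["issue"]
    number = issue["number"]
    # Issues that left the repo disappear from the overview.
    if payload.get("action") in ("deleted", "transferred"):
        index.pop(number, None)
        return
    index[number] = {
        "title": issue["title"],
        "state": issue["state"],
        "url": issue["html_url"],
        "labels": sorted(label["name"] for label in issue.get("labels", [])),
    }


def render_list(index: Index) -> str:
    """Render the whole index as one plain-text list, one issue per line."""
    lines = []
    for number in sorted(index):
        item = index[number]
        mark = "x" if item["state"] == "closed" else " "
        lines.append(f"[{mark}] #{number} {item['title']} ({item['url']})")
    return "\n".join(lines)
```

Every event simply overwrites the stored entry for that issue, so edits, label changes, closing and reopening all need no special handling.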

Some ideas about organization. As you see, some formats are grouped by the container formats they are based on. So we need labels for the container formats.

dgelessus commented 5 years ago

If we choose to use issues, IMHO it wouldn't make much sense to mirror them to the website - it would be too difficult to reliably collect the relevant information about each format. The description and reference links would be in free text form in the issue body and comments, so the only information you could extract reliably would be the issue title and tags - which is what the normal GitHub web interface already displays (and provides search and filters for).

KOLANICH commented 5 years ago

I would also like to enforce a template on the issues in that repo. I mean every issue must request a spec and contain some useful info. And the script should move all issues not matching the template out of the repo (for example an hour after the first violation of the template, if it has not been fixed). It may even make sense to make all the issues contain a YAML block with meta, doc and doc-ref boilerplate. So the issues would be both machine-readable and human-readable.
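
As an illustration of what "matching the template" could mean, the Python sketch below looks for the first fenced YAML block in an issue body and reports which of the top-level keys `meta`, `doc` and `doc-ref` are missing. It is deliberately shallow: it uses regular expressions rather than a YAML parser, so it does not prove the block is valid KSY. The function names are hypothetical.

```python
import re
from typing import List, Optional

FENCE = "`" * 3  # a Markdown code fence, built this way to keep this snippet readable
YAML_BLOCK = re.compile(
    re.escape(FENCE) + r"ya?ml[ \t]*\n(.*?)\n" + re.escape(FENCE),
    re.DOTALL,
)
REQUIRED_KEYS = ("meta", "doc", "doc-ref")


def first_yaml_block(body: str) -> Optional[str]:
    """Return the contents of the first fenced yaml/yml block, if any."""
    match = YAML_BLOCK.search(body)
    return match.group(1) if match else None


def missing_keys(body: str) -> List[str]:
    """List the required top-level keys that the issue body lacks."""
    block = first_yaml_block(body)
    if block is None:
        return list(REQUIRED_KEYS)
    # Top-level keys start in column 0; nested keys are indented.
    top_level = set(re.findall(r"^([A-Za-z][\w-]*):", block, re.MULTILINE))
    return [key for key in REQUIRED_KEYS if key not in top_level]
```

An empty result means the issue passes this basic check; a non-empty one gives the bot (or a human) something specific to ask the author for.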

The problem here is that it would require giving a bot moderator permissions in the repo, which also means giving it access to the code, which would mean that we would have to have a separate repo for the issues, which would cause problems with PR-issue interaction.

GreyCat commented 5 years ago

@KOLANICH, can I actually ask you to pioneer this effort, i.e. creation of issue template and general push towards unification of format descriptions? Making all issues start with a YAML meta block sounds like a very good starting point.

I'll enable issues now, so we can start moving things there.

GreyCat commented 5 years ago

Ok, issues are enabled again: https://github.com/kaitai-io/kaitai_struct_formats/issues

With current GH transfer issue functionality, it still looks like a pretty safe solution anyway.

KOLANICH commented 5 years ago

@GreyCat, could you clean old issues from that repo?

GreyCat commented 5 years ago

I believe we could start with some fixes to CONTRIBUTING.md and maybe start an issue template for something like "new format proposal". Also, please feel free to suggest any better texts to be added to https://formats.kaitai.io/ so it will be clear to outsiders what they should start with.

@GreyCat, could you clean old issues from that repo?

I don't see any issues in that repo right now. We have tons of PRs, though :(

dgelessus commented 5 years ago

I don't think enforcing a YAML block in all issues would be realistic. Sure, GitHub has issue templates now, but you'd still get many users who don't follow it properly, or who fill it out incorrectly by accident. It's also not a good solution to automatically move away all issues that don't follow the template (that's a great way to discourage people who might be doing good work and simply made a mistake with the template), it would be better to have the script add a label for that instead.

I do agree though that there should be some sort of guidelines/requirements for the formats. Something like:

KOLANICH commented 5 years ago

I don't see any issues in that repo right now.

They are closed, but they are in that repo.

Also we need a bot assigning participants on their requests.

KOLANICH commented 5 years ago

It's also not a good solution to automatically move away all issues that don't follow the template (that's a great way to discourage people who might be doing good work and simply made a mistake with the template), it would be better to have the script add a label for that instead.

That's why a timeout is proposed. User adds an issue -> within a few seconds a bot visits and checks -> if it doesn't match the template, the bot creates a comment mentioning the issue author and asking them to fix the template -> in the case of inactivity the bot deletes the message and recreates it, which causes the mentioned person to get a notification -> if the template is not fixed within the deadline, it is removed from the repo.

There is a drawback - it would flood the subscribers with useless messages.
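
The proposed flow can be written down as a small decision function, sketched here in Python. The one-hour deadline comes from the earlier suggestion in this thread; the 15-minute reminder interval, the `Action` names and the function signature are assumptions made purely for illustration. The sketch only decides what to do and does not talk to GitHub.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"                        # nothing to do right now
    POST_REMINDER = "post_reminder"      # first comment mentioning the author
    REPOST_REMINDER = "repost_reminder"  # delete and recreate the comment
    MOVE_OUT = "move_out"                # deadline passed, template still broken


def next_action(
    matches_template: bool,
    created_at: datetime,
    last_reminder_at: Optional[datetime],
    now: datetime,
    deadline: timedelta = timedelta(hours=1),
    remind_every: timedelta = timedelta(minutes=15),  # assumed interval
) -> Action:
    """Decide the bot's next step for one issue."""
    if matches_template:
        return Action.NONE
    if now - created_at >= deadline:
        return Action.MOVE_OUT
    if last_reminder_at is None:
        return Action.POST_REMINDER
    if now - last_reminder_at >= remind_every:
        return Action.REPOST_REMINDER
    return Action.NONE
```

Swapping `MOVE_OUT` for "add a label" would keep the same logic while avoiding removal, and a longer `remind_every` would reduce the notification flood mentioned above.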

dgelessus commented 5 years ago

Hm, that's another question: once a format is specced and merged into ksf, should the (closed) issues be used for further discussion of the format, for example if the spec needs to be fixed or extended later? Or should that happen on the main kaitai_struct bugtracker?

Alternatively the ksf bugtracker could be used for both normal issues and the format list, with all format list entries having a certain label. That way you can easily filter the issues to only formats/non-formats.

dgelessus commented 5 years ago

if the template is not fixed within the deadline, it is removed from the repo.

What would "remove" mean? Move to another repo? Delete entirely from public view? Why not just tag and/or close the issue, which is the normal procedure for invalid issues?

GreyCat commented 5 years ago

They are closed, but they are in that repo.

What's the problem with them?

User adds an issue -> within a few seconds a bot visits and checks -> if it doesn't match the template, the bot creates a comment mentioning the issue author and asking them to fix the template -> in the case of inactivity the bot deletes the message and recreates it, which causes the mentioned person to get a notification -> if the template is not fixed within the deadline, it is removed from the repo.

Sounds pretty complicated. Let's start with just the template and doing dialogues with contributors/suggestors manually? Removing anything from the repo seems really harsh (typically it's done in severe cases like security / privacy exposures), and, in my opinion, even closing the issue because it was just formatted incorrectly looks pretty dire. It's better to have a dialogue than send people away with a bot.

KOLANICH commented 5 years ago

Should formats be allowed that are in common use, but for which no public information or spec exists? That is, should we allow formats that require reverse engineering? I'd vote for no - unless there are already ongoing reverse engineering efforts somewhere that can be linked to (this would count as unofficial reference information).

Of course they should be allowed. If there is a format, there can be a ksy spec. So either there already is a spec, or it should be created. And if it should be created, we should probably track the format, so that anyone who wants to create a spec can just search for the format name in the issues and get all the info needed to start working immediately.

Hm, that's another question: once a format is specced and merged into ksf, should the (closed) issues be used for further discussion of the format, for example if the spec needs to be fixed or extended later? Or should that happen on the main kaitai_struct bugtracker?

Alternatively the ksf bugtracker could be used for both normal issues and the format list, with all format list entries having a certain label. That way you can easily filter the issues to only formats/non-formats.

I have no opinion on this currently. Both options have their own benefits and drawbacks.

What would "remove" mean? Move to another repo? Delete entirely from public view? Why not just tag and/or close the issue, which is the normal procedure for invalid issues?

Because we are misusing issues mechanism. We use it not only as issues, but as a database. As something machine-readable and machine interfaciable. And human-interfaciable the same way - that's why we need a minimal number of false positives on search queries. So one issue - one format. An issue is a hub for all the activity around that format. I mean it not necessarily should happen in that issue, but it should be possible to reach it from this issue with minimum effort. Think about issue as about a primary key in an RDBMS or as about a document in document-oriented DBMS.

GreyCat commented 5 years ago

Because we are misusing issues mechanism. We use it not only as issues, but as a database.

I don't see why it's "misusing"; it's a perfectly fine use of a ready-made database.

As something machine-readable and machine interfaciable. And human-interfaciable the same way. So one issue - one format. An issue is a hub for all the activity around that format. I mean it not necessarily should happen in that issue, but it should be possible to reach it from this issue with minimum effort. Think about issue as about a primary key in a RDB or as about a document in document-oriented DB.

There are plenty of ways to make it machine-readable. I don't see why removal of issues is better than closing it. Better yet, why not have both: a special label would signify "proper and good" formal format-related issues. Everything else (like reporting bugs, etc) is still perfectly fine. It's not clever to throw away some people's effort in trying to point out some obvious problem to us just because they didn't format it properly.

KOLANICH commented 5 years ago

I don't see why removal of issues is better than closing it.

Because IMHO closed issues have the semantics "this format has already been implemented and merged, it is in the repo" and open ones have "there is currently no spec for this format merged". And any other issues don't fit into these semantics.

It's not clever to throw away some people's effort in trying to point out some obvious problem to us just because they didn't format it properly.

We are not throwing it out, just moving it into a bug tracker where there are no requirements on issue format, like kaitai-io/kaitai_struct.

Better yet, why not have both: a special label would signify "proper and good" formal format-related issues. Everything else (like reporting bugs, etc) is still perfectly fine.

It may be a good solution: if there are issues about a format, this means it is already in the repo. But there is a problem. The goal of imposing the rule 1 format - 1 issue is to get not more than a single page of search results when searching by format name. I am not sure that this will be the case if the repo will be bloated by issues about the same format. We can filter the search results by a label, but it is a manual operation; AFAIK GH currently doesn't allow setting search conditions that are applied by default.

dgelessus commented 5 years ago

The goal of imposing the rule 1 format - 1 issue is to get not more than a single page of search results when searching by format name. I am not sure that this will be the case if the repo will be bloated by issues about the same format.

I'm not sure that this is a good goal. The script for the website might want exactly one issue per format (which follows the template), that can be handled using a label. But I don't understand why that structure would be useful for humans searching the issue tracker, especially for formats that are already merged. If you force all discussion about one format into a single issue, you get one long unorganized discussion that isn't very readable.

KOLANICH commented 5 years ago

If you force all discussion about one format into a single issue, you get one long unorganized discussion that isn't very readable.

Yes, this is one of the reasons I am unsure where we should place the issues about merged formats. We could place them into ksf, but this will turn the whole ksf issue tracker into a mess (and the reason is that ksf is a mess itself: to keep it clean it should not contain any formats at all, and each format should have its own repo, but that feels like overkill). Or we could place them somewhere else, keeping ksf issues clean, but that would encourage users to turn individual issues in that repo into a mess.

GreyCat commented 5 years ago

Because IMHO closed issues have the semantics "this format has already been implemented and merged, it is in the repo" and open ones have "there is currently no spec for this format merged". And any other issues don't fit into these semantics.

Why not just have?

I don't see much point in closing of proposals which do make at least some sense — someone else might pick it up later.

We are not throwing it out, just moving it into a bug tracker where there are no requirements on issue format, like kaitai-io/kaitai_struct.

Moving issues between repos is a complex & tedious process (i.e. I'm 99% sure it will be a problem to do that with a bot API). Removing or adding a tag is quick & simple.

We can filter the search results by a label, but it is a manual operation; AFAIK GH currently doesn't allow setting search conditions that are applied by default.

We can just offer a link for our end users in CONTRIBUTING or from formats.kaitai.io start page to pre-filtered set of results by that label?
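
Such a pre-filtered link is just an issue-search URL with the query baked in. A short Python sketch of how it could be generated follows; the label name `spec-needed` used in the usage note is only one of the names floated in this thread, and `issues_url` is a made-up helper.

```python
from urllib.parse import urlencode

REPO = "kaitai-io/kaitai_struct_formats"


def issues_url(label: str, text: str = "", only_open: bool = True) -> str:
    """Build a GitHub issue-search URL pre-filtered by one label."""
    terms = ["is:issue"]
    if only_open:
        terms.append("is:open")
    # Labels containing spaces must be quoted in GitHub search syntax.
    terms.append(f'label:"{label}"' if " " in label else f"label:{label}")
    if text:
        terms.append(text)
    return f"https://github.com/{REPO}/issues?" + urlencode({"q": " ".join(terms)})
```

For example, `issues_url("spec-needed")` yields a link that could be pasted into CONTRIBUTING.md, and `issues_url("spec-needed", "hdf5")` narrows it to one format name.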

KOLANICH commented 5 years ago

We can just offer a link for our end users in CONTRIBUTING or from formats.kaitai.io start page to pre-filtered set of results by that label?

I don't feel like that link will be used by anyone: people need to search according to their own criteria. So we need our own form. And since we need our own form... the GH search mechanism for issues doesn't fit our needs: it searches based on all words in the text. We need a more fine-grained search, hosted on our own servers and keeping its own index. Unfortunately I know nothing about search engines, so I have no idea what could be helpful.

I don't see much point in closing of proposals which do make at least some sense — someone else might pick it up later.

Neither do I.

GreyCat commented 5 years ago

Ok, let me summarize the minimal set of stuff that should be ok to start with:

So, let's start from these small things on manual control? It's always possible to add automation later.

KOLANICH commented 5 years ago

If the work is already undergoing, a link to work repo / gist / whatever with a clear designation like WIP: http://...

Please note that the second part of https://github.com/kaitai-io/kaitai_struct_formats/issues/137 is also YAML. I mean that (in the current understanding) the issue should contain 2 yaml blocks, one is a boilerplate and some format metadata, another one is data which is not a part of the boilerplate. After that a free-form text comment.

And we don't need to migrate the list manually now because we have not yet determined the optimal format of the issues. Instead we need to migrate it into an intermediate representation, from which the issues can be easily generated. I guess JSON is fine for the intermediate representation.
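
To illustrate the intermediate-representation idea, here is a Python sketch that turns JSON records into issue drafts (title plus body text). The record schema (`title`, `links`, `wip`) and the body layout are assumptions for this sketch only, since the final issue format had not been decided at this point in the thread.

```python
import json
from typing import Dict, List


def issue_drafts(records_json: str) -> List[Dict[str, str]]:
    """Turn a JSON array of format records into issue title/body drafts."""
    drafts = []
    for record in json.loads(records_json):
        lines = [f"Format: {record['title']}"]
        # Reference material collected for the format, one URL per line.
        for url in record.get("links", []):
            lines.append(f"- {url}")
        # Clearly designated link to ongoing work, if any.
        if record.get("wip"):
            lines.append(f"WIP: {record['wip']}")
        drafts.append({"title": record["title"], "body": "\n".join(lines)})
    return drafts
```

Since only this function knows the body layout, the list can be migrated to JSON once and the issues regenerated whenever the agreed format changes.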

My suggestion is just new format.

IMHO spec-needed, and when merging this label is to be removed.

dgelessus commented 5 years ago

(in the current understanding) the issue should contain 2 yaml blocks, one is a boilerplate and some format metadata, another one is data which is not a part of the boilerplate. After that a free-form text comment.

Why not merge both YAML blocks into one?

Also I'm not sure if it makes sense to link WIP specs in the YAML data, because the top comment can only be edited by the issue author. For example, if you have an issue where the author started a WIP spec and never finished it, and someone else later forks it (or makes a new spec), then the second person can only link their spec in a comment, and can't add it to the official list (at least not without asking someone with admin rights to do the edit).

IMHO spec-needed, and when merging this label is to be removed.

I don't have any opinions on the label name, but why would you remove the label when the spec is finished? That just adds an extra step and makes it harder to search for already finished spec issues. You can already filter out finished spec issues by the fact that they are closed.

KOLANICH commented 5 years ago

Why not merge both YAML blocks into one?

Because the first one can be copied as is to get a boilerplate for that format. It must be a syntactically valid KSY and it should be checked automatically.

The second one has completely different schema and should contain the things we don't want to have in a ksy.

Also I'm not sure if it makes sense to link WIP specs in the YAML data, because the top comment can only be edited by the issue author.

and a bot with moderator permissions ...

For example, if you have an issue where the author started a WIP spec and never finished it, and someone else later forks it (or makes a new spec), then the second person can only link their spec in a comment, and can't add it to the official list (at least not without asking someone with admin rights to do the edit).

The bot should check comments and add the properly encoded stuff into the top post. I mean posters can command the bot to some actions on the top post.

I don't have any opinions on the label name, but why would you remove the label when the spec is finished?

Because spec-needed assumes that spec is not yet implemented. For already implemented specs I guess implemented can be used.

That just adds an extra step and makes it harder to search for already finished spec issues.

Having separate labels for implemented and non-implemented makes it easier to search.

You can already filter out finished spec issues by the fact that they are closed.

It cannot distinguish between issues requesting specs which have later been implemented (label implemented) and issues about already-implemented specs (labels bug and improvement).
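
The distinction being argued for can be sketched as a tiny classifier over an issue's labels, shown here in Python. The label names are the ones proposed in this comment; the category strings and the function name are invented for the sketch.

```python
from typing import Iterable


def classify(labels: Iterable[str]) -> str:
    """Classify an issue purely by its labels, ignoring open/closed state."""
    names = set(labels)
    if "spec-needed" in names:
        return "format request, not implemented"
    if "implemented" in names:
        return "format request, implemented"
    if names & {"bug", "improvement"}:
        return "issue about an existing spec"
    return "unclassified"
```

Two closed issues, one labelled `implemented` and one labelled `bug`, land in different categories here, which the closed state alone would not tell apart.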

dgelessus commented 5 years ago

The bot should check comments and add the properly encoded stuff into the top post. I mean posters can command the bot to some actions on the top post.

This seems like a lot of work with little benefit. Implementing and testing this bot will take a while, and the end result won't be very easy or intuitive to use.

What is the end goal of making this much metadata machine-readable anyway? As I understand it, the machine-readability would only be used to display the list somewhere on formats.kaitai.io. I can understand having the ksy-style YAML block with basic metadata, that is useful data that could be displayed in an overview table. But for anything else I would just link people to the issue page itself, so that they can read the full information themselves. Especially when you have multiple WIP specs for the same format, you probably need the discussion context anyway to understand the differences between the specs.

Because spec-needed assumes that spec is not yet implemented.

That's just down to the label name. You can call it something else ("spec request" or whatever) so that the name makes sense regardless of whether it's been done already or not.

Having separate labels for implemented and non-implemented makes it easier to search.

It cannot distinguish between issues requesting specs which have later been implemented (label implemented) and issues about already-implemented specs (labels bug and improvement).

I don't understand these points, sorry. Here's the model I was thinking of, with a single "spec request" label (name could be changed obviously). Which part of this is not sufficient or difficult to search for?