Extended metainformation: title, licenses, etc

GreyCat commented 7 years ago

I propose to add more keys to the meta while we're at it (see #20 and #53) for making it easier to manage a large library of formats. My proposals:

title — something akin to <title>...</title> HTML element — human-readable name of the format, if it exists. If it doesn't, probably we'll just use id. For example, for gif.ksy it should be something like "GIF (Graphics Interchange Format)", for elf.ksy it should be "ELF (Executable and Linkable Format)", etc.
license — a string that contains machine-readable license reference for a format ksy spec, according to SPDX license expressions

More ideas:

some general areas of applications or just keywords / tags — ideally, should stick to some existing well-known classification?
links to format descriptions / IDs on some external websites:
- PRONOM ID, for example GIF 87a
- Apple Uniform Type Identifier?
- MIME type?

I want some peer review of this stuff before I add it. cc @koczkatamas @LogicAndTrick @markbook2?

koczkatamas commented 7 years ago

Great ideas, I like them all!

But make them optional by file format level and make for example the license property a requirement when the author wants to publish the .ksy into the format library.

Also it would be good if we could support file format identification[1] somehow, one of the options is to add file format identification information directly to .ksy (eg. magic numbers). Other is to use a 3rd-party tool to identify the file format and select the appropriate .ksy by MIME type.

Another question: is there any file format which has multiple MIME types / other identifiers? For example, JavaScript can be text/javascript and application/javascript... [2]

[1] http://www.forensicswiki.org/wiki/File_Format_Identification [2] http://stackoverflow.com/questions/4101394/javascript-mime-type

LogicAndTrick commented 7 years ago

Some thoughts:

Author information (name, email, website, etc)
KSY version number (if the ksy format is updated/changed you can increase the version)
Creation/modified date
List of external dependencies (e.g. if the format depends on an opaque external type)
Also like the idea of license, mime type, keywords. Magic number sounds like an interesting idea too. (Though I sometimes deal with formats that have more than one magic number)

GreyCat commented 7 years ago

> But make them optional by file format level

All that stuff serves only informational purpose (i.e. would be used for formats website generation) and thus has no direct impact on ksc → it would be optional for ksc usage. Getting the format into the format repo is another story: probably we'll want some "extra" fields filled, such as license, title and top-level doc (see #20).
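The split described here (everything optional for ksc, some fields required for the format repo) could be enforced by a small check in the repo tooling. This is only a sketch: the function name and the key list are assumptions based on the fields named in this comment, the spec is assumed to be already parsed into nested dicts, and nothing here is part of ksc.

```python
# Hypothetical publication check; the required fields follow the comment above
# (license, title, top-level doc). Not part of ksc.
REQUIRED_META_KEYS = ("title", "license")


def missing_for_publication(spec):
    """Return the names of fields a spec still lacks for the format repo."""
    meta = spec.get("meta", {})
    missing = [key for key in REQUIRED_META_KEYS if not meta.get(key)]
    # The top-level doc is assumed to sit beside meta, not inside it.
    if not spec.get("doc"):
        missing.append("doc")
    return missing


gif = {"meta": {"id": "gif", "title": "GIF (Graphics Interchange Format)"}}
print(missing_for_publication(gif))  # ['license', 'doc']
```

A spec that only has an id still compiles; it just would not be accepted into the library until the list comes back empty.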

> Also it would be good if we could support file format identification somehow, one of the options is to add file format identification information directly to .ksy (eg. magic numbers). Other is to use a 3rd-party tool to identify the file format and select the appropriate .ksy by MIME type.

Technically, we already have magic numbers. You just need to attempt parsing the file: if everything's fine, you've got a parsed file. If not, you'll get an "unexpected fixed contents" exception.
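Identification by attempted parsing might look roughly like the sketch below. The two parsers are toy stand-ins that only check the signature, and MagicMismatch is a made-up exception playing the role of the "unexpected fixed contents" error; real generated parsers and their exception classes are not shown here.

```python
import io


class MagicMismatch(Exception):
    """Stand-in for the parser's 'unexpected fixed contents' error."""


def parse_gif(stream):
    # Toy parser: only checks the leading "GIF" signature bytes.
    if stream.read(3) != b"GIF":
        raise MagicMismatch("gif")
    return "gif"


def parse_png(stream):
    # Toy parser: only checks the 8-byte PNG signature.
    if stream.read(8) != b"\x89PNG\r\n\x1a\n":
        raise MagicMismatch("png")
    return "png"


def identify(data, parsers):
    """Try each parser in turn; return the first result that parses cleanly."""
    for parse in parsers:
        try:
            return parse(io.BytesIO(data))
        except MagicMismatch:
            continue
    return None


print(identify(b"GIF89a...", [parse_png, parse_gif]))  # gif
```

Each candidate gets a fresh stream, so a failed attempt does not leave the read position somewhere in the middle of the file.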

MIME types are generally a huge mess, and probably won't be really useful for anything real, but can be kept for the sake of information references. As you've noted, there are quite a few duplicates, and even worse, there are entries like:

application/x-lha                               lha
application/x-lzh                               lzh

(which actually refer to the very same lzh archives, which, actually, have 3-4 very different versions of the format inside), or:

application/x-msdos-program                     com exe bat dll

(mkay, .dll is totally MS-DOS stuff, and it's absolutely no different from .bat files, right).

> Author information (name, email, website, etc)
> KSY version number (if the ksy format is updated/changed you can increase the version)
> Creation/modified date

I've seen most docstring formats sporting these fields, and have never seen a sane usage of them. All that stuff really belongs in version control, and at most should be embedded into files using some sort of RCS-style keywords. Even from the legal point of view, as far as I remember, it has been proven several times in court that proper version control history takes precedence over such (miskept and rarely maintained) tags.

> List of external dependencies (e.g. if the format depends on an opaque external type)

That's a good idea, but probably it should be discussed and implemented separately — as we're really talking about import / include type of statement that should probably affect compiler as well.

GreyCat commented 7 years ago

What about "Apple Uniform Type Identifier"? Anyone seen these used for anything in the wild?

koczkatamas commented 7 years ago

I had never seen the Apple Identifier until I opened your link today.

GreyCat commented 7 years ago

Ok, then I propose to add yet another "extensible" point for all that cross-referencing other identifiers and format catalogues stuff. Let's call it xref and use it like that:

meta:
  id: gif
  xref:
    mime: image/gif
    pronom:
      - fmt/3 # http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=619
      - fmt/4 # http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=620
    forensicswiki: GIF # http://forensicswiki.org/wiki/GIF
    wikidata: Q2192 # can be used to find links to all language versions in Wikipedia
    fileformat: http://www.fileformat.info/format/gif/egff.htm
    digitalpreservation.gov: fdd000133 # http://www.digitalpreservation.gov/formats/fdd/fdd000133.shtml
    rfc: ... # use to link to RFCs that describe the format

etc, etc.

ams-tschoening commented 7 years ago

If you already want to support RFCs, how about other official industry standards like ISO, IEEE etc.? And what if they are not publicly available, like the one I'm currently working on?

EN 13757-3 2012 Communication systems for and remote reading of meters - Part 3: Dedicated application layer

I think "EN..." is the official name for the standard in the German language, even though the whole text is in English and that name can be found in international sources. Would only the first line be added in your example, or the second, or both in one line of text, however the author sees fit? "EN..." might not be easily understandable for humans, while with your publicly available examples one could simply read the details on the web and know what RFC 1234 is about.

GreyCat commented 7 years ago

Generally, referencing some standard document is a somewhat complicated matter.

First, as you've pointed out, not all of them are available to the public (especially for free, especially in English, especially by querying one standard URL).

Tags like rfc: 1234 are useful, because one can always generate links like https://tools.ietf.org/html/rfc1234 and be sure that it exists and will always be valid. Adding a tag like din: EN 13757-3 2012 would probably be helpful for reference (at the very least, it's better than a comment), but I guess we can't just generate a clear link to it. The same, unfortunately, mostly goes for ISO, IEEE, GOST, JIS, CNS, etc. Sometimes they are actually available on the web somewhere, so it's super useful to put a link into the .ksy to spare some time for future fellow researchers.

Second, it's true that it might be a good idea to reference the full name of the standard as well. Although it's not very clear, it's still better than nothing: it will aid in searching by name if a search by number somehow fails.
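The distinction drawn above, between identifiers that expand into a predictable link and identifiers kept for reference only, could be sketched as a lookup table. The table and function name are hypothetical; the URL patterns are the ones that appear in this thread, and a value of None marks a catalogue for which no link can be generated.

```python
# URL templates taken from examples in this thread; None marks catalogues
# for which no predictable link exists. The table itself is hypothetical.
URL_TEMPLATES = {
    "rfc": "https://tools.ietf.org/html/rfc{}",
    "wikidata": "https://www.wikidata.org/wiki/{}",
    "forensicswiki": "http://forensicswiki.org/wiki/{}",
    "din": None,
}


def xref_links(xref):
    """Map each xref key to a list of (identifier, url-or-None) pairs."""
    links = {}
    for key, value in xref.items():
        # A key may hold one identifier or a list of them (as pronom does).
        ids = value if isinstance(value, list) else [value]
        template = URL_TEMPLATES.get(key)
        links[key] = [(i, template.format(i) if template else None) for i in ids]
    return links


print(xref_links({"rfc": 1234, "din": "EN 13757-3 2012"}))
```

Unknown keys are treated like din: the identifier is kept, but no link is produced.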

That probably brings us to yet another nested map solution, something like:

meta:
  xref:
    din:
      id: EN 13757-3 2012
      name:
        de: "Kommunikationssysteme für Zähler und deren Fernablesung - Teil 3: Spezielle Anwendungsschicht"
        en: "Communication systems for and remote reading of meters - Part 3: Dedicated application layer"
      url: http://example.com/path/to/something.pdf
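A consumer of such a nested entry needs some rule for choosing which name to display. One possible rule is sketched below; the function name and the fallback order are assumptions, and the entry is assumed to be already parsed into a dict shaped like the map above.

```python
def standard_title(entry, preferred=("en",)):
    """Pick a human-readable title from an xref entry shaped like the map above."""
    names = entry.get("name", {})
    if isinstance(names, str):
        # Tolerate a plain string when the author gave a single name.
        return names
    for lang in preferred:
        if lang in names:
            return names[lang]
    # Fall back to any available language, then to the bare identifier.
    return next(iter(names.values()), entry.get("id"))


din = {
    "id": "EN 13757-3 2012",
    "name": {
        "de": "Kommunikationssysteme für Zähler und deren Fernablesung - Teil 3: Spezielle Anwendungsschicht",
        "en": "Communication systems for and remote reading of meters - Part 3: Dedicated application layer",
    },
}
print(standard_title(din))
```

An entry with only an id still yields something printable, which keeps the multi-language names optional.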

However, that's not all of the problems. The actual problem is that an implementation needs to reference the standard (or any other format document) in context, i.e. for each type and attribute. Right now I usually just put it into comments, but I have an idea to create a special field to hold these references. Let's name it doc-ref, for example, and it would be something like:

types:
  params_setwindowext:
    doc-ref: section 2.3.5.30
    seq:
      - id: y
        type: s2
        doc: Vertical extent of the window in logical units.
      - id: x
        type: s2
        doc: Horizontal extent of the window in logical units.

What do you think of that?
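To illustrate what a documentation generator could do with such a field, here is a rough sketch that walks a spec and lists every doc-ref together with where it was found. The function name and the path notation are invented for this example, and the spec is assumed to be already parsed into nested dicts.

```python
def collect_doc_refs(node, path=""):
    """Yield (path, doc-ref) pairs from a parsed .ksy given as nested dicts."""
    if "doc-ref" in node:
        yield path or "/", node["doc-ref"]
    # Attributes may carry their own references too.
    for attr in node.get("seq", []):
        if "doc-ref" in attr:
            yield f"{path}/seq/{attr['id']}", attr["doc-ref"]
    for name, subtype in node.get("types", {}).items():
        yield from collect_doc_refs(subtype, f"{path}/types/{name}")


wmf = {
    "types": {
        "params_setwindowext": {
            "doc-ref": "section 2.3.5.30",
            "seq": [{"id": "y", "type": "s2"}, {"id": "x", "type": "s2"}],
        }
    }
}
for where, ref in collect_doc_refs(wmf):
    print(where, "->", ref)  # /types/params_setwindowext -> section 2.3.5.30
```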

ams-tschoening commented 7 years ago

I like your approach, just wondering if it's not over-engineering things. I guess supporting names in different languages and such will rarely be used, but I see clear benefits in providing human-readable document titles in addition to just short standard names and even URLs. Simply because I had such a URL in the past, where the PDF was readable on some public web server by accident and Google indexed it. :-)

In theory doc-ref would need to be connected to xref somehow to be unique, because section xy might be available in different mentioned standards/docs/whatever. In my case, for example, 95% of my work is related to EN 13757-3 2012, while the header of the whole thing comes from EN 13757-4 2013. Both are described in one and the same KSY because the header is pretty small; that made the implementation easier, and it is easier for me to understand that way.

GreyCat commented 7 years ago

Names in different languages are, unfortunately, a reality. I've dug around and, as far as I understand:

all DIN standards have both German+English names (and usually have full English translations)
all GOST standards have Russian+English names (but rarely have translations of anything beyond the title)
all JIS standards have Japanese+English names (again, almost never seem to have official full text translations)

etc.

I guess we should just leave doc-ref as an arbitrary string and use common sense to specify references there in some human-readable form. I doubt that we'll want to undertake an ambitious conquest to make "a universal system to be able to reference any part of any document on the planet" here.

davidhicks commented 7 years ago

I suggest just placing the Wikidata identifier for each file format, as Wikidata is better placed to store structured information on each file format, i.e. links to documentation, version information, links to PRONOM and other databases, etc. No need to duplicate information that already exists elsewhere.

Example:

meta:
  id: png_1_2
  title: Portable Network Graphics, version 1.2
  wikidata: Q27229642
  ...

In this example, the matching Wikidata item is https://www.wikidata.org/wiki/Q27229642

GreyCat commented 7 years ago

Aww, looks like something ate my huge reply ;) So, I'll repeat it in a somewhat terser way.

Basically, I have two huge concerns about using Wikidata:

Our understanding of what constitutes a format might be pretty different from the one at Wikidata. For example, the Guitar Pro application uses two major (very different) versions of the format: the first one is an older binary format (.gtp, .gp3, .gp4, .gp5 file extensions), and the second one is a newer XML-like format (.gpx). This is probably how we're going to support it, i.e. there would be 2 .ksy files. However, Wikidata might want to treat every file extension as an individual format (which caters to a casual user's point of view).
Wikidata has a fairly strict notability policy, i.e. a format must be notable to be included in Wikidata. "Notability" usually boils down to having some 3rd party publications about this format. I'm really not 100% sure that all our formats meet those criteria, and that would put us in a very awkward situation (i.e. we add a format to Wikidata, reference it in .ksy, and eventually it gets deleted from Wikidata due to being non-notable).

Last, but not least, we have not only things that are strictly "file formats", but also:

some parts of file formats (for example, DWARF debugging info that is shared between several executable formats)
network protocols, packets, etc
some EEPROM / firmware layout maps
compression stream formats
etc, etc.

Many of them have extra references, such as RFCs, various international / national standards (ISO, DIN, GOST, IEEE, etc), but they are either out of scope of the Wikidata file formats project, or, again, they might be non-notable themselves.

davidhicks commented 7 years ago

On mapping of Wikidata items to .ksy files

The mapping between Wikidata items and .ksy files need not be 1-to-1. See for example https://www.wikidata.org/wiki/Q136218 which describes the ZIP family of formats and includes a mapping to the relevant .ksy file.

On notability

I agree that some items described (EEPROM/firmware images) may not meet Wikidata's notability guidelines as there may be no public information about a chip or firmware. For file formats, network protocols and compression schemes, the vast majority would meet Wikidata notability guidelines as they are tangible items for which existence can be proven by at least one reliable source. Wikipedia has much stricter notability guidelines, and most file formats/network protocols would fail to meet the notability criteria for Wikipedia. However, the rejected file formats/network protocols could still be included in Wikidata as its notability criteria are much looser. RFCs, standards and other documentation should all be notable as far as Wikidata is concerned as they are tangible items whose existence can be proven with reliable sources. Wikidata has thousands of items for scientific papers, so RFCs/standards are no less notable.

GreyCat commented 7 years ago

So, the bottom line is that there are at least some cases when we'd like not to rely completely on Wikidata, so I guess it's worth implementing a more complete approach.

Still, it's probably a good idea to implement lookup into Wikidata using its API. For the record, the simplest query is probably something like:

curl https://www.wikidata.org/wiki/Special:EntityData/Q136218.json

davidhicks commented 7 years ago

Perhaps look up Wikidata where possible to fill in gaps/fields that aren't specified in the .ksy file?

GreyCat commented 7 years ago

Yeah, exactly :)
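The gap-filling idea agreed on here might be sketched as follows. The function names are hypothetical, only the title is filled in, and the response layout (entities, then the QID, then labels, then a language code, then value) is an assumption about the EntityData JSON that should be checked against a real response. Values already present in the .ksy always win.

```python
import json
import urllib.request


def entity_data_url(qid):
    # Same endpoint as the curl command earlier in the thread.
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def fetch_entity(qid):
    """Download one entity; not exercised here because it needs network access."""
    request = urllib.request.Request(
        entity_data_url(qid), headers={"User-Agent": "ksy-xref-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        # Assumed layout: {"entities": {"<qid>": {...}}}
        return json.load(response)["entities"][qid]


def fill_gaps(meta, entity, lang="en"):
    """Return a copy of meta with a missing title taken from the Wikidata label."""
    merged = dict(meta)
    label = entity.get("labels", {}).get(lang, {}).get("value")
    if label and "title" not in merged:
        merged["title"] = label
    return merged


sample = {"labels": {"en": {"language": "en", "value": "ZIP"}}}
print(fill_gaps({"id": "zip"}, sample))  # {'id': 'zip', 'title': 'ZIP'}
```

Keeping the download separate from the merge means the merge can be run on cached data, so a build does not have to depend on Wikidata being reachable.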

GreyCat commented 7 years ago

Finally, I have enabled meta/xref as a valid key, without any checks for its contents. I believe we now need a good documentation chapter that lists all supported keys, but probably, for the time being, the only real formal consumer of this information would be the formats.kaitai.io build script, so I guess we should start with it.

kaitai-io / kaitai_struct

Extended metainformation: title, licenses, etc #59
