Closed GreyCat closed 7 years ago
Great ideas, I like them all!
But make them optional by file format level and make for example the license property a requirement when the author wants to publish the .ksy into the format library.
Also it would be good if we could support file format identification[1] somehow, one of the options is to add file format identification information directly to .ksy (eg. magic numbers). Other is to use a 3rd-party tool to identify the file format and select the appropriate .ksy by MIME type.
Other question: is there any file format which has multiple MIME types / other identifiers? For example, JavaScript can be text/javascript and application/javascript ... [2]
[1] http://www.forensicswiki.org/wiki/File_Format_Identification
[2] http://stackoverflow.com/questions/4101394/javascript-mime-type
Some thoughts:
But make them optional by file format level
All that stuff serves only an informational purpose (i.e. would be used for formats website generation) and thus has no direct impact on ksc → it would be optional for ksc usage. Getting the format into the format repo is another story: probably we'll want some "extra" fields filled, such as license, title and top-level doc (see #20).
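To make the "optional for ksc, required for the repo" split concrete, here is a rough Python sketch of a publishing check. The function name and the exact list of required fields are assumptions for illustration; nothing like this exists in ksc, and the set of required fields is still open for discussion.

```python
# Assumption: these are the "extra" meta fields the format repo would require.
REQUIRED_FOR_REPO = ("title", "license")


def missing_repo_fields(ksy: dict) -> list:
    """List fields that are absent but would be needed for publishing."""
    meta = ksy.get("meta", {})
    missing = [f"meta/{key}" for key in REQUIRED_FOR_REPO if not meta.get(key)]
    # Assumption: the top-level doc sits next to meta, not inside it.
    if not ksy.get("doc"):
        missing.append("doc")
    return missing
```

An empty result would mean the spec is ready for the format library; ksc itself would never run this check.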
Also it would be good if we could support file format identification somehow, one of the options is to add file format identification information directly to .ksy (eg. magic numbers). Other is to use a 3rd-party tool to identify the file format and select the appropriate .ksy by MIME type.
Technically, we already have magic numbers. You just need to attempt parsing the file: if everything's fine, you've got a parsed file. If not, you'll get an unexpected fixed contents exception.
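Identification by trial parsing could look roughly like the following Python sketch. The two parser functions and the MagicMismatch exception are hypothetical stand-ins for parsers generated from .ksy files; they are not the actual Kaitai Struct runtime API.

```python
class MagicMismatch(Exception):
    """Hypothetical stand-in for the 'unexpected fixed contents' error."""


def parse_gif(data: bytes) -> dict:
    # Stand-in for a generated parser: check the fixed contents first.
    if data[:3] != b"GIF":
        raise MagicMismatch("expected GIF signature")
    return {"format": "gif", "version": data[3:6].decode("ascii")}


def parse_png(data: bytes) -> dict:
    if data[:8] != b"\x89PNG\r\n\x1a\n":
        raise MagicMismatch("expected PNG signature")
    return {"format": "png"}


# Hypothetical registry of candidate parsers, tried in order.
CANDIDATES = [parse_gif, parse_png]


def identify(data: bytes):
    """Return the first successful parse result, or None if nothing matches."""
    for parser in CANDIDATES:
        try:
            return parser(data)
        except MagicMismatch:
            continue
    return None
```

The obvious cost is that every candidate may have to be tried, so this only scales if failing parsers bail out early at the magic check.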
MIME types are generally a huge mess, and probably won't be really useful for anything real, but can be kept for the sake of information references. As you've noted, there are quite a few duplicates, and even worse, there are entries like:
application/x-lha lha
application/x-lzh lzh
(which actually refer to the very same lzh archives, which, actually, have 3-4 very different versions of the format inside), or:
application/x-msdos-program com exe bat dll
(mkay, .dll is totally MS-DOS stuff, and it's absolutely no different from .bat files, right).
Author information (name, email, website, etc)
KSY version number (if the ksy format is updated/changed you can increase the version)
Creation/modified date
I've seen most docstring formats sporting these fields, and never ever seen sane usage of such fields. All that stuff really belongs to version control — and at most should be embedded into files using some sort of RCS-style keywords. Even from the legal point of view, as far as I remember, it was proven several times in the court that proper version control history takes precedence over such (miskept and rarely maintained) tags.
List of external dependencies (e.g. if the format depends on an opaque external type)
That's a good idea, but it should probably be discussed and implemented separately, as we're really talking about an import / include type of statement that should probably affect the compiler as well.
What about "Apple Uniform Type Identifier"? Anyone seen these used for anything in the wild?
I had never seen the Apple identifier until I opened your link today.
Ok, then I propose to add yet another "extensible" point for all that cross-referencing other identifiers and format catalogues stuff. Let's call it xref and use it like that:
meta:
  id: gif
  xref:
    mime: image/gif
    pronom:
      - fmt/3 # http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=619
      - fmt/4 # http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=620
    forensicswiki: GIF # http://forensicswiki.org/wiki/GIF
    wikidata: Q2192 # can be used to find links to all language versions in Wikipedia
    fileformat: http://www.fileformat.info/format/gif/egff.htm
    digitalpreservation.gov: fdd000133 # http://www.digitalpreservation.gov/formats/fdd/fdd000133.shtml
    rfc: ... # use to link to RFCs that describe the format
etc, etc.
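A consumer such as a website generator could turn these identifiers into links. The Python sketch below is only an illustration: the URL templates are inferred from the comments in the example above, and the function and table names are made up. Note that PRONOM links cannot be derived this way, since the URL uses an internal id (619) rather than the fmt/3 identifier.

```python
# URL templates inferred from the comments in the xref example;
# the table and function names are illustrative only.
URL_TEMPLATES = {
    "wikidata": "https://www.wikidata.org/wiki/{}",
    "forensicswiki": "http://forensicswiki.org/wiki/{}",
    "digitalpreservation.gov": "http://www.digitalpreservation.gov/formats/fdd/{}.shtml",
}


def xref_links(xref: dict) -> dict:
    """Map each xref key to a list of URLs; unknown keys get an empty list."""
    links = {}
    for key, value in xref.items():
        # A key may hold one identifier or a list of them (as pronom does).
        values = value if isinstance(value, list) else [value]
        template = URL_TEMPLATES.get(key)
        links[key] = [template.format(v) for v in values] if template else []
    return links
```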
If you already want to support RFCs, how about other official industry standards like ISO, IEEE etc.? And what about if such are not publicly available, like what I'm currently working on?
EN 13757-3 2012 Communication systems for and remote reading of meters - Part 3: Dedicated application layer
I think "EN..." is the official name for the standard in German language, even though the whole text is in English and that name can be found in international sources. Would only the first line be added in your example, or the second, or both in one line of text however the author sees fit? EN... might not be easily understandable for humans, while with your publicly available examples one could simply read up on the details on the web and know what RFC 1234 is about.
Generally, referencing some standard document is somewhat complicated matter.
First, as you've pointed out, not all of them are available to public (especially for free, especially in English, especially by querying one standard URL).
Tags like rfc: 1234 are useful, because one can always generate links like https://tools.ietf.org/html/rfc1234 and be sure that it exists and will always be valid. Adding a tag like din: EN 13757-3 2012 would probably be helpful for reference (at the very least, it's better than a comment), but I guess we can't just generate a clear link to it. The same, unfortunately, mostly goes for ISO, IEEE, GOST, JIS, CNS, etc. Sometimes they are actually available on the web somewhere, so it's super useful to put a link into the .ksy to save future fellow researchers some time.
Second, it's true that it might be a good idea to reference the full name of the standard as well. Although it's not very clear, it's still better than nothing: it will aid in searching by name, if search by number somehow fails.
That probably brings us to a yet another nested map solution, something like:
meta:
  xref:
    din:
      id: EN 13757-3 2012
      name:
        de: Kommunikationssysteme für Zähler und deren Fernablesung - Teil 3: Spezielle Anwendungsschicht
        en: Communication systems for and remote reading of meters - Part 3: Dedicated application layer
      url: http://example.com/path/to/something.pdf
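If both the short form (din: EN 13757-3 2012) and the nested map were allowed, a consumer would probably want to normalize them into one shape first. A small Python sketch, assuming the id / name / url keys from the example; the function name is hypothetical.

```python
def normalize_xref_entry(entry) -> dict:
    """Turn either a short string or a nested map into one uniform shape."""
    if isinstance(entry, dict):
        return {
            "id": entry.get("id"),
            "name": dict(entry.get("name", {})),  # language code -> title
            "url": entry.get("url"),
        }
    # Short form: the value is just the identifier.
    return {"id": str(entry), "name": {}, "url": None}
```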
However, that's not all of the problems. The actual problem is that implementation needs to reference the standard (or any other format document) in context, i.e. for each type and attribute. Right now I usually just put it into comments, but I have an idea to create a special field to hold these references. Let's name it doc-ref, for example, and it would be something like:
types:
  params_setwindowext:
    doc-ref: section 2.3.5.30
    seq:
      - id: y
        type: s2
        doc: Vertical extent of the window in logical units.
      - id: x
        type: s2
        doc: Horizontal extent of the window in logical units.
What do you think of that?
I like your approach, just wondering if it's not over-engineering things. I guess supporting names in different languages and such will rarely be used, but I see clear benefits in providing human readable document titles in addition to just short standard names and even URLs. Simply because I had such a URL in the past where the PDF was readable on some public web server by accident and Google indexed it. :-)
In theory doc-ref would need to be connected to xref somehow to be unique, because section xy might exist in several of the mentioned standards/docs/whatever. In my case, for example, 95% of my work is related to EN 13757-3 2012, while the header of the whole thing comes from EN 13757-4 2013. Both are described in one and the same KSY because the header is pretty small; it made the implementation easier and is easier for me to understand that way.
Names in different languages are, unfortunately, a reality. I've dug around and, as far as I understand:
etc.
I guess we should just leave doc-ref as an arbitrary string and use common sense to specify references there in some human-readable form. I doubt that we'll want to undertake an ambitious conquest to make "a universal system to be able to reference any part of any document on the planet" here.
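Treating doc-ref as an opaque string still leaves it easy to collect for documentation output. A rough Python sketch, assuming the .ksy has already been loaded into nested dicts; the function name is made up, and it only walks types, not individual attributes in seq.

```python
def collect_doc_refs(node: dict, path: str = "") -> list:
    """Walk nested types and return (path, doc-ref) pairs as plain strings."""
    refs = []
    if "doc-ref" in node:
        # No parsing or validation: the reference is kept verbatim.
        refs.append((path or "/", node["doc-ref"]))
    for name, subtype in node.get("types", {}).items():
        refs.extend(collect_doc_refs(subtype, f"{path}/types/{name}"))
    return refs
```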
I suggest just placing the Wikidata identifier for each file format, as Wikidata is better placed to store structured information on each file format, i.e. links to documentation, version information, links to PRONOM and other databases, etc. No need to duplicate information that already exists elsewhere.
Example:
meta:
  id: png_1_2
  title: Portable Network Graphics, version 1.2
  wikidata: Q27229642
...
In this example, the matching Wikidata item is https://www.wikidata.org/wiki/Q27229642
Aww, looks like something has eaten up my huge reply ;) So, I'll repeat it in a somewhat terser way.
Basically, I have two huge concerns about using Wikidata:
Our understanding of what constitutes a format might be pretty different from the one at Wikidata. For example, the Guitar Pro application uses two major (very different) versions of the format: the first one is an older binary format (.gtp, .gp3, .gp4, .gp5 file extensions), and the second one is a newer XML-like format (.gpx). This is probably how we're going to support it, i.e. there would be 2 .ksy files. However, Wikidata might want to treat every file extension as an individual format (which caters to a casual user's point of view).
Wikidata has a fairly strict notability policy, i.e. a format must be notable to be included in Wikidata. "Notability" usually boils down to having some 3rd party publications about the format. I'm really not 100% sure that all our formats meet those criteria, and that would put us in a very awkward situation (i.e. we add a format to Wikidata, reference it in the .ksy, and eventually it gets deleted from Wikidata for being non-notable).
Last, but not least, we have not only things that are strictly "file formats", but also:
Many of them have extra references, such as RFCs, various international / national standards (ISO, DIN, GOST, IEEE, etc), but they are either out of scope of the Wikidata file formats project, or, again, they might be non-notable by themselves.
The mapping between Wikidata items need not be 1-to-1. See for example https://www.wikidata.org/wiki/Q136218 which describes the ZIP family of formats and includes a mapping to the relevant .ksy file.
I agree that some of the items described (EEPROM/firmware images) may not meet Wikidata's notability guidelines, as there may be no public information about a chip or firmware. For file formats, network protocols and compression schemes, the vast majority would meet Wikidata notability guidelines, as they are tangible items whose existence can be proven by at least one reliable source. Wikipedia has much stricter notability guidelines, and most file formats/network protocols would fail to meet the notability criteria for Wikipedia. However, the rejected file formats/network protocols could still be included in Wikidata, as its notability criteria are much looser. RFCs, standards and other documentation should all be notable as far as Wikidata is concerned, as they are tangible items whose existence can be proven with reliable sources. Wikidata has thousands of items for scientific papers, so RFCs/standards are no less notable.
So, the bottom line is that there are at least some cases when we'd like not to rely completely on Wikidata, so I guess it's worth implementing more complete approach.
Still, it's probably a good idea to implement lookup into Wikidata using its API. For the record, the simplest query is probably something like:
curl https://www.wikidata.org/wiki/Special:EntityData/Q136218.json
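The same lookup could be sketched in Python. The URL pattern is taken from the curl command above; the JSON layout (entities, then the item id, then labels by language code) is how I understand the EntityData output and should be checked against a real response. Both function names are hypothetical, and the actual HTTP fetch is left out on purpose.

```python
import json


def entity_data_url(qid: str) -> str:
    """Build the Special:EntityData URL used in the curl command above."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def english_label(payload: str, qid: str):
    """Extract the English label from an EntityData JSON document, if any."""
    # Assumed layout: {"entities": {QID: {"labels": {"en": {"value": ...}}}}}
    entity = json.loads(payload).get("entities", {}).get(qid, {})
    return entity.get("labels", {}).get("en", {}).get("value")
```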
Perhaps lookup Wikidata where possible to fill in gaps/fields that aren't specified in the .ksy file?
Yeah, exactly :)
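Gap filling could be as simple as the Python sketch below, where whatever is written in the .ksy always wins over looked-up values. The function name and the idea of passing the looked-up data in as a plain dict are assumptions for illustration.

```python
def fill_gaps(meta: dict, looked_up: dict) -> dict:
    """Return a copy of meta with missing or empty keys taken from looked_up."""
    merged = dict(looked_up)
    # Values written in the .ksy always win over looked-up ones.
    merged.update({k: v for k, v in meta.items() if v not in (None, "", [], {})})
    return merged
```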
Finally, I have enabled meta/xref as a valid key, without any checks for its contents. I believe we now need a good documentation chapter that lists all supported keys, but probably, for the time being, the only real formal consumer of this information would be the formats.kaitai.io build script, so I guess we should start with it.
I propose to add more keys to the meta while we're at it (see #20 and #53) for making it easier to manage a large library of formats. My proposals:
title: something akin to the <title>...</title> HTML element, i.e. a human-readable name of the format, if it exists. If it doesn't, probably we'll just use id. For example, for gif.ksy it should be something like "GIF (Graphics Interchange Format)", for elf.ksy it should be "ELF (Executable and Linkable Format)", etc.
license: a string that contains a machine-readable license reference for a format ksy spec, according to SPDX license expressions.
More ideas:
I want some peer review of this stuff before I add it. cc @koczkatamas @LogicAndTrick @markbook2?