rawles / edit.tf

An in-browser editor for teletext frames.
http://edit.tf/
GNU General Public License v3.0

Proposals and request for comment on extending the url hash format #60

Open ZXGuesser opened 7 years ago

ZXGuesser commented 7 years ago

As requested by @rawles I'm presenting my suggestions for extending the hash format here for discussion.

This is going to be a long and rambling post, so the TL;DR is: I'd like to store more colon-delimited data on the end of the hash string :)

The way I see it there are three things it would be useful to add to the hash format, in descending order of usefulness to edit.tf users: the page header data (page number, subcode, and control bits), navigation data for editorial linking, and presentation enhancement data.

The hash string already uses the ":" character as a delimiter so I propose also using this to delimit optional data after the current base64 hash of the page data, as that makes slicing the string simple. This would cause an issue for @peterkvt80 though as wxted's import code searches for the last colon in the string to locate the base64 encoded page data from a complete URL.

I envisage a format something like this: #0:<base64 encoded page data>:<optional header data>:<optional enhancement data>:<optional navigation data>:<more optional enhancement data>...

I can think of two ways to support optional substrings. Probably the easiest is for each of these optional fields to have fixed positions, and to create zero-length substrings to pad out to the required field position, e.g. <edit.tf metadata>:<base64 encoded page data>:::<navigation data>. An alternative method would be to indicate the data type at the start of each substring, for example: <edit.tf metadata>:<base64 encoded page data>:<0header data>:<27navigation data>:etc. I prefer the first one because I think the second will waste more bytes for little benefit and requires extra parsing.
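To make that concrete, here's a minimal sketch of the fixed-position variant in the editor's own JavaScript (the field order and helper names are illustrative, not an agreed layout):

```js
// Split an extended hash into fixed-position fields; missing or
// zero-length substrings just come back empty.
function parseExtendedHash(hash) {
    var fields = hash.replace(/^#/, "").split(":");
    return {
        metadata:   fields[0] || "",
        pageData:   fields[1] || "",
        headerData: fields[2] || "",
        navData:    fields[3] || ""
    };
}

// Reassemble in the same order, trimming trailing colons so empty
// optional fields at the end cost nothing.
function buildExtendedHash(f) {
    return ("#" + [f.metadata, f.pageData, f.headerData, f.navData]
        .join(":")).replace(/:+$/, "");
}
```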

For the substrings themselves I don't yet have a firm proposal for a format. The page header data comprises 35 bits, which if base64 encoded will fit in 6 characters. As plain hex nibbles they use 9 characters but are somewhat human-readable and easier to parse, so it's a trade-off between readability and saving three characters in the length of the hash string. The last three control bits of the page header encode the national option subset character selection. If this substring is not present, a best guess at the character set should be made based on the character set code in the edit.tf metadata substring.

The format of a packet 27/0 (for editorial linking) comprises six links, a link control nibble, and a 16-bit CRC checksum. Only one bit of the link control nibble is used, as a flag. Each link is six nibbles encoding the page number, subcode, and magazine. The magazine is encoded relative to the magazine number in the page header packet. I suggest a hash format should store absolute magazine numbers for links, to avoid recalculating them when changing the page number. I don't think there is any reason to store a page checksum as, like the relative magazine numbers, it would need constantly recalculating. It could be calculated once if exporting a 27/0 packet, though the page checksum isn't actually required, and exporting to tti files will usually want to use an FL line for linking instead of a packet 27 anyway. (Exporting packet 27 is only necessary if any link subcode is not equal to 3F7F, or the link control bit is clear.)

I'm a bit vague beyond this point. I would like to be able to extend the linking packet data to support compositional linking in presentation enhancement applications if needed, but this would only necessitate making the navigation substring longer with more data. If software reads the string in and just modifies data at the appropriate offsets it would be forwards compatible with further additions. The same goes for the substrings themselves. Software should keep the hash around after importing it and just modify the substrings it needs to when it updates or re-outputs the hash.

The remaining problem is page-related enhancement data, which won't be of any immediate interest to edit.tf. This, like anything to do with higher-level teletext, is annoyingly complicated. Packet 26 is simple in that it is a designation code nibble followed by thirteen data triplets (18 bits of data each), and there can be 0 to 15 such packets. The packets must have sequential designation codes, so those can be omitted in the hash substring and all the data run together in one block. To further save space, unused triplets following a terminator triplet can also be omitted so that the hash doesn't waste space encoding padding. The triplets being 18 bits means that they compress into base64 very easily.

Packet 28 has multiple types, of which 28/0 format 1, 28/1, and 28/4 are of relevance to editorial pages. All are coded as a designation code and 13 triplets, as with packet 26. Unlike packet 26 these do not have to be transmitted in order of designation code, so if combining them into one substring I would prepend the base64-coded triplet data with one character holding the designation code to identify them after slicing the substring by length.
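Because each 18-bit triplet is exactly three 6-bit groups, the base64 packing really is trivial. A sketch, assuming a URL-safe base64 alphabet (the alphabet choice is mine, not part of any proposal):

```js
// Pack an array of 18-bit X/26 triplets into base64: each triplet maps
// to exactly three base64 characters with no padding needed.
var B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

function tripletsToBase64(triplets) {
    var out = "";
    for (var i = 0; i < triplets.length; i++) {
        var t = triplets[i] & 0x3FFFF;        // mask to 18 bits
        out += B64.charAt((t >> 12) & 0x3F);  // bits 17..12
        out += B64.charAt((t >> 6) & 0x3F);   // bits 11..6
        out += B64.charAt(t & 0x3F);          // bits 5..0
    }
    return out;
}
```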

The character sets used by the page are encoded in packet 28, along with the three national option subset bits in the control bits of the page header.

pjfdirect commented 7 years ago

No objections here. It would do us artsy designer types a great favour if our work continued to travel well between editors, though :)

ZXGuesser commented 7 years ago

yep, that's exactly why I'm eager to find a solution that will suit everyone rather than just coming up with my own separate format :)

peterkvt80 commented 7 years ago

I'd like to work out something with the subcode so that I can import and export carousels. I'd implement something that can split a carousel into a set of separate pages and back.

ZXGuesser commented 7 years ago

Each subpage would be a separate hash with the page number and subcode encoded in its extended hash, so it should be easy to do that. If you were importing a hash into a carousel you could just ignore any page number and subcode if present. I'm trying to think of a good solution to dealing with carousels too, as at the moment my editor can only export an individual file for each subpage.

rawles commented 7 years ago

Hi there,

Thanks for writing this out as a bug, it’s really helpful. Thanks also for wanting to standardise the URL. Perfect example of our awesome teletext community doing things the right way!

I think part of the problem we’re having here is that when I started out with this editor I didn’t know much about teletext, and most of my experience was with the BBC Master, and therefore the SAA5050 chip. I thought I was being helpful by adding Hebrew and Cyrillic from the data sheet, but I can now see that teletext doesn’t work this way. So let’s imagine this (rather charitably for my sake) from the standpoint of a dual-purpose editor that does SAA5050 frames as well as broadcast teletext frames.

This might mean that for any frame, the character set chosen is either an SAA one or a teletext one. If an editor relies on the other one, then it just chooses a best guess. Alternatively, edit.tf could just be standardised to use the teletext settings, which is probably better. In any case, we can maybe define higher bits of the edit.tf metadata (field 0) for character set selection from the teletext standard, and I can move over to doing things properly. That is, the high bits are the proper teletext charset description, and the lower bits are the (vestigial) SAA5050 settings.

There’s also a kind of half-arsed feature in edit.tf for storing things like page and subpage numbers, so it would be nice to have that in the URL so I can finish off the feature. I only put it in for export to TTI, and then didn’t finish it.

One issue I have with optional data is that if too many things become optional, we'll end up with lots of colons delimiting empty fields. You point that out with the two methods of doing this optional stuff. Do we anticipate later fields needing to be added? If not, this is fine. I think it would help to work on a text file together to define a clear standard for these URLs, and what a valid URL is.

I’ve long had the ambition of also including a URL fragment which says the location of the page on a remote service, so that the editor can just update or load from the current page on a remote teletext server. I think that would be best done with REST, and the optional field would describe a REST endpoint, but I haven’t had time to work this out yet.

We need to consider that there's no maximum length for a URL but that different browsers have de facto limits. I think at the moment we have about 1200 characters, which seems to work okay, but maybe we should make sure to test the extended URLs with all the browsers we can, or at least look up the lengths that existing browsers tolerate. I think this is where we'd find carousels infeasible, though I suspect teletext is highly compressible, so small carousels might be possible with a standard compression scheme one day in the future.

I think we're already using a mixture of hexadecimal and base64, so we should just pick the most readable option, particularly if, in hex, bits belonging to the same setting are grouped more naturally.

I agree that magazines should be absolute, and no checksum is needed until these frames make their way to broadcast equipment.

I don’t quite understand ‘enhancement’, or ‘editorial pages’. I am a level 1 sort of person, and haven’t read the standard on what enhancements are possible. Maybe we could thrash out the details above and then consider this advanced stuff.

I’m keen to get Peter’s - @peterkvt80 - full approval on anything here too, so it’s great he’s participating in this thread. Thanks to you again @ZXGuesser for poring over the standards to get this right.

Simon

ZXGuesser commented 7 years ago

Your suggestion of having links to other data in the hash does touch on something I've been vaguely thinking about for my editor in the future but it's a fairly complicated concept relating more to a teletext service editor than a page editor so I'd prefer not to muddy the waters discussing that at this stage. Suffice to say that there are potentially other things that could be usefully stored in the hash in a further extended-extended-hash format if we leave it open ended.

With regard to ending up with lots of optional fields being left empty: I was thinking that while everything is technically optional, in practice the page header data is basically a given. If you support the extended hash you're going to have data to go in the header data substring. It's an optional field, but only because the whole extended hash is optional, if you see what I mean? If in the future we do want an "extended-extended-hash format" with a lot more new fields then they may need to be given some kind of field IDs, but for now I'm only proposing four new substrings at fixed positions.

Something just using the current hash format will only know about and create the first two substrings: <editor settings and SAA5050 character set>:<base64 encoded page>. An editor that supports an extended hash will have to assume sensible defaults for page number and control bits etc. when loading an old/basic URL hash. My feeling is that such an editor is likely to implement page numbers etc. (else why would it need the extended hash support), and so will add that to its hash as a matter of course. So with my suggested scheme, where the extra substrings are assigned fixed positions, the bare minimum page with the header data but no navigation or enhancements would look like this: <editor settings and SAA5050 character set>:<base64 encoded page>:<page header data>. There's no need to have loads of trailing colons for any optional substrings, as the parser will already have to deal with non-existent substrings to import non-extended hashes.
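A sketch of that import logic (the default values here are my assumptions, purely for illustration):

```js
// Import a hash that may or may not carry the extended substrings.
function importHash(hash) {
    var fields = hash.replace(/^#/, "").split(":");
    var page = {
        metadata: fields[0],
        pageData: fields[1],
        pageNumber: 0x100,  // assumed default page number
        subcode: 0x3F7F     // assumed default subcode
    };
    if (fields.length > 2 && fields[2] !== "") {
        // Parse the page header substring here; its exact format is
        // still under discussion.
    }
    return page;
}
```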

Following my suggestion to place Packet 26 data next, then Packet 27 (navigation) then Packet 28 would mean just a single double colon between the page header data and navigation data for a page with no enhancement packets. To be honest the only reason I propose positioning the extended hash substrings in the order <0>:<26>:<27>:<28> is the obvious OCD pencil straightening one. Going for <0>:<27>:<28>:<26> might be more logical from a frequency of use perspective (it just gives me the irrational urge to start grinding my teeth!).

To keep things to do with character sets in their logical places, I would prefer to do what I alluded to above and more or less leave the present "edit-tf metadata" substring alone. This would then keep its current meaning as an SAA5050 character set when using edit.tf as a 5050 frame editor. For teletext pages, the National Option Subset bits in the page header substring and the relevant bits in a Packet 28 would select the teletext character set to use. In the absence of a Packet 28, or support for it in an editor, the default region would be used. Essentially I'm suggesting following the same priority scheme for determining which character set to use as is set out in the teletext standard for real decoders! An editor creating an extended hash using either just the NOS bits, or the NOS bits and a Packet 28, should try to set the SAA5050 value to something sensible for fallback in software that doesn't support an extended hash, accepting that this is only going to work properly for a few languages.

I agree that before going much further we should check for any limitations on hash length in hyperlinks from different browsers. I did look this up before creating this thread and it seemed like people were saying length was basically not an issue, but I'm sure there must be browser-dependent hard limits on the length of the string somewhere. The page data and navigation data aren't going to add much length to the string, but the Packet 26 enhancement data in particular could potentially add a fair bit. The worst-case scenario won't double the length of the hash, so it depends what order of magnitude of limit we're talking about. Another potential length issue might come from posting links in various places: for example, how long a hash the URL shorteners of Google, Twitter, etc. will tolerate.

I don't think multiple carousel pages belong in the hash as I think that's starting to get out of scope for what the hash is, and as you say it would start to get truly gigantic. I think we should keep the hash as the way of describing a single page.

To explain what I meant by "editorial pages": I was intending to say any sort of page that a human creates in a page editor, intended for humans to read. It wasn't really proper terminology anyway, sorry. The spec calls it a 'Basic Level 1 Teletext page', though that can be slightly confusing because you can add higher-level features with additional packets - it just refers to what packets 1-25 are used for. There can be other kinds of "pages" transmitted in a teletext system which don't contain text and are not intended for display. Obviously these are out of the scope of a page editor, so any kinds of packets that are only relevant to those don't need to be provided for in the hash format.

peterkvt80 commented 7 years ago

About hash lengths. All browsers should be able to manage 2048 characters so this is barely a page. Chrome has a more generous 2MB limit. However, I might bypass the URL altogether. A different way of operating would be for VBIT2 to act as a server with an interface to upload a subpage at a time. It would start up normally by loading files but it would have the option of accepting updates at the line or subpage level. The idea is to get an editor which can load and publish back to VBIT2.

ZXGuesser commented 7 years ago

Pulling pages from a live server and pushing back updates is certainly something I'm interested in too but I wouldn't want to make a connection to a server a requirement. The ability to bookmark the editor and have everything saved in the URL hash is very handy.

ZXGuesser commented 7 years ago

Regarding hash lengths, I've written a quick bit of JavaScript that appends a character to the URL hash every millisecond and reads its length back into an element on the HTML page. I'm at 16,000 characters and still counting on Firefox.
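Something along these lines (a reconstruction, not the original script; it assumes a page element with id "length"):

```js
// Grow the URL hash by one character per millisecond and report the
// current length in the page, to probe the browser's limit.
setInterval(function () {
    location.hash += "A";
    document.getElementById("length").textContent =
        String(location.hash.length);
}, 1);
```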

Edit: Internet explorer craps on everyone's strawberries as usual by limiting it to 2048 characters tsk!

ZXGuesser commented 7 years ago

I learned some time back that as of Windows 10, the address bar character limit on Internet Explorer (and now Edge also) has been raised to the point where it's irrelevant like the other browsers.

Giant ramble ahead - TL;DR: I've started implementing the idea to test what might work, and come up with moderately usable schemes for page header and linking data.

I've been hacking on an experimental implementation of an extended hash, based roughly on my earlier suggestion of additional fixed-position colon-delimited substrings, and finding it hard to weigh up the pros and cons of keeping the data aggressively condensed vs. easy to parse. My current experimental code appends the four extra substrings in order to the hash string as above (packets 0, 26, 27, and 28).

The Packet 0 data is encoded as comma-delimited hexadecimal values for magazine and page number, subpage number, and control bits, with the same bit mask as the .tti format (to ease conversion between tti and hash strings as much as anything). Being delimited makes it easy to parse and saves padding the values to fixed lengths. This is a bit wasteful of space, as the whole lot could be packed into 6 characters (saving 1 to 7 characters) if packed in binary and base64 encoded, but that would be a pain to decode.
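In code it would look something like this sketch (field names are my own; the comma-delimited hex layout is as described above):

```js
// Encode the packet 0 substring: page number, subcode, and .tti-style
// control bit mask as comma-delimited hex.
function encodePacket0(page) {
    return [
        page.pageNumber.toString(16),  // e.g. 0x100 -> "100"
        page.subcode.toString(16),     // e.g. 0x3F7F -> "3f7f"
        page.status.toString(16)       // .tti-style control bits
    ].join(",");
}

function decodePacket0(sub) {
    var parts = sub.split(",");
    return {
        pageNumber: parseInt(parts[0], 16),
        subcode: parseInt(parts[1], 16),
        status: parseInt(parts[2], 16)
    };
}
```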

The Packet 27 substring I'm less happy with. What I've got currently is comma-delimited packets, the first character of each being the designation code 0-F. Only DC 0 is implemented, so there's only one packet and no actual delimiter used, but I kind of want to design the encoding scheme with the possibility of a full implementation in mind. The packet 27/0 then contains (after the designation code) six fixed-width hexadecimal link fields in the format MPPSSSS, and finally the link control flag as 0 or 1. Again this could be reduced in size by encoding the links with relative magazine numbers etc., but for the sake of saving six bytes (or 18 if all base64 encoded together) it really doesn't seem worth all the extra hassle.
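A sketch of writing that substring out (a hypothetical helper following the layout just described):

```js
// Encode the packet 27/0 substring: designation code, six MPPSSSS
// links (magazine, page, subcode as fixed-width hex), then the link
// control flag as 0 or 1.
function encodeX270(links, controlFlag) {
    var s = "0";  // designation code 0
    for (var i = 0; i < 6; i++) {
        var l = links[i];
        s += l.magazine.toString(16);                     // M
        s += ("0" + l.page.toString(16)).slice(-2);       // PP
        s += ("000" + l.subcode.toString(16)).slice(-4);  // SSSS
    }
    return s + (controlFlag ? "1" : "0");
}
```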

I've not done anything with packets 26 and 28 yet, though I anticipate them being similar comma-separated packets. I feel like the header and linking packets are the priority to figure out, as these are presumably all edit.tf and wxted would be implementing initially. The other substrings would just have to be stored and written back unchanged, or left blank for new pages.

I envisage that an implementation would split the hash string into an array of the colon-delimited substrings, however many there are, keep this around and make modifications in the array, then write the whole lot back out colon-delimited when updating the hash. That way it's not hard-coded to the substrings that it knows, and won't lose data when updating any hypothetical future extension that it doesn't know about. The same goes for the CSV 'sub-sub-strings' in the packet 27 portion: I would split into an array and locate the element with designation code 0 (which is likely to be the only one) to read and modify, then write the array elements back, delimited by commas again, into the hash string. It's the logistics of picking out the data to read or modify and storing the rest of the string intact to write back later that I'm struggling with a bit, I think. I want to make it not too onerous on the application, but without bloating the URL up with too much formatting.
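As a sketch of that round-trip idea (the header slot index here is an assumption):

```js
// Update one known substring while preserving everything else intact,
// including any future fields this editor doesn't understand.
function updateHeaderField(hash, newHeaderSub) {
    var fields = hash.replace(/^#/, "").split(":");
    while (fields.length < 3) fields.push("");  // pad to the header slot
    fields[2] = newHeaderSub;                   // assumed fixed position
    return "#" + fields.join(":");
}
```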

pjfdirect commented 7 years ago

Couple of cheerleading notes (assuming I get it):

As a new user I rarely used TTI but I grew to appreciate there being a text file format I could dig into when necessary, particularly for bits of Teefax admin and appending enhancement packets. Here's one vote for keeping it aboard as the URI becomes more sophisticated. Even if Simon and Peter don't chase the enhancements as rapidly they can probably at least ensure the editors don't hiccup when they get a swig of the future juice.

I'm looking forward to smoothing out the jawlines in my 'Oranges' artwork with a few bonus sixels in enhancement packets. Am I misunderstanding or does this 'unlimited' URI thing dovetail amazingly with the spec saying 'there isn't a limit to the number of enhancement packets?'

ZXGuesser commented 7 years ago

Oh, tti import and export will definitely be staying around in my editor - it's the community's de facto page file storage standard now! An extended hash just enables the same simple sharing and bookmarking of extra data that the basic pages currently enjoy. My aim, as you can probably see, is to devise a format where an application doesn't have to know or care about the encodings of every portion of data to safely modify it, so that support for features can be added as and when required, and above all to be backwards compatible with edit.tf :)

There is a limit to enhancements though. I couldn't tell you offhand exactly what it is when including external object pages, but in terms of what can be stored in a single page and thus is relevant to an extended hash, the limit is either 15 or 16 X/26 packets so there would be a maximum length for an absolutely fully stuffed extended hash string.
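For a rough upper bound (my own back-of-envelope figure, ignoring any per-packet framing in the hash): 16 packets × 13 triplets × 18 bits = 3744 bits, which is 624 base64 characters for a completely full set of X/26 packets - comfortably within the lengths measured above.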

ZXGuesser commented 7 years ago

Discussing this elsewhere with @rawles has convinced me that trading url shortness for simplicity is the way to go, and that an extended hash should use key=value pairs.

I realise that I've previously been posting "here is what the data I want to store in a hash string looks like" rather than "here is how we could store additional data in a hash string". I want to stress I'm not intending to push any particular vision for a rigid format extension that just suits features I want to add to my editor, though of course that's mostly where my thoughts lie when considering things. My desire is for a hash format that can be extended to contain any data that would be useful to teletext tool authors to store alongside the basic level 1 page hash, in a way that keeps tools and editors interoperable without discarding any data :)

This then is what I now propose and I'll try to keep it (relatively) short and light on teletext packet implementation musings for once... :)

The extended page hash would consist of colon-delimited substrings as before. I still think that leaving the first two substrings of the hash unchanged from the existing format for backwards compatibility is desirable, and see no real benefit to changing them for change's sake, so the first and second substrings, the "SAA505x character set" and "main page data" fields respectively, would remain just as they are now, as fixed fields. Following that would be any number of colon-separated key=value pairs. This is a totally open-ended design, where as many or as few can be appended as needed and there's no need to include any which contain no data.

Key names should be short to save space but still relevant. Perhaps use PN, SC, and PS for page number, subcode, and page status, since that's the existing convention in .tti files. For pairs representing additional teletext packet data perhaps something like X26 (for all the packet 26 related data), X270 (for fasttext link data), etc. to follow the nomenclature of the teletext specification.

Everything I've said before about reading all the values present into a data structure, modifying or adding fields as needed there, and writing everything back out to a new hash string remains the same; it just would no longer matter what order and position the fields (beyond the 'original two') come in.

I did think of one small wrinkle with the colon delimiter, whatever format we use. @rawles mentioned the potential for key-value pairs containing URL references, which is something I'd also been considering for a vague distant-future idea. I'm sure any extended key-value pair that would contain a colon can be worked around easily with standard URL "percent encoding". It's just a hidden "gotcha" worth remembering when implementing code to write or parse a field that it needs to do escaping/de-escaping.
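A sketch of the whole scheme, percent encoding included (the key names follow the suggestions above; the helper names are mine):

```js
// Fields 0 and 1 stay fixed for backwards compatibility; everything
// after is KEY=value. Values are percent-encoded so embedded ":" or
// "=" characters can't break the delimiting.
function parseKeyValueHash(hash) {
    var fields = hash.replace(/^#/, "").split(":");
    var page = { metadata: fields[0], pageData: fields[1], keys: {} };
    for (var i = 2; i < fields.length; i++) {
        var eq = fields[i].indexOf("=");
        if (eq > 0) {
            page.keys[fields[i].slice(0, eq)] =
                decodeURIComponent(fields[i].slice(eq + 1));
        }
    }
    return page;
}

function buildKeyValueHash(page) {
    var fields = [page.metadata, page.pageData];
    for (var key in page.keys) {
        fields.push(key + "=" + encodeURIComponent(page.keys[key]));
    }
    return "#" + fields.join(":");
}
```

So, for example, setting page.keys.PN = "100" and rebuilding would append :PN=100 to the hash, and an editor that doesn't understand PN just carries it through unchanged.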

If we agree on this as a basic structure for extending hash strings, and all implement the minimum framework to preserve data through import/export in anything that processes hash strings, then deciding what key-value pairs to create and their individual formats is an open-ended process, and there's no requirement for anyone to implement any of the additional keys if they're not relevant to their software.

From there we can just discuss between whoever has an interest in any particular type of data how those fields can be formatted to best suit the people implementing them.

... so much for keeping it short, doh! ;)

ZXGuesser commented 7 years ago

In the absence of dissent, I've gone ahead and started implementing extended hashes: as a currently hidden feature in my editor, and in edit-tf in the issue-60 branch.

The format is as in my previous post for key=value pairs. I have defined the following keys: PN, SC, and PS for the page number, subcode, and page status (following the .tti conventions), and X270 for the fasttext link data.

I've not begun implementing packet 26 enhancement packets yet, but I will almost certainly use X26.

I can write this all up as a proper, nicely formatted specification for the hash format and extensions. I guess it should live either in the edit-tf source, or on the wiki here on GitHub, so that further key=value pairs can be documented as they arise?

rawles commented 7 years ago

While edit-tf will preserve any key-value pairs it encounters, I don't see it using this additional data for a while yet, so I don't think it's appropriate that the specification's main home is the edit-tf source. If you'd like to use the wiki of this repository, feel free, but maybe since your editor is the only one producing these extended hashes for now, you could prepare and store documentation for it along with your editor and have that as the main home of the spec.

ZXGuesser commented 7 years ago

I can certainly do that if you like, though the patch I've written for edit-tf, in addition to preserving the values, processes the PN, SC, PS, and X270 data through to the TTI export, since that was all already implemented :)

Clarification: only reading them when it encounters them, not creating them.

rawles commented 7 years ago

Thanks for doing that! I'll merge that in when I can.