Open ksdlck opened 12 years ago
To stay neutral, the syntax extensions needn't be limited to binary; they could cover strings as well. Essentially what I'm looking for is a kind of structured assignment where both the left-hand and right-hand sides can incorporate identifiers and metadata. Put another way, this syntax is a sort of quasiquoting: the preprocessing function can return some number of identifiers to be initialized or modified in the current context, in addition to operating on identifiers already in that context.
Here is a proposal for how I'd ideally like the new syntax to look; it could be generalized beyond regexp-style destructuring for strings and Erlang-style parsing and packing for binary:
# binary
<<timestamp: 4, len: 2, payload>> = apnsData
apnsData = <<timestamp: 4, len: 2, payload>>
# string
/prefix(?middle:[a-z]+)suffix/ = pfxSfxData
And of course, String doesn't get a corresponding packing form, because we already have that in string interpolation.
How are those examples supposed to desugar? I can sort of see the second one being:
ref$ = /([^a-z]*)([a-z]+)(.*)/.exec(pfxSfxData)
prefix = ref$[1]
middle = ref$[2]
suffix = ref$[3]
But there's a lot of room for interpretation.
@qqueue, that's about right; notice, though, that "prefix" and "suffix" are just character sequences that match the data, where "middle" is an identifier for a capturing group. I'd recommend use of XRegExp to implement this functionality, but the desugar is possible with plain RegExp as well. Here's how the desugar might look:
var pfxSfxData = 'prefixsomestuffinthemiddlesuffix';
var ref$ = XRegExp.exec(pfxSfxData, XRegExp('prefix(?<middle>[a-z]+)suffix'));
var middle = ref$.middle;
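For what it's worth, the same desugar also works without XRegExp on engines with native named capture groups (these were only standardized much later, in ES2018, so XRegExp was the practical option at the time); a sketch:

```javascript
// Same desugar without XRegExp, using native named capture groups
// (ES2018+); the sample string is the one from the comment above:
var pfxSfxData = 'prefixsomestuffinthemiddlesuffix';
var ref$ = /prefix(?<middle>[a-z]+)suffix/.exec(pfxSfxData);
var middle = ref$ && ref$.groups.middle; // "somestuffinthemiddle"
```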
All of the top-level capturing groups are declared as necessary in the current context and assigned. To support Coco's existing differentiation between declaration-and-assignment v. plain assignment (i.e. = v. :=), perhaps an analogous syntax could be used, e.g.:
/prefix(?middle=[a-z]+)suffix/ = pfxSfxData # declaration if not declared in the current scope
/prefix(?middle:=[a-z]+)suffix/ = pfxSfxData # only assignment
The binary example uses length-based parsing and packing, so might look something like:
// <<timestamp: 4, len: 2, payload>> = apnsData
var ref$ = new bitjs.io.BitStream(apnsData, true, 0, apnsData.length);
var timestamp = ref$.readBits(4);
var len = ref$.readBits(2);
var payload = ref$.readBits(apnsData.length * 8 - 6);
The previous uses the bitjs library I mentioned in the first post; unfortunately the library doesn't have functionality for writing a bitstream, but I'm sure you get the idea of how it would be done.
We don't want to pull in an entire library just to enable some sugar. However, the regex proposal is essentially just named capturing groups, right? Thus, it could desugar like this, without needing a library:
/prefix(?middle:[a-z]+)suffix/ = prefixSfxData
# =>
[,middle] = /prefix([a-z]+)suffix/.exec(prefixSfxData)
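Run as plain JS (the array destructuring is ES2015), that desugar behaves like this sketch; the sample string is invented, and note .exec would return null on no match:

```javascript
// The proposed library-free desugar:
// [,middle] = /prefix([a-z]+)suffix/.exec(prefixSfxData)
var prefixSfxData = 'prefixabcsuffix';
var [, middle] = /prefix([a-z]+)suffix/.exec(prefixSfxData);
// middle is "abc"; index 0 (the full match) is skipped
```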
Doesn't seem too unreasonable. Concerns:
- Handling .exec(str) returning null for no match somehow.
- Maybe a str =~ /regex/ syntax instead, since if /regex/ = str looks weird.
The binary packing/unpacking seems a bit harder to do simply, though. Looking at bitjs, there's a lot of code there. Also, you can get pretty close to your proposed syntax already with destructuring and a hypothetical library that unpacks to objects/arrays, e.g.:
{timestamp, len, payload} = myBitLibrary.unpacker('timestamp: 4, len: 2, payload')(apnsdata)
# or
[timestamp, len, payload] = myBitLibrary.unpacker(4, 2, '...')(apnsdata)
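myBitLibrary is hypothetical, but such an unpacker is small to sketch. Assuming fields are taken from the most-significant bits of a single number, and using a plain object spec rather than the string spec above:

```javascript
// Minimal sketch of a hypothetical unpacker: spec maps field names to
// bit widths, iterated in insertion order, MSB-first.
function unpacker(spec) {
  var names = Object.keys(spec);
  var total = names.reduce(function (n, k) { return n + spec[k]; }, 0);
  return function (word) {
    var out = {}, shift = total;
    names.forEach(function (name) {
      shift -= spec[name];
      out[name] = (word >>> shift) & ((1 << spec[name]) - 1);
    });
    return out;
  };
}

var unpack = unpacker({ timestamp: 4, len: 2, payload: 2 });
var fields = unpack(0xB6); // 1011 01 10 -> { timestamp: 11, len: 1, payload: 2 }
```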
Without either a simple compilation strategy or massive performance/readability gains, the binary syntax is a tougher sell.
Yep, that regex desugar seems fine (minus the usual undiscovered problems, of course ;). Sort of a double desugar too, which is fun. I think checking for a match could be either checking that one of the named variables is not undefined, checking that the result of the statement is not null/false/undefined, or both, leaving it up to the programmer to determine which makes sense in context. I do like the way it looks now, though; it's good to keep assignment going from right to left, as it does everywhere else.
Regarding the binary stuff, I agree, it's a bit more difficult, not only to do generally speaking, but specifically to do cross-platform. What might be reasonable would be a more generic syntax for this sort of stuff. Maybe what I've really been thinking about is Coco macros.
Here's a more complete desugar proposal for "regex destructuring":
/prefix(?middle:[a-z]+)suffix/ = prefixSfxData
var middle, ref$;
(ref$ = /prefix([a-z]+)suffix/.exec(prefixSfxData), ref$ !== null && (middle = ref$[1]), ref$);
Like the other destructuring syntaxes, the result of /regex/ = expression is the return value of /regex/.exec(expression). This allows if /regex/ = str to work as expected. Assignment is also guarded against a null reference.
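Wrapped in a function for illustration (the test strings are invented), the guarded desugar behaves like this; note the capture group is at index 1, since index 0 is the whole match:

```javascript
// The guarded desugar, exercised on both the match and no-match branches:
function destructure(str) {
  var middle, ref$;
  (ref$ = /prefix([a-z]+)suffix/.exec(str), ref$ !== null && (middle = ref$[1]), ref$);
  return middle;
}
destructure('prefixabcsuffix'); // "abc"
destructure('no match here');   // undefined, thanks to the null guard
```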
You make a good point that this syntax mirrors the other destructuring assignments better than =~ would. However, I can still see =~ being useful when you don't need named capturing groups, or when the regex is defined elsewhere, e.g.
regex = build-complicated-regex!
if str =~ regex
stuff ...
In other words, str =~ regex is simple sugar for str.match(regex). That syntax probably warrants a separate proposal, though.
Re: the binary syntax, the general case would indeed be difficult, since node has buffers and typed arrays. Any sort of syntax-level binary packing/unpacking in coco probably needs a matching, standardized, widely-used binary literal in vanilla JS before being feasible.
We'll have to see whether binary literals in JS or macros in coco happen first :) Meanwhile, maybe you could try sweetjs.
Definitely, this proposal and the proposal for =~ are quite separate. Of course, if you're trying to match against a string literal, you can just replace your quotes with // for the same effect using the suggested syntax--beware of literal +, *, ., etc. that have regex significance, of course.
I think a binary syntax is quite feasible using runtime platform detection. For performance reasons, node users will prefer buffers to typed arrays, but including two function bodies instead of one is not a major issue.
Feasible, yes, but it also needs to be fast; since you're trying to unpack binary bits, I imagine speed is of the essence. If a <<width: 2, height: 2>> = data syntax exists, it needs to be as fast as:
var width, height, ref$
ref$ = data
width = ref$ & 3
height = ref$ >>> 2 & 3
or nobody will want to use it. Runtime detection of array types certainly won't help much there.
However, the above snippet shows that if you can assume to be operating on 32 bits or less (as a Number), compilation without helper functions is feasible. You lose some flexibility with non-aligned unpacking, but it's certainly faster and easier to implement.
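Concretely, the compiled form above on an invented value, with fields taken from the low-order bits first as in the snippet:

```javascript
// width:2, height:2 unpacked from the low bits of a plain Number:
var data = 14;               // binary 1110: height bits = 11, width bits = 10
var width  = data & 3;       // 2
var height = data >>> 2 & 3; // 3
```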
If we have foreknowledge of the specifier lengths, e.g. your <<width: 2, height: 2>>, where the bit-widths of the two components are fixed, then the compiler can optimize this case. When the widths become variable, the compiler has to fall back to a more general solution. Ideally, this shouldn't duplicate much effort, because the general algorithm can essentially be a collection of the special-cased algorithms, with runtime checks for the same things that would otherwise be checked at compile time.
I think the real issue here is that we want to support both typed arrays and Buffers, any thoughts on this?
The only binary unpacking I've ever done with JS are the packed bytes of the GIF file format, which are byte-aligned, fixed-width fields. Do you come across variable width and/or non-aligned bit-level unpacking that often?
After more research, I found out that both typed arrays and Buffers support regular array-like indexing (buffer[0], array[1]), so I actually don't think run-time detection is necessary, as long as you can fit your packed fields into 32 aligned bits.
Agreed; that sounds like a good solution. Can we always count on typed arrays/buffers having a 32-bit word size, or is it system dependent?
Node Buffers are indexed by bytes, and ArrayBuffers are indexed by whatever ArrayBufferView is used; Uint8Array would match Buffer behaviour there.
What I was kind of imagining for this syntax is you'd use "vanilla" JS to get to the 32 or fewer bits you want to unpack, and then use new syntax to sugar over the needed bit-shifts, e.g:
# gif-data is a Uint8Array or Buffer
<<gct: 1, resolution: 3, sort: 1, gct-size>> = gif-data[10]
<<larger: 10, packed: 10, field: 12>> = buf.readUInt32BE(offset)
# or the same with a Uint32Array view.
These would compile to bit-shifts and masking, assuming 32 bit, big endian.
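One possible compilation of the GIF example, assuming fields are taken from the most-significant bits first (0xF7 is an invented packed byte; note the earlier width/height sketch took fields from the low bits instead, so bit order is a design decision the syntax would have to pin down):

```javascript
// <<gct: 1, resolution: 3, sort: 1, gct-size>> = gif-data[10], MSB-first:
var packed = 0xF7;                 // stand-in for gif-data[10]
var gct        = packed >>> 7 & 1; // 1
var resolution = packed >>> 4 & 7; // 7
var sort       = packed >>> 3 & 1; // 0
var gctSize    = packed & 7;       // 7
```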
After looking back at the Erlang bit syntax documentation, I can see the appeal of unpacking more than just 32 bits, though, e.g. for IP packets. That would then have to make assumptions about array element bit-length, endianness, and so on, which is more difficult to compile efficiently. If we can assume big endian (the JS Number default) and 8-bit elements, you can still compile to bit operations, but it's less flexible/efficient. One solution is to specify the sub-element length in bits somewhere in the syntax:
# assume 32 bit unsigned fields
<32><<IP_VERSION:4, HLen:4, SrvcType:8, TotLen:16,
ID:16, Flgs:3, FragOff:13,
TTL:8, Proto:8, HdrChkSum:16,
SrcIP:32,
DestIP:32, RestDgram/binary>> = data
#=>
IP_VERSION = data[0] & 15
# and so on
With a default of 8 to make Buffers "just work". Pretty ugly looking, though.
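A sketch of the desugar for the first 32-bit word, assuming network (big-endian) field order within each Uint32Array element, i.e. fields come off the most-significant bits first; the header values are invented:

```javascript
// First word of the IP header example, unpacked MSB-first:
var data = new Uint32Array(1);
data[0] = (4 << 28) | (5 << 24) | 1500; // version 4, header length 5, total length 1500
var IP_VERSION = data[0] >>> 28;        // 4
var HLen       = data[0] >>> 24 & 15;   // 5
var SrvcType   = data[0] >>> 16 & 255;  // 0
var TotLen     = data[0] & 0xFFFF;      // 1500
```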
Another concern is how to efficiently make arrays out of the RestDgram part. You can apply a regular [].slice to an ArrayBufferView, but that transforms all the data into a less efficient plain JS Array. The performant way would be to call .subarray(start[, end]) on typed-array views, or .slice(start[, end]) on Buffers. More runtime detection or kludges would be needed to work around all of this, unfortunately.
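A minimal sketch of the kind of runtime detection this would take, assuming only that typed-array views have .subarray (a view, no copy) and that Node Buffers have .slice (the function name rest is invented):

```javascript
// Take the remaining elements from offset `start` without converting
// to a plain JS Array, picking whichever method the object supports:
function rest(buf, start) {
  return typeof buf.subarray === 'function' ? buf.subarray(start) : buf.slice(start);
}

var tail = rest(new Uint8Array([1, 2, 3, 4]), 2); // view of [3, 4]
```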
I've personally only ever dealt with ArrayBuffers in browser JS. Do you know if there are any moves to standardize the Node/browser binary formats into a single API, even as just a proposal in ES6/ESnext? Dealing with only ArrayBuffers or only Buffers would make this more feasible.
Regarding the future: http://blog.nodejs.org/2012/06/25/node-v0-8-0/. That said, Buffers are pretty ingrained in node's source as of now, so we'll see whether that transition happens sooner rather than later. This feature could be implemented so as to support both Buffers and typed arrays as described thus far, with the Buffer code pruned later; or it could be implemented just for typed arrays and become more useful as node transitions.
Though I know it is likely to involve considerable effort, one of my chief frustrations of late with JavaScript v. Erlang has been JavaScript's lack of reasonable binary parsing/packing functionality. Coco has been a great pleasure to use in numerous respects, and it would be wonderful to see it forge ahead of most of the language space in this regard.
Check out the Erlang bitstring docs here for an example: http://www.erlang.org/doc/programming_examples/bit_syntax.html
Though the particular sugar used may not suit Coco's grammar, I suspect a reasonable substitute could be found. Additionally, there are some JavaScript libraries around for working with binary, e.g. https://github.com/substack/node-binary and http://code.google.com/p/bitjs/, which might form the basis of the builtins needed to back the syntax. The former is node-specific; the latter is not. My supposition is that using Buffers and BufferLists in node would be desirable for performance, and several implementations could be included inline, with the necessary platform detection and preference logic to pick one. Clearly a single implementation that works across all platforms is better than none, and may be acceptable performance-wise.