metaeducation / ren-c

Library for embedding a Rebol interpreter into C codebases
GNU Lesser General Public License v3.0

ENBIN + DEBIN draft/sketch implementation #1061

Closed hostilefork closed 4 years ago

hostilefork commented 4 years ago

Rebol2 had an asymmetrical sense of conversion for BINARY! and INTEGER!

rebol2>> to binary! 32
== #{3332}  ; #{33} is ASCII "3", #{32} is ASCII "2"

rebol2>> to integer! #{3332}
== 13106  ; 0x3332 is the internal big endian form of 13106
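For readers more at home outside Rebol, the asymmetry can be reproduced with Python's built-in conversions. This is only an analogy for the Rebol2 behavior shown above, not Ren-C code, and the variable names are invented for illustration.

```python
# Analogy for Rebol2's asymmetric conversions (not Ren-C code).

# TO BINARY! 32 behaved like "form the number as text, then take its bytes".
as_text_bytes = str(32).encode("ascii")
print(as_text_bytes.hex())  # 3332

# TO INTEGER! #{3332} instead read the bytes as a big-endian number.
as_number = int.from_bytes(bytes.fromhex("3332"), byteorder="big")
print(as_number)  # 13106
```

The two directions do not round-trip: 32 goes in, 13106 comes out.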

R3-Alpha "corrected" this unusual behavior by changing TO BINARY!. The conventional wisdom seemed to be that TO BINARY! TO STRING! 32 was not a common desire, but easy enough to express if you wanted it. The harder conversion for users (which was "easy" for Rebol to do) involved the internal byte representations of native C integers:

r3-alpha>> to binary! 32
== #{0000000000000020}  ; 0x20 is 32 in hexadecimal

r3-alpha>> to integer! #{0000000000000020}
== 32
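The R3-Alpha behavior corresponds roughly to a fixed 8-byte, signed, big-endian encoding. A minimal Python sketch of that equivalence, assuming the width and signedness described in this write-up (the names are illustrative):

```python
# Analogy for R3-Alpha's fixed-width conversion (not Ren-C code).
r3_style = (32).to_bytes(8, byteorder="big", signed=True)
print(r3_style.hex())  # 0000000000000020

# Here the reverse direction does round-trip.
round_trip = int.from_bytes(r3_style, byteorder="big", signed=True)
print(round_trip)  # 32
```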

While this might seem more sensible on the surface, it was awkward for users to get the INTEGER! <-> BINARY! conversions they wanted...where details of signedness, byte-size, and byte ordering vary considerably. Users might want #{FF00} to mean little-endian 255, big-endian unsigned 65280, or big-endian signed -256. Getting these results from a fixed 8-byte signed big-endian conversion was hard and error-prone.
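The three readings of #{FF00} can be checked mechanically. The sketch below uses Python's `int.from_bytes` purely as a calculator for the arithmetic; it says nothing about how Ren-C implements any of this.

```python
# Three interpretations of the same two bytes, FF 00.
raw = bytes.fromhex("FF00")

le_unsigned = int.from_bytes(raw, "little", signed=False)
be_unsigned = int.from_bytes(raw, "big", signed=False)
be_signed = int.from_bytes(raw, "big", signed=True)

print(le_unsigned)  # 255
print(be_unsigned)  # 65280
print(be_signed)    # -256
```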

An added concern is that Ren-C's philosophy is that a futureproof language at Rebol's level of concern should always work with integers that can scale to arbitrary precision. The goal is to finesse the nature of immutable and mutable INTEGER! in the design so that when immutable integers fit in a value cell, they can be operated on efficiently...then only mutable integers would need BigNum "nodes" pointed to by the cell. If care is taken to make immutability the common case, this could provide the best of both worlds.

Yet when Ren-C tried out an unusual way to make TO BINARY! of an INTEGER! return a variable number of bytes, that didn't wind up helping much. This commit introduces two new tools--tentatively called ENBIN and DEBIN--which take a BLOCK! describing the encoding or decoding of a binary to integer that is desired:

>> enbin [be + 4] 32
== #{00000020}  ; big-endian, 4 byte, unsigned

>> enbin [le + 2] 32
== #{2000}  ; little-endian, 2 byte, unsigned

>> enbin [le +/- 3] -2
== #{FEFFFF}  ; little-endian, 3 byte, signed

>> enbin [le + 3] 16777214
== #{FEFFFF}  ; little-endian, 3 byte, this time unsigned

>> enbin [le +/- 3] 16777214
** Error: 16777214 aliases a negative value with signed
       encoding of only 3 bytes

DEBIN is ENBIN's complement which uses the same dialect but goes the other way:

>> debin [le + 3] #{FEFFFF}
== 16777214  ; reverse of above example
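To make the semantics of the examples concrete, here is a rough Python analogue of the encoding direction. The `enbin` function and its tuple-shaped spec are hypothetical stand-ins for the dialect block, and Python's `OverflowError` stands in for the error shown above; none of this is the actual implementation in this commit.

```python
# Hypothetical Python analogue of ENBIN (not the Ren-C implementation).
# spec is (endian, signed, size), mirroring [le +/- 3] as ("le", True, 3).
def enbin(spec, value):
    endian, signed, size = spec
    byteorder = {"be": "big", "le": "little"}[endian]
    # to_bytes raises OverflowError if value does not fit the requested width.
    return value.to_bytes(size, byteorder=byteorder, signed=signed)

print(enbin(("be", False, 4), 32).hex())       # 00000020
print(enbin(("le", False, 2), 32).hex())       # 2000
print(enbin(("le", True, 3), -2).hex())        # feffff
print(enbin(("le", False, 3), 16777214).hex()) # feffff

try:
    enbin(("le", True, 3), 16777214)
except OverflowError as error:
    # 16777214 is outside the signed 3-byte range, so encoding is refused.
    print("refused:", error)
```

The signed and unsigned encodings of different numbers can produce identical bytes, which is why the decoding side needs the same spec to recover the intended value.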

The dialect choice was made with the endianness first to possibly suggest compatibility with WORD!-based codecs and ENCODE/DECODE if they took a block (encode/decode are nicer words than enbin/debin, but this is just a test for now so not conflating them). The sign is in the middle because unlike ENBIN, DEBIN can guess the size accurately from the length of the input.

This is a prelude to possibly walking back the TO BINARY! of an integer semantics to Rebol2 and making TO INTEGER! of a BINARY! reverse that.

https://forum.rebol.info/t/1270

hostilefork commented 4 years ago

I hope everyone is doing well (!)

While it's hard to tear one's attention away from the news, this is somewhat more comforting I think: it's a sketch of a dialect for making a common annoying task significantly less of a pain.

While I think the tool is rather important, I'm not in love with the specifics of the dialect...so I really would like ideas. I just didn't want to have this be a wordy refinement-based thing that got confusing too, e.g.:

 enbin/num-bytes 8 8  ; which is the value, and which is the number of bytes?

Plus for starters, I wanted to avoid defaults. I don't really know that I've seen anything suggesting that 8 is a good default for how many bytes to encode something to as a binary (least common, really)...and big endian and little endian usages both seem to occur about equally in Rebol code. Unsigned--however--seems a lot more common than signed for most tasks.

My choice to use a PATH! for + vs. +/- was kind of "cutesy". But my first idea was to use - for "maybe negative" and I found it less communicative.

Originally I put the sign in the third slot after the size, so it didn't look like an addition: enbin [be 4 +] as opposed to enbin [be + 4]. But then I realized DEBIN could infer the size (if you want it to) by the number of bytes in the binary input. So I shuffled it to make it so that if you dropped the size the signedness was still in the same place.
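The ordering argument is easier to see with the size made optional. In this hypothetical Python analogue of the decoding direction, the spec is a tuple whose third slot may be omitted, in which case the size is taken from the input. The function name, the tuple shape, and the choice to raise `ValueError` on a mismatch are all assumptions for illustration.

```python
# Hypothetical Python analogue of DEBIN with an optional size slot.
# spec is (endian, signed) or (endian, signed, size).
def debin(spec, data):
    endian, signed, *rest = spec
    byteorder = {"be": "big", "le": "little"}[endian]
    # With no size given, infer it from the input length.
    size = rest[0] if rest else len(data)
    if size != len(data):
        raise ValueError(f"expected {size} bytes, got {len(data)}")
    return int.from_bytes(data, byteorder=byteorder, signed=signed)

data = bytes.fromhex("FEFFFF")
print(debin(("le", False, 3), data))  # 16777214
print(debin(("le", False), data))     # 16777214, size inferred
print(debin(("le", True), data))      # -2
```

Because the size is the trailing slot, dropping it leaves the endianness and signedness in the same positions.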

Of course, the dialect could be unordered...though as I mention, I'm wondering if this is a step toward how codecs might be parameterized. If ENCODE is the means by which Rebol "things" are turned to BINARY! and DECODE turns them back, this looks like it could be the job of a "codec". Why not have the "big endian" and "little endian" codecs? If so, it seems the codec name should probably be in the first slot by convention.

(Note: The dialect implicitly COMPOSEs...so any GROUP!s will be calculated. This helps when any of the things are in variables, like the number of bytes. :-/ It's one of the disadvantages of using dialects instead of ordinary calculated parameters and refinements, to have to figure out whether to sacrifice parentheses for this purpose.)

Anyway, my goal today was to make a working implementation to clean up some junk. Just because it's used in the system now instead of that junk doesn't mean you have to use it yet. I've reimplemented the old TO-INTEGER/UNSIGNED in terms of DEBIN so the shared logic isn't repeated, and nothing has changed yet.

But I'd encourage everyone to weigh in. Since it could likely change, try abstracting it... look what I did in rebzip to mitigate risk... But it shows how getting this right will really help.
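As a generic illustration of the "abstract it" advice (not a copy of what rebzip does), call sites can go through small named wrappers so that a change to the underlying conversion API only has to be absorbed in one place. The wrapper names below are hypothetical, and Python stands in for the real code.

```python
# Hypothetical wrappers: callers use these names, so if the underlying
# encoding API changes, only these few functions need to be updated.
def encode_u16_le(value):
    return value.to_bytes(2, byteorder="little", signed=False)

def encode_u32_le(value):
    return value.to_bytes(4, byteorder="little", signed=False)

def decode_u32_le(data):
    if len(data) != 4:
        raise ValueError(f"expected 4 bytes, got {len(data)}")
    return int.from_bytes(data, byteorder="little", signed=False)

# Example: the ZIP local file header signature, stored little-endian.
signature = encode_u32_le(0x04034B50)
print(signature.hex())           # 504b0304
print(decode_u32_le(signature))  # 67324752
```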

If people chipped in on the forum post I'd also appreciate that. Something to do while indoors. :-)

Be safe!

NOTE: I'M PUSHING THIS, BUT PLEASE GIVE FEEDBACK AS YOU CAN. EXPECT IT TO CHANGE. IF YOU START USING IT, CREATE THOSE WRAPPERS LIKE I SUGGEST SO COPING WITH CHANGES WILL BE EASIER.