jneen / parsimmon

A monadic LL(infinity) parser combinator library for javascript
MIT License
1.24k stars 126 forks source link

Loosen parse input constraint. #224

Closed theqabalist closed 6 years ago

theqabalist commented 6 years ago

Would it be possible to loosen the parse constraint to include arrays and buffers so that Parsimmon can be used for more generalized parsing? I have used it for several things before but I am needing to reimplement a proprietary binary protocol stored in a buffer. Best I can tell, the library will just work assuming the introduction of a few byte oriented primitives I have written. But the constraint to strings specifically stops me from using an unforked version currently.

anko commented 6 years ago

Use .toString()?

Is your use-case really significantly optimised by avoiding the conversion?

theqabalist commented 6 years ago

Using toString seems like a completely inappropriate approach.

new Buffer([0x1, 0x2, 0x3]).toString()
//yields '\u0001\u0002\u0003'

is not something that is going to yield an easily matchable string. Additionally, while the data stream is byte aligned, it is bit packed. I need to be able to match and work against Numbers/bytes, not against strings where I would have to convert them back to bytes or Numbers to do bitwise operations.

In my specific use case, I have already written an example parser using attoparsec in haskell, but that was for my own edification. I realized then that Parsimmon doesn't allow for working against binary data using the mechanisms built into node. The use case for monadic parser is that the binary payload switches after the control byte, which needs to be masked against 0x10, 0x20, and 0x40. On one of these, if they match, there are two other control bits being used to determine further shape.

wavebeem commented 6 years ago

If JavaScript had UTF-8 strings it probably wouldn't be a big deal to just pretend the buffer is a string, but given that JS strings are UTF-16, the String class is really gonna work against you by providing "char codes" back rather than the underlying bytes:

> b = new Buffer([0x01, 0xff, 0x88, 0xf0]);
<Buffer 01 ff 88 f0>
> s = b.toString();
'\u0001���'
> for (i = 0; i < s.length; i++) {
...   console.log(s.charCodeAt(i))
... }
1
65533
65533
65533

Have you looked into how much of Parsimmon needs to be changed to provide basic support for Buffers? I've never really done much with Buffers before. I know currently I added a type check against String in .parse to help people avoid errors, so that would need to be modified at least.

Buffer supports .slice but does not support .charAt so that would need to change at least. Not to mention a revisit of the docs, and maybe some messing with assumptions in the library about the input being a string?

You mentioned a fork. Do you have a fork that adds support for this? I'd be interested in seeing the diff.

theqabalist commented 6 years ago

I forked it today and basically just increased the type check to include buffer. Best I can tell, that's all you really need if you're willing to create some rudimentary parsers like anyByte or byte, etc. I didn't actually make an entire parser with it, but I did verify I was able to do something like consume bytes from the buffer if the first byte indicates length.

To make the library treat byte or binary parsing as a first class citizen, though, would mean providing these primitive constructors in a way that doesn't confuse them with the string oriented constructors. Hence why I tried to make a suggestion to just loosen the constraint vs asking to "support buffers" fully.

One other question that I'm unsure of is things like .times return an array of bytes rather than a buffer, which I am unclear what the larger scale ramifications of that are.

wavebeem commented 6 years ago

seq(a, b, c).map(Buffer.from) will turn the array of bytes back into a Buffer. But it might be worth having a seqBuffer function.

I made a side branch where I'm exploring the idea of adding a Parsimmon.byte parser creator, and some extra stuff for error messaging and accepting Buffers.

https://github.com/jneen/parsimmon/tree/topic/buffer-support

Can you try it out and let me know if it's useful for making a parser?

Here's the diff: https://github.com/jneen/parsimmon/compare/topic/buffer-support?expand=1

theqabalist commented 6 years ago

Sure, I can pull it down and try to port my attoparsec implementation as its partial, and see what I run into.

theqabalist commented 6 years ago

So I have built a decent size parser that seems to work thus far with the changes you have made. The only thing I have needed to implement was a uintBE parser, which I implemented as follows.

const uintBE = n => new Parsimmon((input, i) => {
    if (!Buffer.isBuffer(input)) {
        throw new Error('Input is not a buffer.');
    }
    if (i + n >= input.length) {
        return Parsimmon.makeFailure(i, `${n} bytes`);
    }
    return Parsimmon.makeSuccess(i + n, input.readUIntBE(i, n));
});

It's important to note here that node is not capable of reading more than 6 bytes from a buffer into an int, so if you need a bigger integer, or whatever, you have to handle it differently (we have a 96 bit flag value, for example).

Other than that, I have been able to get by with the primitives any and all, and the combinators. One other constructor that may be of use is a byte range constructor like byteRange([0x40, 0x80]), which also came up, but I handled a different way. It can also be handled by Parsimmon.test technically I think.

wavebeem commented 6 years ago

is the source for this available? Just curious what it looks like if it is.

And yeah, I sort of figured that Parsimmon.test would cover most of what you want? Or also .chain.

const hasHighBit = Parsimmon.test(b => b & 128);

I'm just wondering what level of support would make Parsimmon actually useful for binary stream parsing. But I've never done it before so I'm looking for advice.

theqabalist commented 6 years ago

Unfortunately the both the source and the stream protocol is protected IP, so I can't really share it in toto. I can probably share parts that are of interest. It is surprising how much the changes you make just seem to allow it to work. I technically could have also written the uintBE parser combinatorially as

const uintBE = n => any.times(n).map(Buffer.from).map(invoker('readUIntBE', n));

Or something really close to that. The way I ended up solving the range thing was with chain like

const dataFrame = any.chain(val => val >= 0x40 && val < 0x80 ? of({type: [data, null]}) : fail(`Unknown frame type ${val}`));

I think because binary is lower level and therefore more freeform, a lot of the "helpful" constructors for characters and strings don't really make sense. You really are either expecting any byte, an exact byte, or a byte that conforms to some condition, which seem to be pretty much provided. If you need a binary parser to parse something that's not byte aligned, that's a more difficult problem because, at least in node, you have to track your consumption inside a byte as well as on the stream as a whole.

theqabalist commented 6 years ago

So turns out, there is an area of the stream that is not byte aligned, and I cannot figure out the best way to approach constructing a parser for it. It's GPS encoded in 9 bytes of the following:

27 bits - latitude 28 bits - longitude 6 bits - heading 4 bits - hdop 3 bits - satellite count 2 bits - hemisphere 2 bits - quality

coming up with an abstraction on top of buffer that would allow for consuming in a bit oriented mode would be truly helpful compared to just consuming at the byte level as we previously discussed. This is not a straightforward problem to solve I don't think, because, especially in the last byte, there are 3 items of interest, so you cannot consume the byte until the third item is consumed.

wavebeem commented 6 years ago

Yeah that is... certainly a lot more complicated to work with. Does Node even have any primitives for dealing with anything at the bit level?

It looks like the bits cross over byte boundaries in this data structure? Which is definitely not gonna fit with my example implementation of byte-oriented parsing.

theqabalist commented 6 years ago

The only things it has that I am aware of are in Number where you can do bit wise ops, and toString in base 2, lol, which would increase the size 16x. However, I have an idea of how you could build a meaningful bitwise helper. I'll sketch out my approach and provide it here.

wavebeem commented 6 years ago

Cool thanks

On Fri, Feb 2, 2018, 15:47 Brandon Keown notifications@github.com wrote:

The only things it has that I am aware of are in Number where you can do bit wise ops, and toString in base 2, lol, which would increase the size 16x. However, I have an idea of how you could build a meaningful bitwise helper. I'll sketch out my approach and provide it here.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/jneen/parsimmon/issues/224#issuecomment-362743446, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKyr8jVseK_3-SAX05eHr3eEa4PiNnnks5tQ56TgaJpZM4R0KwP .

theqabalist commented 6 years ago

OK, so here's what I have.

/* eslint no-bitwise: off */
const lshiftBuffer = b => Buffer.from(b.reduce((a, v, i, b) => a.concat(i === b.length - 1 ? Buffer.from([v, 0]).readUInt16BE(0) : b.readUInt16BE(i)), []).map(x => (x << 1 & 0xFFFF) >> 8));
const bitPeekBuffer = b => b[0] >> 7;
const consumeBitsFromBuffer = (n, b) => new Array(n).fill(0).reduce(({v, buf}) => ({
    v: v << 1 | bitPeekBuffer(buf),
    buf: lshiftBuffer(buf)
}), {v: 0, buf: b});

const realignBuffer = (alignments, b) => alignments.reduce(({coll, buf}, bits) => {
    const {v, buf: newBuf} = consumeBitsFromBuffer(bits, buf);
    return {
        coll: coll.concat(v),
        buf: newBuf
    };
}, {coll: [], buf: b}).coll;

//made up GPS data based on random numbers within the addressable ranges
console.log(realignBuffer([27, 28, 6, 4, 3, 2, 2], Buffer.from([0b11011110, 0b10010110, 0b10011001, 0b11011101, 0b00011110, 0b01111010, 0b00101010, 0b11011001, 0b00100010])));
//prints [ 116700366, 244268309, 27, 2, 2, 0, 2 ]

This is just concept code, you could rewrite it to integrate it, but the idea is as follows.

Even if a string of bytes contains non-byte aligned data, chances are that at some point there is a byte boundary, such as in my case with the 9 bytes. Therefore, the parsing primitive would be the realign([x, y, z]) where the sum of the input must be divisible by 8, and that's how many bytes are consumed and then reinterpreted with different bit boundaries. It is true that this isn't quite as nicely granular as the concept of just bits(n) but I think it strikes a balance between a byte oriented buffer and bit-oriented data.

To actually build this into a friendly primitive I think requires an exception if the sum of the alignments does not divide by 8, and if any of the number of bits requested is greater than 48 because 6 bytes is the largest value you can read from readUIntBE, and I think the largest representable integer in node.

wavebeem commented 6 years ago

I'm a little confused by the snippet. Is the idea to have a combinator like seq which consumes 0 to 6 bytes, but allocates certain bits to the results?

theqabalist commented 6 years ago

Sort of yes. The problem arises because of tracking bit offset in non-byte aligned data. If I have 3 bytes, but I want to read two 12-bit numbers that are packed into that 3 byte range, you have to jump through hoops as far as when to actually consume a byte and move the parser forward. The problem is amplified by more divisions in the byte, like if I wanted to read 8 3-bit numbers from the same 3 bytes.

So instead of trying to track bit offset, we establish a tradeoff which basically says, ok, you're going to read some number of bytes and reinterpret them as a different bit length unsigned integers. So it shares some structural similarities to seq except that it's more restrictive. For example, you cannot realign([5,10]) because that is 15 bits, and does not consume n complete bytes. However, the 0-6 byte limitation is on the individual bit-wise interpretation. Because the values are piped from one integer into another, you can consume 48 bits in a row before you run afoul of the javascript Number ability to represent an integer. So for example realign([49, 7]) is dangerous because it will overflow the Number.

In this way realign is more of a constructor than a combinator, because it sets up the parser to consume n bytes. It also does transformation internally, so it's not quite as flexible as a traditional constructor. This is part of the tradeoff for dealing with non-byte aligned data.

Does that make sense? It might be used like

const gpsParser = realign([27, 28, 6, 4, 3, 2, 2]).map(([lat, lon, heading, hdop, satellites, hemisphere, quality]) => ({
  lat,
  lon,
  heading,
  hdop,
  satellites,
  hemisphere,
  quality
})

in a more contrive scenario

const packedBitsParser = realign(new Array(8).fill(3)).map(reduce(add));
packedBitsParser.parse(Buffer.from([0xFF, 0xFF, 0xFF]));
//prints 56
theqabalist commented 6 years ago

A specialization of this constructor may also be useful, called flags which consumes a single byte but produces an array of 8 booleans.

const flags = realign(new Array(8).fill(1)).map(map(Boolean))
wavebeem commented 6 years ago

Oh I see. That makes sense though using .map on a big sequence like that makes it hard to tell which bits line up with which outputs.

On Mon, Feb 5, 2018, 05:52 Brandon Keown notifications@github.com wrote:

A specialization of this constructor may also be useful, called flags which consumes a single byte but produces an array of 8 booleans.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/jneen/parsimmon/issues/224#issuecomment-363090309, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKyr-keoSFrAe0fpH1N6cOCOy1o5vqXks5tRweFgaJpZM4R0KwP .

theqabalist commented 6 years ago

It's probably not unreasonable to also call it something like

labeledBits([
  ['lat', 27],
  ['lon', 28],
  ['heading', 6],
  ['hdop', 4],
  ['satellites', 3],
  ['hemisphere', 2],
  ['quality', 2]
])

And have it produce the object as an output. It's a bit more hand-holding than your traditional constructor, but not without its merits I don't think.

I think both may be useful in terms of flexibility, and you can implement labeledBits using realign fairly easily.

const labeledBits = table => realign(table.map(last)).map(zipObject(table.map(head)));
wavebeem commented 6 years ago

That would be pretty much exactly like seqObj, which would be a good fit, I think 😀

On Tue, Feb 6, 2018 at 6:23 AM Brandon Keown notifications@github.com wrote:

It's probably not unreasonable to also call it something like

labeledBits([ ['lat', 27], ['lon', 28], ['heading', 6], ['hdop', 4], ['satellites', 3], ['hemisphere', 2], ['quality', 2] ])

And have it produce the object as an output. It's a bit more hand-holding than your traditional constructor, but not without its merits I don't think.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/jneen/parsimmon/issues/224#issuecomment-363436961, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKyr0tlbgrMoIMtc_-HwQl6BWF6wxiCks5tSGB0gaJpZM4R0KwP .

theqabalist commented 6 years ago

They do look very similar. Could be bitSeq and bitSeqObj.

theqabalist commented 6 years ago

Would you like me to work on adding these to the buffer branch, or did you want to add them, or what are your feelings on that? I've looked over the code and noticed that you have no dependencies and the code style seems to be geared toward acceptability by like ES3 or 5. My style of programming, as may be evidenced, is heavily ES6+ oriented, and our discussion about Buffer oriented parsers presupposes node.

What do you think?

wavebeem commented 6 years ago

Parsimmon supports IE7 and newer browsers, along with Node.js.

Parsimmon needs to support IE7 as per the contract in the README, and I have not set up the project to use a compiler such as Babel.

If you wanted to add on top of my branch that would be fantastic. I would be happy to provide feedback as you work or in the PR you make.

I would also prefer to use ES6, and frankly drop IE7 support at some point (I haven't even been doing manual browser testing on Parsimmon), but I think adding Babel to the project would be better done as a separate task.

And yeah, I added a function to test for the existence of Buffer because this needs to work in browsers.

On Tue, Feb 6, 2018 at 12:00 PM Brandon Keown notifications@github.com wrote:

Would you like me to work on adding these to the buffer branch, or did you want to add them, or what are your feelings on that? I've looked over the code and noticed that you have no dependencies and the code style seems to be geared toward acceptability by like ES3 or 5. My style of programming, as may be evidenced, is heavily ES6+ oriented, and our discussion about Buffer oriented parsers presupposes node.

What do you think?

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/jneen/parsimmon/issues/224#issuecomment-363546918, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKyrxVUqoiuuOVdQp3K4-AUw38pKD7qks5tSK9tgaJpZM4R0KwP .

theqabalist commented 6 years ago

Sorry I have let this issue languish. We ended up focusing on something that is largely ASCII parsing, so I didn't have a chance to work on this as part of my job. However, I am going to start work on it very soon, and should have a PR sometime next week (since we have to go back and do some non-byte aligned parsing now.)